Working with chemical data

The chem package provides utilities for working with chemical data such as isotopes, elements and formulas. It can also generate formulas from an exact mass, score isotopic envelopes and search for isotopic envelope candidates in a list of m/z values.

Elements and isotopes

PeriodicTable contains element and isotope information retrieved from the Blue Obelisk Data Repository. The get_element() method returns an Element instance:

from tidyms2.chem import PeriodicTable
ptable = PeriodicTable()
oxygen = ptable.get_element("O")
print(oxygen)
# Element(O, z=8)

An Element contains basic element and isotope information:

print(oxygen.z)
# 8
print(oxygen.name)
# "Oxygen"
print(oxygen.isotopes)
# (Isotope(z=8, a=16, symbol='O'), Isotope(z=8, a=17, symbol='O'), Isotope(z=8, a=18, symbol='O'))
print(oxygen.monoisotope)
# Isotope(z=8, a=16, symbol='O')
print(oxygen.get_abundances())
# ([16, 17, 18], [15.9949, 16.9991, 17.9991], [0.9976, 0.0004, 0.0020])
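
As a small illustration of how these values can be used, the sketch below computes the abundance-weighted average mass of oxygen in plain Python. It assumes that the three sequences printed above are, in order, the nominal masses, exact masses and abundances of the isotopes; the numbers are copied from that output rather than read from the library.

```python
# Values copied from the get_abundances() output shown above for oxygen.
# Assumption: the sequences are nominal mass, exact mass and abundance.
nominal = [16, 17, 18]
exact = [15.9949, 16.9991, 17.9991]
abundance = [0.9976, 0.0004, 0.0020]

# Abundance-weighted mean of the isotope masses.
average_mass = sum(m * p for m, p in zip(exact, abundance))

# Nominal mass of the most abundant isotope.
most_abundant = nominal[abundance.index(max(abundance))]

print(round(average_mass, 3))
# 15.999
print(most_abundant)
# 16
```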

An Isotope stores the exact mass, nominal mass and abundance of each isotope:

o16 = oxygen.monoisotope
print(o16)
# z=8 a=16 symbol='O'

Chemical formulas

The Formula class represents chemical formulas. A formula can be created from a string:

from tidyms2.chem import Formula
water = Formula("H2O")
print(water)
# H2O

Formula objects can be used to compute the exact mass of a formula and its isotopic envelope:

print(water.get_exact_mass())
# 18.010564684
envelope = water.get_isotopic_envelope()
print(envelope.mz)
# [18.01056468, 19.01555724, 20.01481138, 21.02108788]
print(envelope.p)
# [9.97340572e-01, 6.09327319e-04, 2.04962911e-03, 4.71450803e-07]
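
To give an idea of where the envelope abundances come from, here is a minimal sketch that convolves per-element isotope abundances indexed by nominal mass offset. It is not the tidyms2 implementation; the helper names (convolve, envelope_abundances) are hypothetical, and the abundance table holds commonly tabulated natural abundances for H and O entered by hand.

```python
# Natural isotopic abundances indexed by nominal mass offset
# (illustrative values typed by hand, not read from tidyms2).
ABUNDANCES = {
    "H": [0.999885, 0.000115],        # 1H, 2H
    "O": [0.99757, 0.00038, 0.00205],  # 16O, 17O, 18O
}


def convolve(p, q):
    """Discrete convolution of two abundance lists."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            out[i + j] += pi * qj
    return out


def envelope_abundances(composition, length):
    """Abundance of the first `length` nominal-mass peaks of a formula."""
    env = [1.0]
    for symbol, n_atoms in composition.items():
        for _ in range(n_atoms):
            env = convolve(env, ABUNDANCES[symbol])
    return env[:length]


water_p = envelope_abundances({"H": 2, "O": 1}, 4)
print(water_p)
```

With these abundance values, the result should agree closely with the envelope.p values printed above. The sketch only tracks abundances; the m/z value of each peak would require an abundance-weighted mass for every nominal-mass group.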

Formulas can also be created by passing a dictionary that maps elements or isotopes to formula coefficients, together with the numerical charge of the formula. Formulas are implemented as dictionaries that map isotopes to formula coefficients, so an element symbol is interpreted as the most abundant isotope of that element.

f = Formula({"C": 1, "13C": 1, "O": 4}, 0)
print(f)
# C(13C)O4
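
The element-to-isotope convention described above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the lookup tables and the normalize and exact_mass helpers are hypothetical, the isotope masses are rounded values entered by hand, and the charge (electron mass correction) is ignored.

```python
# Isotope masses in unified atomic mass units, rounded (illustrative subset).
ISOTOPE_MASS = {
    "1H": 1.00782503,
    "12C": 12.0,
    "13C": 13.00335484,
    "16O": 15.99491462,
}
# Most abundant isotope of each element in this subset.
MOST_ABUNDANT = {"H": "1H", "C": "12C", "O": "16O"}


def normalize(composition):
    """Replace element symbols by their most abundant isotope."""
    normalized = {}
    for key, coeff in composition.items():
        isotope = MOST_ABUNDANT.get(key, key)  # isotopes are kept as-is
        normalized[isotope] = normalized.get(isotope, 0) + coeff
    return normalized


def exact_mass(composition):
    """Sum of isotope masses weighted by formula coefficients (neutral)."""
    return sum(ISOTOPE_MASS[iso] * n for iso, n in normalize(composition).items())


print(normalize({"C": 1, "13C": 1, "O": 4}))
# {'12C': 1, '13C': 1, '16O': 4}
print(round(exact_mass({"H": 2, "O": 1}), 6))
# 18.010565
```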

Isotopes can also be specified in the string format:

f = Formula("[C(13C)2H2O4]2-")
print(f)
# [C(13C)2H2O4]2-
print(f.charge)
# -2
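
For readers curious about the string format, the following sketch parses formula strings of the shape shown above with regular expressions. It is a hypothetical, simplified parser written for this example, not the one used by Formula; it assumes isotopes are written as (13C), an optional coefficient follows each symbol, and a charged formula is wrapped in square brackets followed by the charge.

```python
import re

# "[...]2-" style wrapper: body, optional magnitude and sign.
_CHARGED = re.compile(r"^\[(?P<body>.+)\](?P<n>\d*)(?P<sign>[+-])$")
# Either "(13C)2" (isotope) or "C2" (element), coefficient optional.
_TOKEN = re.compile(
    r"\((?P<iso>\d+[A-Z][a-z]?)\)(?P<iso_n>\d*)|(?P<el>[A-Z][a-z]?)(?P<el_n>\d*)"
)


def parse_formula(text):
    """Return (composition, charge) for a formula string."""
    charge = 0
    match = _CHARGED.match(text)
    if match:
        magnitude = int(match["n"] or 1)
        charge = magnitude if match["sign"] == "+" else -magnitude
        text = match["body"]

    composition = {}
    pos = 0
    for token in _TOKEN.finditer(text):
        if token.start() != pos:  # unparsed characters between tokens
            raise ValueError(f"Invalid formula near position {pos}: {text!r}")
        pos = token.end()
        key = token["iso"] or token["el"]
        count = int(token["iso_n"] or token["el_n"] or 1)
        composition[key] = composition.get(key, 0) + count
    if pos != len(text):
        raise ValueError(f"Invalid formula near position {pos}: {text!r}")
    return composition, charge


print(parse_formula("[C(13C)2H2O4]2-"))
# ({'C': 1, '13C': 2, 'H': 2, 'O': 4}, -2)
```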

Sum formula generation

The FormulaGenerator generates sum formulas from a mass value. To generate formulas, the formula space must be defined by passing a dictionary of elements and their associated minimum and maximum allowed formula coefficients:

from tidyms2.chem import Formula, FormulaGenerator, FormulaGeneratorConfiguration

config = FormulaGeneratorConfiguration(
    bounds={"C": (0, 20), "H": (0, 40), "O": (0, 10), "N": (0, 5)},
    max_M=1000.0,
)
formula_generator = FormulaGenerator(config)
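
To get a feeling for the size of the formula space defined by these bounds, the short sketch below counts the coefficient combinations they allow. It uses only the standard library and ignores the max_M limit, so it gives an upper bound on the number of candidate formulas.

```python
from math import prod

# Same bounds as in the configuration above.
bounds = {"C": (0, 20), "H": (0, 40), "O": (0, 10), "N": (0, 5)}

# Each element contributes (upper - lower + 1) possible coefficients.
n_candidates = prod(upper - lower + 1 for lower, upper in bounds.values())
print(n_candidates)
# 56826
```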

To generate formulas, an exact mass value must be passed, along with a tolerance used to find compatible formulas. In the following example, an exact mass value is first computed from a formula and then compatible formulas are generated:

f = Formula("C5H10O2")
M = f.get_exact_mass()
tolerance = 0.005
formula_generator.generate_formulas(M, tolerance)
coefficients, isotopes, M_coeff = formula_generator.results_to_array()

print(coefficients)
# [[ 0 10  2  4]
#  [ 3  8  3  1]
#  [ 5 10  0  2]]
print(isotopes)
# [
#   Isotope(z=6, a=12, symbol='C'),
#   Isotope(z=1, a=1, symbol='H'),
#   Isotope(z=7, a=14, symbol='N'),
#   Isotope(z=8, a=16, symbol='O')
# ]

coefficients is a 2D NumPy array in which each row contains the coefficients of one matching formula. The isotope associated with each coefficient column is stored in isotopes. Finally, M_coeff is a NumPy array that stores the exact mass of each matching formula.
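
The idea behind formula generation, and the row-per-formula layout of the results, can be illustrated with a brute-force search in plain Python. This is a deliberately naive sketch and not the algorithm used by FormulaGenerator: it enumerates every coefficient combination inside the bounds, keeps those within the mass tolerance, and formats each coefficient row as a string. The helper names and the rounded monoisotopic masses are assumptions made for this example.

```python
from itertools import product

# Rounded monoisotopic masses (illustrative values typed by hand).
MASS = {"C": 12.0, "H": 1.00782503, "N": 14.00307400, "O": 15.99491462}
BOUNDS = {"C": (0, 20), "H": (0, 40), "N": (0, 5), "O": (0, 10)}


def brute_force(target, tolerance):
    """Return the symbols and all (coefficients, mass) pairs within tolerance."""
    symbols = list(BOUNDS)
    ranges = [range(lower, upper + 1) for lower, upper in BOUNDS.values()]
    matches = []
    for coeff in product(*ranges):
        mass = sum(n * MASS[s] for s, n in zip(symbols, coeff))
        if abs(mass - target) <= tolerance:
            matches.append((coeff, mass))
    return symbols, matches


def to_string(symbols, coeff):
    """Format one coefficient row, e.g. (5, 10, 0, 2) -> 'C5H10O2'."""
    return "".join(
        f"{s}{n if n > 1 else ''}" for s, n in zip(symbols, coeff) if n > 0
    )


# Exact mass of C5H10O2, used as the query mass.
target = 5 * MASS["C"] + 10 * MASS["H"] + 2 * MASS["O"]
symbols, matches = brute_force(target, 0.005)
for coeff, mass in matches:
    print(to_string(symbols, coeff), round(mass, 4))
```

Exhaustive enumeration is only practical for small formula spaces like this one, which is why bounding the coefficients matters.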

The from_chnops() method provides a simple way to create pre-configured formula generators:

config = FormulaGeneratorConfiguration.from_chnops(1000)
formula_generator = FormulaGenerator(config)

This method generates a formula space for the CHNOPS elements by finding the maximum formula coefficients of molecules in the Human Metabolome Database (HMDB). Precomputed formula bounds are available for molecules with maximum mass values of 500, 1000, 1500 and 2000. Other elements can be added using the update_bounds() method:

config = FormulaGeneratorConfiguration.from_chnops(1000)
config.update_bounds({"Cl": (0, 2)})
formula_generator = FormulaGenerator(config)

Scoring isotopic envelopes

Scoring measured envelopes against theoretical values is a common strategy for proposing formula candidates for an unknown compound. The EnvelopeScorer ranks compatible formulas by their similarity to the measured envelope. Because the envelope scorer uses a formula generator to find compatible formulas, the formula bounds must be provided. As with the formula generator, the tidyms2.chem.EnvelopeScorerConfiguration.from_chnops() method provides an easy way to create a configuration:

from tidyms2.chem import EnvelopeScorer, EnvelopeScorerConfiguration, Formula

config = EnvelopeScorerConfiguration.from_chnops(500, max_length=5)
envelope_scorer = EnvelopeScorer(config)

The max_length parameter sets the maximum length of the measured envelopes that are compared against theoretical values. The score() method takes the m/z values and abundances of the measured isotopic envelope, together with a mass tolerance, and scores the envelope against all compatible formulas. The results are retrieved with the get_top_results() method. The following example uses the envelope of a known formula and scores compatible formulas against it:

f = Formula("C5H10O2")
envelope = f.get_isotopic_envelope(5)
mass_tolerance = 0.005

envelope_scorer.score(envelope.mz, envelope.p, mass_tolerance)

coeff, isotopes, score = envelope_scorer.get_top_results(10)
print(coeff)
# [[ 5 10  0  2  0  0]
#  [ 3  8  3  1  0  0]
#  [ 1 13  1  2  1  0]
#  [ 0 10  2  4  0  0]
#  [ 2 16  0  0  2  0]
#  [ 2 14  0  2  0  1]
#  [ 2 15  0  0  1  1]
#  [ 1  6  6  0  0  0]
#  [ 0 12  3  1  0  1]]
print(isotopes)
# [
#   Isotope(z=6, a=12, symbol='C'),
#   Isotope(z=1, a=1, symbol='H'),
#   Isotope(z=7, a=14, symbol='N'),
#   Isotope(z=8, a=16, symbol='O'),
#   Isotope(z=15, a=31, symbol='P'),
#   Isotope(z=16, a=32, symbol='S')
# ]
print(score)
# [1.000 0.356 0.140  0.0568 0.0376 0.006 0.004 0.004 0.002]

By default, the tidyms2.chem.score_envelope() function is used to compute the scores. The parameters used by this function can be modified by passing them to the envelope scorer constructor as keyword arguments. It is also possible to use a custom scorer. Refer to the EnvelopeScorer API docs for details.
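
As a rough illustration of what ranking candidates by envelope similarity means, the sketch below scores three candidate formulas against a measured envelope using cosine similarity. This is a swapped-in, simplified measure chosen for this example, not the method implemented by score_envelope(); the helper names and the hand-typed natural abundances are assumptions, and mass information is ignored entirely.

```python
from math import sqrt

# Natural abundances indexed by nominal mass offset (typed by hand).
ABUNDANCE = {
    "C": [0.9893, 0.0107],
    "H": [0.999885, 0.000115],
    "N": [0.99636, 0.00364],
    "O": [0.99757, 0.00038, 0.00205],
}


def convolve(p, q):
    """Discrete convolution of two abundance lists."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            out[i + j] += pi * qj
    return out


def envelope(composition, length):
    """Abundances of the first `length` nominal-mass peaks."""
    env = [1.0]
    for symbol, n_atoms in composition.items():
        for _ in range(n_atoms):
            # Truncating early is safe: higher peaks never feed lower ones.
            env = convolve(env, ABUNDANCE[symbol])[:length]
    return env + [0.0] * (length - len(env))


def cosine(u, v):
    """Cosine similarity between two abundance vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))


candidates = {
    "C5H10O2": {"C": 5, "H": 10, "O": 2},
    "C3H8N3O": {"C": 3, "H": 8, "N": 3, "O": 1},
    "H10N2O4": {"H": 10, "N": 2, "O": 4},
}
# Use the theoretical envelope of C5H10O2 as the "measured" envelope.
measured = envelope(candidates["C5H10O2"], 5)
ranking = sorted(
    ((cosine(measured, envelope(comp, 5)), name) for name, comp in candidates.items()),
    reverse=True,
)
for score, name in ranking:
    print(name, score)
```

Because the monoisotopic peak dominates every envelope, all cosine scores are close to 1, so this measure only conveys the ranking idea and does not produce scores comparable to those shown above.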