Chemistry#

Utilities for working with chemical entities.

Refer to the user guide for an introduction and examples on how to use this package.

Constants#

  • EM : the electron mass

class tidyms2.chem.ChemicalContext(config)#

Centralize chemical information used in the chem package.

pydantic model tidyms2.chem.Element#

Store element information and its isotopes.

field isotopes: Sequence[Isotope] [Required]#

Maps mass numbers to isotopes

field name: str [Required]#

The element name

field symbol: str [Required]#

The element symbol

field z: pydantic.PositiveInt [Required]#

The atomic number

property mmi: Isotope#

Return the isotope with the lowest atomic mass.

property monoisotope: Isotope#

Return the most abundant isotope.
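The two properties differ for elements whose lightest isotope is not the most abundant one. The following is a minimal sketch that uses plain tuples instead of the library's Element and Isotope models; the iron masses and abundances are approximate values included for illustration only.

```python
# Each isotope is a (mass number, exact mass, abundance) tuple.
# Approximate values for iron, used only for illustration.
iron_isotopes = [
    (54, 53.9396, 0.05845),
    (56, 55.9349, 0.91754),
    (57, 56.9354, 0.02119),
    (58, 57.9333, 0.00282),
]

# Minimum mass isotope (MMI): the isotope with the lowest exact mass.
mmi = min(iron_isotopes, key=lambda isotope: isotope[1])

# Monoisotope: the most abundant isotope.
monoisotope = max(iron_isotopes, key=lambda isotope: isotope[2])

print(mmi[0], monoisotope[0])  # prints "54 56"
```

For elements such as carbon or hydrogen both properties point to the same isotope.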

class tidyms2.chem.EnvelopeScorer(config, scorer=None, **kwargs)#

Rank formula candidates by comparing measured and theoretical isotopic envelopes.

Refer to the user guide for detailed usage instructions.

Parameters:
  • config (EnvelopeScorerConfiguration) – the envelope generator configuration

  • scorer (Optional[Callable]) –

    function that scores formula candidate envelopes. If None, the function score_envelope() is used. A custom scoring function can be passed with the following signature:

    def score(M, p, Mq, pq, **kwargs):
        pass
    

    where M and p are arrays with the exact masses and abundances of the formula candidates, and Mq and pq are the query masses and query abundances.

  • kwargs – optional parameters passed to the scoring function.
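As an illustration of the custom scorer signature, the sketch below scores candidates using only abundance differences. It assumes that M and p hold one row per formula candidate, that Mq and pq are flat sequences, and that one score per candidate is returned. These shapes, the function name abundance_score and the scale keyword are assumptions made for this example and are not taken from the library.

```python
def abundance_score(M, p, Mq, pq, scale=0.05, **kwargs):
    """Score each candidate from its mean absolute abundance error.

    Hypothetical scorer: mass values are ignored on purpose.
    """
    scores = []
    for candidate_p in p:
        # Compare only the peaks present in the query envelope.
        errors = [abs(a - b) for a, b in zip(candidate_p, pq)]
        mean_error = sum(errors) / len(errors)
        # Map the error to the [0, 1] range, where 1 is a perfect match.
        scores.append(max(0.0, 1.0 - mean_error / scale))
    return scores


# Two candidates with identical masses but different abundances.
M = [[78.0470, 79.0503], [78.0470, 79.0503]]
p = [[0.94, 0.06], [0.80, 0.20]]
Mq = [78.0471, 79.0505]
pq = [0.94, 0.06]
scores = abundance_score(M, p, Mq, pq)
```

Following the parameter description above, a function like this would be passed as the scorer argument, with scale supplied through the extra keyword arguments.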

filter(min_M_tol, max_M_tol, p_tol, k)#

Filter values from the k-th envelope that are outside the specified bounds.

Parameters:
  • min_M_tol (float) – mass values lower than this value are filtered.

  • max_M_tol (float) – mass values greater than this value are filtered.

  • p_tol (float) – abundance tolerance

  • k (int) – the envelope index to filter values.

Notes#

Envelopes are filtered using an abundance-dependent mass tolerance. For the i-th peak, the tolerance is defined as follows:

\[t_{i} = t + (T - t)(1 - y_{i})\]

where \(t_{i}\) is the mass tolerance for the i-th peak, t is min_M_tol, T is max_M_tol and \(y_{i}\) is the abundance of the i-th peak. Using this tolerance, an interval is built around the query mass, and candidates outside this interval are removed. This approach allows greater m/z errors for lower intensity peaks in the envelope.
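The tolerance interpolation can be sketched in a few lines. The helper names peak_tolerance and filter_candidates are hypothetical and only mirror the formula above; they are not part of the package.

```python
def peak_tolerance(min_M_tol, max_M_tol, abundance):
    """Interpolate the mass tolerance of a peak from its abundance."""
    return min_M_tol + (max_M_tol - min_M_tol) * (1.0 - abundance)


def filter_candidates(candidate_masses, query_mass, tolerance):
    """Keep the candidate masses inside the interval around the query mass."""
    lower, upper = query_mass - tolerance, query_mass + tolerance
    return [m for m in candidate_masses if lower <= m <= upper]


# A high abundance peak gets a tolerance close to the minimum...
tol_high = peak_tolerance(0.005, 0.01, abundance=0.9)
# ...and a low abundance peak gets a tolerance close to the maximum.
tol_low = peak_tolerance(0.005, 0.01, abundance=0.1)

candidates = [79.0420, 79.0503, 79.0620]
kept_high = filter_candidates(candidates, 79.0500, tol_high)
kept_low = filter_candidates(candidates, 79.0500, tol_low)
```

With these numbers the wider tolerance of the low abundance peak keeps one more candidate than the tolerance of the high abundance peak.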

generate_envelopes(M, p, tolerance)#

Compute isotopic envelopes for formula candidates generated using the MMI mass from the envelope.

Parameters:
  • M (Sequence[float]) – the exact mass of the envelope

  • p (Sequence[float]) – the envelope normalized abundances.

  • tolerance (float) – mass tolerance to generate formulas.

get_top_results(n=10)#

Fetch the top-ranked formula candidates and their scores.

Parameters:

n (int | None) – number of top-ranked results to return. If None, return all formula candidates.

score(M, p, tol)#

Score the isotopic envelope.

The results can be recovered using the get_top_results method.

Formulas are generated assuming that the first element in the envelope is the minimum mass isotopologue.

Parameters:
  • M (Sequence[float]) – exact mass of the envelope.

  • p (Sequence[float]) – abundance of the envelope.

  • tol (float) – mass tolerance used in formula generation.

pydantic model tidyms2.chem.EnvelopeScorerConfiguration#

Store the envelope scorer configuration.

field bounds: dict[str, tuple[pydantic.NonNegativeInt, pydantic.NonNegativeInt]] [Required]#

A dictionary that maps element (eg: "C") or isotopes (eg: "13C") symbols to minimum and maximum values of formula coefficients in generated formulas. If element symbols are provided, the most abundant isotope is used for formula generation.

field context: ChemicalContextConfiguration | None = None#

The chemical context configuration. If set to None uses the default chemical context.

field max_M: pydantic.PositiveFloat [Required]#

Maximum mass value for generated formulas.

field max_length: Annotated[int] = 5#

The length of the generated envelopes.

classmethod from_chnops(m, **kwargs)#

Create a new instance with predefined bounds for CHNOPS elements.

CHNOPS bounds were computed by finding the minimum and maximum coefficient bounds for all molecules in the HMDB under a specific mass threshold. This function offers precomputed bounds using all molecules with mass values under 500, 1000, 1500 and 2000.

Parameters:
  • m (int) – maximum mass of molecules used to build bounds. Valid values are 500, 1000, 1500 or 2000.

  • kwargs – extra arguments passed to the constructor.

Return type:

Self

update_bounds(bounds)#

Update or add new bounds.

Return type:

None

class tidyms2.chem.EnvelopeValidator(config)#

Envelope validator.

Notes#

Envelope validation is performed as follows:

  1. For a query envelope with mass Mq and abundance pq, all formulas compatible with the MMI are computed (see FormulaGenerator).

  2. For each i-th pair of Mq and pq, a mass tolerance is defined as follows:

    \[dM_{i} = dM^{\textrm{min}} pq_{i} + dM^{\textrm{max}} (1 - pq_{i})\]

    where \(dM^{\textrm{min}}\) is min_M_tol, \(dM^{\textrm{max}}\) is max_M_tol and \(pq_{i}\) is the i-th query abundance. Candidates whose mass falls outside this tolerance, or whose abundance differs from \(pq_{i}\) by more than the abundance tolerance (p_tol), are removed.

  3. The candidates that remain define a mass and abundance window for the i + 1 elements of Mq and pq. If the values fall inside the window, the i + 1 elements are validated and the procedure is repeated until all isotopologues are validated or until an invalid isotopologue is found.
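The steps above can be sketched as a loop over the query peaks. This is a simplified reading of the procedure, not the library's implementation: candidate envelopes are passed in directly instead of being generated from the MMI, abundances are not renormalized, and the function name validate_envelope is hypothetical.

```python
def validate_envelope(M, p, Mq, pq, min_M_tol, max_M_tol, p_tol):
    """Return the number of leading query peaks that pass validation.

    M and p hold one row of theoretical masses and abundances per candidate.
    """
    candidates = list(range(len(M)))
    n_valid = 0
    for i in range(len(Mq)):
        # Abundant peaks get a tolerance close to min_M_tol.
        dM = min_M_tol * pq[i] + max_M_tol * (1.0 - pq[i])
        # The remaining candidates define the window for the i-th peak.
        masses = [M[k][i] for k in candidates]
        abundances = [p[k][i] for k in candidates]
        mass_ok = min(masses) - dM <= Mq[i] <= max(masses) + dM
        abundance_ok = min(abundances) - p_tol <= pq[i] <= max(abundances) + p_tol
        if not (mass_ok and abundance_ok):
            break
        n_valid += 1
        # Drop the candidates that are incompatible with the validated peak.
        candidates = [
            k for k in candidates
            if abs(M[k][i] - Mq[i]) <= dM and abs(p[k][i] - pq[i]) <= p_tol
        ]
        if not candidates:
            break
    return n_valid


M = [[78.0470, 79.0503, 80.0537], [78.0470, 79.0440, 80.0500]]
p = [[0.937, 0.061, 0.002], [0.90, 0.09, 0.01]]
pq = [0.937, 0.061, 0.002]
# The third query peak is far from every candidate, so validation stops there.
n_bad = validate_envelope(M, p, [78.0471, 79.0505, 80.0900], pq, 0.005, 0.01, 0.05)
n_good = validate_envelope(M, p, [78.0471, 79.0505, 80.0535], pq, 0.005, 0.01, 0.05)
```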

filter(min_M_tol, max_M_tol, p_tol, k)#

Filter values from the k-th envelope that are outside the specified bounds.

Parameters:
  • min_M_tol (float) – mass values lower than this value are filtered.

  • max_M_tol (float) – mass values greater than this value are filtered.

  • p_tol (float) – abundance tolerance

  • k (int) – the envelope index to filter values.

Notes#

Envelopes are filtered using an abundance-dependent mass tolerance. For the i-th peak, the tolerance is defined as follows:

\[t_{i} = t + (T - t)(1 - y_{i})\]

where \(t_{i}\) is the mass tolerance for the i-th peak, t is min_M_tol, T is max_M_tol and \(y_{i}\) is the abundance of the i-th peak. Using this tolerance, an interval is built around the query mass, and candidates outside this interval are removed. This approach allows greater m/z errors for lower intensity peaks in the envelope.

generate_envelopes(M, p, tolerance)#

Compute isotopic envelopes for formula candidates generated using the MMI mass from the envelope.

Parameters:
  • M (Sequence[float]) – the exact mass of the envelope

  • p (Sequence[float]) – the envelope normalized abundances.

  • tolerance (float) – mass tolerance to generate formulas.

pydantic model tidyms2.chem.EnvelopeValidatorConfiguration#

Store the envelope validator configuration.

field bounds: dict[str, tuple[pydantic.NonNegativeInt, pydantic.NonNegativeInt]] [Required]#

A dictionary that maps element (eg: "C") or isotopes (eg: "13C") symbols to minimum and maximum values of formula coefficients in generated formulas. If element symbols are provided, the most abundant isotope is used for formula generation.

field context: ChemicalContextConfiguration | None = None#

The chemical context configuration. If set to None uses the default chemical context.

field max_M: pydantic.PositiveFloat [Required]#

Maximum mass value for generated formulas.

field max_M_tol: Annotated[float] = 0.01#

Exact mass tolerance for low abundance isotopologues. If None, the parameter is set based on the mode value. See the notes for an explanation of how this value is used.

field max_length: pydantic.PositiveInt = 5#

The length of the generated envelopes.

field min_M_tol: Annotated[float] = 0.01#

Exact mass tolerance for high abundance isotopologues. If None, the parameter is set based on the mode value. See the notes for an explanation of how this value is used.

field p_tol: Annotated[float] = 0.05#

Tolerance threshold to include in the abundance results

classmethod from_chnops(m, **kwargs)#

Create a new instance with predefined bounds for CHNOPS elements.

CHNOPS bounds were computed by finding the minimum and maximum coefficient bounds for all molecules in the HMDB under a specific mass threshold. This function offers precomputed bounds using all molecules with mass values under 500, 1000, 1500 and 2000.

Parameters:
  • m (int) – maximum mass of molecules used to build bounds. Valid values are 500, 1000, 1500 or 2000.

  • kwargs – extra arguments passed to the constructor.

Return type:

Self

remove_elements_with_a_single_isotope()#

Remove elements with a single isotope from the bounds.

Return type:

None

update_bounds(bounds)#

Update or add new bounds.

Return type:

None

class tidyms2.chem.Formula(formula: dict[Isotope, int], charge: int | None, context: None = None)#
class tidyms2.chem.Formula(formula: dict[str, int], charge: int | None, context: ChemicalContext | None = None)
class tidyms2.chem.Formula(formula: str, charge: None = None, context: ChemicalContext | None = None)

Represent a chemical formula as a mapping from isotopes to formula coefficients.

Refer to the user guide for usage instructions.

Parameters:
  • formula – a string representation of a formula or a mapping from isotopes to formula coefficients

  • charge – the numerical charge of the formula

  • context (ChemicalContext | None) – the chemical context from which data is fetched. In most cases this parameter can be ignored. It should be set only when working with custom isotopic abundances.

>>> from tidyms2.chem import Formula
>>> Formula("H2O")
Formula(H2O)
>>> Formula("(13C)O2")
Formula((13C)O2)
>>> Formula("[Cr(H2O)6]3+")
Formula([H12CrO6]3+)
>>> Formula("CH3CH2CH3")
Formula(C3H8)
>>> Formula({"C": 1, "17O": 2})
Formula(C(17O)2)

get_exact_mass()#

Compute the exact mass of the formula.

Return type:

float

Returns#

exact_mass: float

Examples#

>>> from tidyms2.chem import Formula
>>> f = Formula("H2O")
>>> f.get_exact_mass()
18.010564684

get_isotopic_envelope(n=10, min_p=1e-10)#

Compute the isotopic envelope of the formula.

The natural abundance is assumed for each monoisotope, i.e., the most abundant isotope. If other isotopes are present in the formula, they are assumed to have an abundance equal to 1. See the examples for a clarification of this.

Parameters:
  • n (int) – the number of isotopes to include in the results.

  • min_p (float) – isotopes are included until the abundance is lower than this value

Return type:

IsotopicEnvelope

If no isotopes are specified in the formula, the natural abundance is used.

>>> from tidyms2.chem import Formula
>>> f = Formula("C6H6")
>>> print(f)
C6H6
>>> f.get_isotopic_envelope(n=3)
(array([78.04695019, 79.05033578, 80.05373322]),
 array([0.93686877, 0.06144402, 0.00168606]))

Isotopes other than the monoisotope are treated as if they had an abundance equal to one.

>>> f = Formula("(13C)6(2H)6")
>>> print(f)
(13C)6(2H)6
>>> f.get_isotopic_envelope()
(array([90.10473971]), array([1.]))

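To illustrate where the abundances in the C6H6 example come from, the sketch below builds an envelope by repeatedly convolving per-atom isotope distributions. It groups isotopologues by nominal mass and uses approximate, hardcoded natural abundances; the helper names are hypothetical and the library may use a different algorithm.

```python
# Approximate natural abundances, indexed by the number of extra neutrons
# relative to the lightest isotope (illustrative values).
ABUNDANCE = {"C": [0.9893, 0.0107], "H": [0.999885, 0.000115]}


def convolve(a, b):
    """Discrete convolution of two abundance distributions."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out


def nominal_envelope(composition, n):
    """Return the first n abundances of the envelope, grouped by nominal mass."""
    envelope = [1.0]
    for symbol, count in composition.items():
        # Each atom contributes one convolution with its isotope distribution.
        for _ in range(count):
            envelope = convolve(envelope, ABUNDANCE[symbol])
    return envelope[:n]


benzene = nominal_envelope({"C": 6, "H": 6}, n=3)
```

The first two values are close to 0.937 and 0.061, in line with the abundances shown in the first example.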
get_nominal_mass()#

Compute the nominal mass of the formula.

Return type:

int

>>> from tidyms2.chem import Formula
>>> f = Formula("H2O")
>>> f.get_nominal_mass()
18

class tidyms2.chem.FormulaGenerator(config)#

Generate sum formulas based on exact mass values.

Refer to the user guide for usage instructions.
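The idea of generating formulas from a mass can be illustrated with a brute-force enumeration over the coefficient bounds. This sketch is only meant to show what a result looks like: the isotope masses are approximate hardcoded values, the function name is hypothetical, and no claim is made about the algorithm the generator actually uses.

```python
from itertools import product

# Approximate exact masses of the most abundant isotopes (illustrative values).
MASS = {"C": 12.0, "H": 1.00782503, "O": 15.99491462}


def brute_force_formulas(bounds, query_mass, tolerance):
    """Enumerate every coefficient combination and keep those matching the mass."""
    symbols = list(bounds)
    ranges = [range(lower, upper + 1) for lower, upper in bounds.values()]
    matches = []
    for coefficients in product(*ranges):
        mass = sum(n * MASS[s] for n, s in zip(coefficients, symbols))
        if abs(mass - query_mass) <= tolerance:
            matches.append(dict(zip(symbols, coefficients)))
    return matches


bounds = {"C": (0, 5), "H": (0, 10), "O": (0, 4)}
formulas = brute_force_formulas(bounds, 46.042, 0.005)
print(formulas)  # [{'C': 2, 'H': 6, 'O': 1}]
```

The number of combinations grows quickly with the number of elements and the width of the bounds, which is why tight bounds matter in practice.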

generate_formulas(M, tolerance, min_defect=None, max_defect=None)#

Compute formulas compatible with the given query mass.

The formulas are computed assuming neutral species. If charged species are used, mass values must be corrected using the electron mass.

Results are stored in an internal format, use results_to_array() to obtain the compatible formulas.

Parameters:
  • M (float) – query mass used for formula generation

  • tolerance (float) – mass tolerance to search compatible formulas

  • min_defect (float | None) – if provided, filter formulas with mass defects lower than this value

  • max_defect (float | None) – if provided, filter formulas with mass defects greater than this value

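>>> from tidyms2.chem import FormulaGenerator, FormulaGeneratorConfiguration
>>> fg_bounds = {"C": (0, 5), "H": (0, 10), "O": (0, 4)}
>>> config = FormulaGeneratorConfiguration(bounds=fg_bounds, max_M=100.0)
>>> fg = FormulaGenerator(config)
>>> fg.generate_formulas(46.042, 0.005)
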
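Because formulas are generated for neutral species, the m/z of an ion has to be converted before it is used as a query. The sketch below shows the electron mass correction only; the function name is hypothetical, the constant is an approximate value written out here instead of importing EM, and adducts such as protons would need a separate correction.

```python
# Electron mass in Da (approximate value); the package lists it as EM.
ELECTRON_MASS = 0.00054858


def neutral_mass(mz, charge):
    """Convert the m/z of an ion with a nonzero integer charge to a neutral mass.

    Only the mass of the removed or added electrons is corrected.
    """
    return mz * abs(charge) + charge * ELECTRON_MASS


# A cation lost one electron, so the neutral species is slightly heavier.
query_cation = neutral_mass(46.04132, charge=1)
# An anion gained one electron, so the neutral species is slightly lighter.
query_anion = neutral_mass(46.04242, charge=-1)
```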
get_n_results()#

Retrieve the number of formulas found after a query.

Return type:

int

get_results()#

Retrieve the formula generator results in the internal format.

Return type:

dict

Returns:

a mapping of nominal masses of the results to a tuple of three arrays: 1. the row index of positive coefficients. 2. the row index of negative coefficients. 3. the number of 12C in the formula.

results_to_array()#

Convert results to an array of coefficients.

Return type:

tuple[ndarray, list[Isotope], ndarray]

Returns:

A tuple containing a 2D array with rows of formula coefficients, a list of the isotopes associated with each coefficient and a 1D array with the exact mass of each formula.

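>>> from tidyms2.chem import FormulaGenerator, FormulaGeneratorConfiguration
>>> fg_bounds = {"C": (0, 5), "H": (0, 10), "O": (0, 4)}
>>> config = FormulaGeneratorConfiguration(bounds=fg_bounds, max_M=100.0)
>>> fg = FormulaGenerator(config)
>>> fg.generate_formulas(46.042, 0.005)
>>> coeff, isotopes, M = fg.results_to_array()
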
pydantic model tidyms2.chem.FormulaGeneratorConfiguration#

Store the formula generator parameters.

field bounds: dict[str, tuple[Annotated[int], Annotated[int]]] [Required]#

A dictionary that maps element (eg: "C") or isotopes (eg: "13C") symbols to minimum and maximum values of formula coefficients in generated formulas. If element symbols are provided, the most abundant isotope is used for formula generation.

field max_M: Annotated[float] [Required]#

Maximum mass value for generated formulas.

classmethod from_chnops(m, **kwargs)#

Create a new instance with predefined bounds for CHNOPS elements.

CHNOPS bounds were computed by finding the minimum and maximum coefficient bounds for all molecules in the HMDB under a specific mass threshold. This function offers precomputed bounds using all molecules with mass values under 500, 1000, 1500 and 2000.

Parameters:
  • m (int) – maximum mass of molecules used to build bounds. Valid values are 500, 1000, 1500 or 2000.

  • kwargs – extra arguments passed to the constructor.

Return type:

Self

update_bounds(bounds)#

Update or add new bounds.

Return type:

None

pydantic model tidyms2.chem.Isotope#

Store Isotope mass and abundance information.

field a: pydantic.PositiveInt [Required]#

The atomic mass number

field m: pydantic.PositiveFloat [Required]#

The exact mass

field p: float [Required]#

The isotope abundance

field symbol: str [Required]#

The element symbol

field z: pydantic.PositiveInt [Required]#

The atomic number

to_str()#

Create a string representation of the isotope.

Return type:

str

property d: float#

The mass defect.

property n: int#

The number of neutrons.
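The two derived properties can be illustrated with a small stand-in class. IsotopeSketch is hypothetical and is not the Isotope model; the neutron count follows directly from the mass and atomic numbers, while the mass defect convention used here (exact mass minus mass number) is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IsotopeSketch:
    """Hypothetical stand-in for the Isotope model, for illustration only."""

    z: int  # atomic number
    a: int  # mass number
    m: float  # exact mass

    @property
    def n(self) -> int:
        # Neutrons are the nucleons that are not protons.
        return self.a - self.z

    @property
    def d(self) -> float:
        # Assumed convention: exact mass minus mass number.
        return self.m - self.a


# Carbon-13 with an approximate exact mass.
c13 = IsotopeSketch(z=6, a=13, m=13.00335484)
```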

class tidyms2.chem.PeriodicTable(custom_abundances=None)#

Store information about elements and their isotopes.

Parameters:

custom_abundances (dict[str, dict[int, float]] | None) – override the natural abundances of isotopes. Custom abundances is a dictionary that maps atomic numbers to a dictionary of isotope mass numbers and their abundances. If an abundance is not provided for a particular isotope, it is assumed to be zero. The abundances of the isotopes of each element must be normalized to one.

get_element(element)#

Fetch an element object using its symbol or atomic number.

Return type:

Element

>>> from tidyms2.chem import PeriodicTable
>>> ptable = PeriodicTable()
>>> h = ptable.get_element("H")
>>> c = ptable.get_element(6)

get_isotope(symbol, copy=False)#

Fetch an isotope object from a string representation.

Parameters:
  • symbol (str) – a string representation of the isotope. If only the symbol is provided in the string, the monoisotope is returned.

  • copy (bool) – If set to True a new isotope instance is created.

Return type:

Isotope

>>> from tidyms2.chem import PeriodicTable
>>> ptable = PeriodicTable()
>>> d = ptable.get_isotope("2H")
>>> cl35 = ptable.get_isotope("Cl")

is_monoisotope(isotope)#

Check if an isotope is the most abundant.

Return type:

bool

tidyms2.chem.score_envelope(M, p, Mq, pq, min_sigma_M=0.01, max_sigma_M=0.01, min_sigma_p=0.05, max_sigma_p=0.05)#

Score the similarity between two isotopic envelopes.

Parameters:
  • M (ndarray) – theoretical mass values.

  • p (ndarray) – theoretical abundances.

  • Mq (ndarray) – query mass values.

  • pq (ndarray) – query abundances.

  • min_sigma_M (float) – minimum mass standard deviation

  • max_sigma_M (float) – maximum mass standard deviation

  • min_sigma_p (float) – minimum abundance standard deviation.

  • max_sigma_p (float) – maximum abundance standard deviation.

Returns#

score: float

Number between 0 and 1. Higher values indicate more similar envelopes.

Notes#

The query envelope is compared against the theoretical envelope using a likelihood approach similar to the one described in [1]. The theoretical mass and abundance are assumed to be normal random variables, with mean values defined by M and p and standard deviations computed as follows:

\[\sigma_{M,i} = p_{i} \sigma_{M}^{\textrm{min}} + (1 - p_{i}) \sigma_{M}^{\textrm{max}}\]

where \(\sigma_{M,i}\) is the standard deviation for the i-th element of M, \(p_{i}\) is the i-th element of p, \(\sigma_{M}^{\textrm{max}}\) is max_sigma_M and \(\sigma_{M}^{\textrm{min}}\) is min_sigma_M. The standard deviation of each abundance is computed analogously. Using these values, the likelihood of generating the values Mq and pq from M and p is computed using the error function.
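One way to read the last step is sketched below for a single candidate. It assumes that the score is the product, over all peaks, of the two-sided normal tail probabilities of the mass and abundance deviations, and it takes the per-peak standard deviations as inputs instead of computing them. The function name and the way the terms are combined are assumptions and may differ from the library's implementation.

```python
import math


def envelope_likelihood(M, p, Mq, pq, sigma_M, sigma_p):
    """Score one candidate envelope against a query envelope.

    sigma_M and sigma_p hold one standard deviation per peak.
    """
    score = 1.0
    for i in range(len(Mq)):
        # erfc(|x| / (sqrt(2) * sigma)) is the probability that a normal
        # variable deviates from its mean by at least |x|.
        score *= math.erfc(abs(M[i] - Mq[i]) / (math.sqrt(2) * sigma_M[i]))
        score *= math.erfc(abs(p[i] - pq[i]) / (math.sqrt(2) * sigma_p[i]))
    return score


M = [78.0470, 79.0503]
p = [0.937, 0.061]
sigma_M = [0.01, 0.01]
sigma_p = [0.05, 0.05]

# A query identical to the candidate gets the maximum score.
perfect = envelope_likelihood(M, p, M, p, sigma_M, sigma_p)
# Deviations in mass and abundance lower the score.
shifted = envelope_likelihood(M, p, [78.0520, 79.0600], [0.90, 0.10], sigma_M, sigma_p)
```

Under this reading the score is 1 for a perfect match and decreases toward 0 as the deviations grow relative to the standard deviations.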

References#