.. _algorithms-peak-extraction:

Peak extraction
===============

This guide describes theoretical background of the :term:`feature` extraction algorithm for LC-MS
datasets, implemented by the :py:class:`~tidyms2.lcms.LCPeakExtractor` operator.

In LC-MS datasets, features are represented as peaks, which  are extracted from :term:`m/z`
traces, as described in the :ref:`algorithms-roi-extraction` guide.

A peak is represented by a start, apex and end, as shown in the following figure:

..  plot:: plots/peak_definition.py
    :caption: Peak start, apex and end.

In the first release of TidyMS, peak picking worked using a modified version of the CWT algorithm,
described in [1]_. In chromatographic data, and in particular in untargeted datasets, optimizing
the parameters to cover the majority of peaks present in the data can be a tricky process. Some of
the problems that may appear while using the CWT algorithm are:

1.  sometimes when a lot of peak overlap occurs, peaks are missing. This is because peaks are
    identified as local maximum in the ridge lines from the wavelet transform. If the widths
    selected don't have the adequate resolution, this local maximum may not be found. Also, it
    is possible to have more than one local maximum in a given ridgeline, which causes to
    select one of them using ad hoc rules.
2.  The Ricker wavelet is the most used wavelet to detect peaks, as it has been demonstrated
    to work very with gaussian peaks. In LC-MS data, is common to find peaks with a certain
    degree of asymmetry (eg. peak tailing). Using the Ricker wavelet in these cases, results
    in a wrong estimation of the peak extension, which in turn results in bad estimates for
    some descriptors such as peak area or peak width.
3.  The interaction between the parameters in the CWT algorithm is rather complex, and
    sometimes it is not very clear how they affect the peak picking process. The user must
    have a clear knowledge of the wavelet transform to interpret parameters such as the SNR.
    Also there are a lot of specific parameters to tune the detection of the ridgelines.

These reasons motivated us to replace the CWT peak picking function with a local maxima based
peak estimation. The current algorithm works as follows:

1.  The noise level in the m/z trace is estimated.
2.  Using the noise level estimation, each point in the m/z trace is classified as baseline o
    signal. Baseline points are interpolated to build the m/z trace baseline.
3.  Peaks apex are detected using :py:func:`scipy.signal.find_peaks`. Peaks with a prominence
    lower than three times the noise level or in regions classified as baseline are removed.
4.  For each peak its extension is determined by finding the closest baseline point to its left
    and right.
5.  If there are overlapping peaks (overlapping peak extensions), the extension is fixed by
    defining a boundary between the peaks as the minimum value between the apex of the two peaks.

We now describe each step in detail.

Noise estimation
----------------

To estimate the noise and baseline, the discrete signal :math:`x[n]` is modelled as three additive
components:

.. math::
    x[n] = s[n] + b[n] + e[n]

:math:`s` is the peak component, which is deterministic, non negative and small except regions
where peaks are present. The baseline :math:`b` is a smooth slow changing function. The noise
term :math:`e[n]` is assumed to be independent and identically distributed (iid) samples from a
gaussian distribution :math:`e[n] \sim N(0, \sigma)`.

If we consider the second finite difference of :math:`x[n]`, :math:`y[n]`:

.. math::
    y[n] = x[n] - 2 x[n-1] + x[n-2]

As :math:`b` is a slow changing function we can ignore its contribution. We expect that the
contribution from :math:`s` in the peak region is greater than the noise contribution, but if we
ignore higher values of :math:`y` we can focus on regions where :math:`s` is small we can say that
most of the variation in :math:`y` is due to the noise:

.. math::
    y[n] \approx (e[n] - 2 e[n-1] + e[n-2])

Within this approximation, we can say that :math:`y[n] \sim N(0, 2\sigma)`. The noise estimation
tries to exploit this fact, estimating the noise from the standard deviation of the second
difference of :math:`x`. The algorithm used can be summarized in the following steps:

1.  Compute the second difference of :math:`x`, :math:`y`.
2.  Set :math:`p=90`, the percentile of the data to evaluate.
3.  compute :math:`y_{p}` the p-th percentile of the absolute value of :math:`y`.
4.  Compute the mean :math:`\overline{y}` and standard deviation :math:`S_{y}` of :math:`y`
    restricted to elements with an absolute value lower than :math:`y_{p}`. This removes the
    contribution of :math:`s`.
5.  If :math:`|\overline{y}| \leq S_{y}` or :math:`p \leq 20` then the noise level is
    :math:`\sigma = 0.5 S_{y}`. Else decrease :math:`p` by 10 and go back to step 3.

If the contribution from :math:`s` is not completely removed, the noise estimation will be biased.
Despite this, this method gives a good enough approximation of the noise level that can be used to
remove noisy peaks.

Baseline  estimation
--------------------

Baseline estimation is done with the following approach: first, every point in :math:`x` is
classified as signal if a peak can potentially be found in the region or as or as baseline
otherwise. Then, the baseline is estimated for the whole signal by interpolating baseline points.

The main task of baseline estimation is then to perform this classification process. To do this,
all local extrema in the signal are searched (including first and last points). Then, we take all
closed intervals defined between consecutive local maxima and minima (or vice versa) and try to
evaluate if there is a significant contribution to the signal coming from :math:`s` in each
interval. If :math:`j` and :math:`k` are the indices defining one such interval, then the sum of
:math:`x` in the interval is:

.. math::
    \sum_{i=j}^{k}x[i] = \sum_{i=j}^{k} s[i] + b[i] + e[i]

If :math:`l = k - j` is the length of the interval, and assuming that :math:`b` is constant in the
interval we can write:

.. math::
    \sum_{i=j}^{k} x[i] - x[j] = \sum_{i=j}^{k} s[i] - s[j] +
    \sum_{i=j}^{k} e[i] -e[j]

.. math::
    a = \sum_{i=j}^{k} x[i] - x[j] = \sum_{i=j}^{k} s[i] - s[j] + e_{sum}

Where :math:`e_{sum} \sim N(0, \sqrt{2l}\sigma)` (we know :math:`\sigma` from the noise estimation).
We can get an idea of the contribution of :math:`s` by using the value of :math:`a` as follows: If
the signal term is contributing to :math:`a`, then the probability of obtaining a value greater than
:math:`a` from noise is going to be small. This can be computed in the following way:

.. math::
    P(|e_{sum}| > |a|)= \textrm{erfc} \left (\frac{|a|}{2\sqrt{l}\sigma}
    \right )

An interval is classified as baseline if this probability is greater than 0.05.

The following figure shows the result of the peak picking algorithm with different :term:`SNR`
levels, baseline shapes and peak widths.

..  plot:: plots/peak_extraction_example.py
    :caption: Peak detection and baseline estimation in noisy signals.

References
----------

..  [1] Pan Du, Warren A. Kibbe, Simon M. Lin, Improved peak detection in mass spectrum by
    incorporating continuous wavelet transform-based pattern matching, Bioinformatics, Volume
    22, Issue 17, 1 September 2006, Pages 2059-2065, https://doi.org/10.1093/bioinformatics/btl355