Working with raw data#

TidyMS2 provides utilities to read raw data in the mzML format. Refer to this guide for details on converting files in proprietary, instrument specific formats to mzML.

The first core component that we will present is MSData. This class provides fast index-based access to spectra and chromatogram data stored in a raw data file. The following code snippet downloads an example mzML file and creates an MSData using it:

import pathlib

from tidyms2 import MSData
from tidyms2.io.datasets import download_dataset

# you may need to provide a GitHub personal access token to download the dataset or
# download the data manually from
# https://github.com/griquelme/tidyms-data/tree/master/test-nist-raw-data
data_dir = pathlib.Path("data")
token = None
download_dataset("test-nist-raw-data", data_dir, token=token)

filename = "NZ_20200227_039.mzML"
data = MSData(data_dir / filename)

print(f"{filename} contains {data.get_n_spectra()} spectra.")

Individual MSSpectrum can be retrieved by index, i.e., the order in which spectra appear in the raw data file:

index = 5
spectrum = data.get_spectrum(index)

MSSpectrum is one of the data models that we mentioned above and it acts as a container for m/z and spectral intensity in a single data scan.

print(spectrum.mz[:10])

MSData acts also as an spectra iterator, allowing to get all data from a file:

for sp in data:
    print(sp.index)

It is often the case that it is only required to iterate over a specific time window or MS level. This information can be passed to the MSData constructor, but this information is better stored using the Sample data model. This model uniquely defines a measurement and it will be a core part to define datasets. A sample points to the raw data file associated with a measurement, but it also includes information from MS level, start and end time and other metadata. The following code snippet reads data defined using a sample model:

from tidyms2 import Sample

sample = Sample(id="sample-39", path=data_dir / filename, ms_level=1, start_time=120.0, end_time=150.0)
data = MSData(sample)

for sp in data:
    print(f"Current scan time is {sp.time:.2f} s. MS level is {sp.ms_level}.")

The MSData reads data in a lazy manner, i.e., only the required scans are loaded into memory. In the default configuration, once read, data is cached in to memory for faster retrieval. The MSData cache size can be configured to limit the amount of memory used:

filename = "NZ_20200227_039.mzML"
data = MSData(data_dir / filename, cache= 50 * 1024**2)   # maximum cache size of 50 MiB

Finally, a centroider function may be plugged into an MSData instance to convert profile data to centroid mode as it is fetched from disk:

TODO: complete