Data matrix#

Data matrix implementation.

class tidyms2.core.matrix.DataMatrix(samples, features, data, validate=True, status=None)#

Bases: object

Storage class for matrix data.

Parameters:
  • samples (Sequence[Sample]) – the list of samples in the data matrix. Each sample is associated with a matrix row.

  • features (Sequence[FeatureGroup]) – the list of features in the data matrix. Each feature is associated with a matrix column.

  • data (ndarray[tuple[Any, ...], dtype[TypeVar(FloatDtype, bound= floating)]]) – A 2D numpy float array with matrix data. The number of rows and columns must match the samples and features length respectively.

  • validate (bool) – If set to False, will assume that the input data is sanitized. Otherwise, will validate and normalize data before creating the data matrix. Set to True by default.

class IO(matrix)#

Bases: object

Manage export and import of a data matrix in a variety of formats.

features_to_csv(path)#

Write feature metadata into a csv file.

Return type:

None

features_to_dict()#

Export feature metadata into a dataframe-friendly dictionary format.

Return type:

dict

classmethod from_csv(samples_csv, matrix_csv, features_csv)#

Create a data matrix instance from csv data.

Return type:

DataMatrix

matrix_to_csv(path)#

Write data matrix into a csv file.

Return type:

None

matrix_to_dict()#

Export data matrix into a dataframe-friendly dictionary format.

Return type:

dict

samples_to_csv(path)#

Write sample metadata into a csv file.

Return type:

None

samples_to_dict()#

Export sample metadata into a dataframe-friendly dictionary format.

Return type:

dict

class Metrics(matrix)#

Bases: object

Define data matrix metrics computation.

correlation(field, method=CorrelationMethod.PEARSON)#

Compute the correlation coefficient between features and a sample metadata field.

Return type:

ndarray[tuple[Any, ...], dtype[TypeVar(FloatDtype, bound= floating)]]

cv(robust=False)#

Compute features coefficient of variation (CV).

\[\textrm{CV} = \frac{S}{\bar{X}}\]

where \(S\) is the sample standard deviation and \(\bar{X}\) is the sample mean.

Parameters:

robust (bool) – If set to True will use the sample median absolute deviation and median instead of the standard deviation and mean.

Return type:

ndarray[tuple[Any, ...], dtype[TypeVar(FloatDtype, bound= floating)]]

Returns:

a dictionary where the keys are group values for each matrix partition and the values are the CV estimation for each feature in the group. NaN values will be obtained if all values in the column are zero or NaN. If robust is set to False and fewer than two values in the column are not NaN, a NaN value will also be obtained.

Raises:

ValueError – if a sample does not contain a metadata field defined in groupby.
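The CV of a single feature column can be sketched with the standard library alone. This is an illustration of the formula (spread divided by center), not the library's implementation: the helper name `cv_sketch` is hypothetical, NaN entries are assumed to be skipped, and the robust branch assumes an unscaled median absolute deviation.

```python
import math
import statistics


def cv_sketch(column, robust=False):
    """Coefficient of variation of one feature column, skipping NaN entries."""
    values = [x for x in column if not math.isnan(x)]
    if not values:
        return math.nan  # nothing to estimate from
    if robust:
        center = statistics.median(values)
        # Assumption: unscaled median absolute deviation around the median.
        spread = statistics.median(abs(x - center) for x in values)
    else:
        if len(values) < 2:
            return math.nan  # the sample standard deviation needs two values
        center = statistics.mean(values)
        spread = statistics.stdev(values)
    if center == 0:
        return math.nan  # e.g. an all-zero column
    return spread / center


print(cv_sketch([1.0, 2.0, 3.0]))
print(cv_sketch([1.0, 2.0, 3.0], robust=True))
```

The NaN branches mirror the cases listed in the return description: columns that are all zero or all NaN, and non-robust estimates with fewer than two usable values.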

detection_rate(threshold=0.0)#

Compute the detection rate of features (DR).

Return type:

ndarray[tuple[Any, ...], dtype[TypeVar(FloatDtype, bound= floating)]]
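The entry above does not spell out how the detection rate is computed. A minimal sketch, assuming it is the fraction of samples in which the feature value is strictly greater than threshold and that NaN counts as not detected; both points are assumptions, and `detection_rate_sketch` is a hypothetical helper:

```python
import math


def detection_rate_sketch(column, threshold=0.0):
    """Fraction of entries in a feature column that exceed the threshold."""
    if not column:
        return math.nan  # no samples, no rate
    # NaN never compares greater than the threshold, so it counts as not detected.
    detected = sum(1 for x in column if x > threshold)
    return detected / len(column)


print(detection_rate_sketch([0.0, 1.0, 2.0, math.nan]))
```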

dratio(sample_groups=None, qc_groups=None, robust=False)#

Compute the D-ratio metric for columns.

The D-ratio is defined as the quotient between the standard deviation of QC data (data that is expected to exhibit instrumental variation only) and the standard deviation of sample data (data that presents biological variation).

\[\textrm{D-Ratio} = \frac{S_{\textrm{QC}}}{S_{\textrm{sample}}}\]

where \(S_{\textrm{sample}}\) is the sample standard deviation and \(S_{\textrm{QC}}\) is the QC standard deviation.

A D-ratio of 0.0 means that the technical variance is zero and all observed variance can be attributed to a biological cause. On the other hand, a D-ratio of 1.0 or larger means that the observed variation is mostly technical.

NaN values in the sample or QC data will be ignored in the computation of the standard deviation.

Parameters:
  • sample_groups (list[str] | None) – a list of sample groups with biological variation. If not provided, uses all samples with sample type SampleType.SAMPLE.

  • qc_groups (list[str] | None) – a list of sample groups with instrumental variation only. If not provided, uses all samples with sample type SampleType.TECHNICAL_QC.

  • robust (bool) – if set to True estimate the D-ratio using the median absolute deviation instead.

Return type:

ndarray[tuple[Any, ...], dtype[TypeVar(FloatDtype, bound= floating)]]

Returns:

a 1D array with the D-ratio of each column. Columns with constant sample values will result in Inf. If both sample and QC columns are constant, the result will be NaN. If robust is set to True and there are fewer than two non-NaN values in either Xqc or Xs columns, NaN values will also be obtained.
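The non-robust D-ratio of one column can be sketched as follows. This only illustrates the quotient and the special cases described above (Inf for constant sample values, NaN when both columns are constant); `dratio_sketch` and `_sample_std` are hypothetical helpers, not part of the library.

```python
import math
import statistics


def _sample_std(values):
    """Sample standard deviation ignoring NaN; NaN if fewer than two values remain."""
    clean = [x for x in values if not math.isnan(x)]
    return statistics.stdev(clean) if len(clean) >= 2 else math.nan


def dratio_sketch(qc_column, sample_column):
    """Quotient of the QC and sample standard deviations for one feature."""
    s_qc = _sample_std(qc_column)
    s_sample = _sample_std(sample_column)
    if math.isnan(s_qc) or math.isnan(s_sample):
        return math.nan
    if s_sample == 0.0:
        # Constant sample values: NaN if the QC values are constant too, Inf otherwise.
        return math.nan if s_qc == 0.0 else math.inf
    return s_qc / s_sample


# QC values vary little compared to the study samples, so the ratio is small.
print(dratio_sketch([10.0, 10.5, 9.5], [5.0, 10.0, 15.0]))
```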

lod()#

Compute the limit of detection (LOD) using blank samples.

The limit of detection is defined as:

\[\textrm{LOD} = \bar{X}_{\textrm{blank}} + 3 * \bar{S}_{\textrm{blank}}\]

where \(\bar{X}_{\textrm{blank}}\) is the feature mean in the blank samples and \(\bar{S}_{\textrm{blank}}\) is the sample standard deviation of blanks.

Return type:

ndarray[tuple[Any, ...], dtype[TypeVar(FloatDtype, bound= floating)]]

Returns:

an array with the LOD of each feature. If the LOD cannot be estimated because there are no blank samples or the blank contains only missing values, it will return zero instead.

loq()#

Compute the limit of quantification (LOQ) using blank samples.

The limit of quantification is defined as:

\[\textrm{LOQ} = \bar{X}_{\textrm{blank}} + 10 * \bar{S}_{\textrm{blank}}\]

where \(\bar{X}_{\textrm{blank}}\) is the feature mean in the blank samples and \(\bar{S}_{\textrm{blank}}\) is the sample standard deviation of blanks.

Return type:

ndarray[tuple[Any, ...], dtype[TypeVar(FloatDtype, bound= floating)]]

Returns:

an array with the LOQ of each feature. If the LOQ cannot be estimated because there are no blank samples or the blank contains only missing values, it will return zero instead.
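The LOD and LOQ definitions differ only in the multiplier applied to the blank standard deviation (3 and 10), so both can be sketched with one helper. `blank_limit_sketch` is a hypothetical name, and treating a single blank value as having zero standard deviation is an assumption made here, not documented behavior.

```python
import math
import statistics


def blank_limit_sketch(blank_column, factor):
    """Blank mean plus `factor` times the blank sample standard deviation."""
    clean = [x for x in blank_column if not math.isnan(x)]
    if not clean:
        return 0.0  # no usable blanks: fall back to zero, as described above
    # Assumption: with a single blank value the standard deviation is taken as zero.
    std = statistics.stdev(clean) if len(clean) >= 2 else 0.0
    return statistics.mean(clean) + factor * std


blanks = [1.0, 2.0, 3.0]
lod = blank_limit_sketch(blanks, factor=3)   # limit of detection
loq = blank_limit_sketch(blanks, factor=10)  # limit of quantification
print(lod, loq)
```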

pca(*, n_components=2, normalization=None, scaling=None, return_loadings=False, return_variance=False)#

Compute the PCA scores and loadings of the data matrix.

Parameters:
  • n_components (int) – the number of Principal Components to compute.

  • scaling (ScalingMethod | str | None) – the scaling method applied to X columns. Refer to ScalingMethod for the list of available scaling methods. If set to None, no scaling is applied.

  • normalization (NormalizationMethod | str | None) – One of the available normalization methods. Refer to NormalizationMethod for the list of available normalization methods. If set to None, do not perform row normalization.

  • return_loadings (bool) – whether to return the PCA feature loadings.

  • return_variance (bool) – whether to return the PC variance.

  • kwargs – params passed to select_samples() method to perform a PCA analysis with a subset of samples.

Returns:

A 2D array with PCA scores. If return_loadings is set to True also include a 2D array with the PCA loadings. If return_variance is set to True also include a 1D array with PC variances.
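For orientation, the three quantities named above (scores, loadings, PC variances) can be obtained from a singular value decomposition of the column-centered data. The sketch below uses NumPy directly and is not the library's implementation: `pca_sketch` is a hypothetical helper, no normalization or scaling is applied, and the shape of the loadings (features × components) and the meaning of "PC variance" (variance of each score column) are assumptions.

```python
import numpy as np


def pca_sketch(X, n_components=2):
    """PCA of a samples x features array via SVD of the centered data."""
    Xc = X - X.mean(axis=0)  # center each feature column
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]       # samples x components
    loadings = Vt[:n_components].T                        # features x components
    variance = s[:n_components] ** 2 / (X.shape[0] - 1)   # variance of each score column
    return scores, loadings, variance


rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))  # 6 samples, 4 features of synthetic data
scores, loadings, variance = pca_sketch(X)
print(scores.shape, loadings.shape, variance.shape)
```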

class Query(matrix)#

Bases: object

Query API for selecting sample subsets using sample metadata.

Note that the filter and group_by methods are implemented in pure Python. If performance is required, consider using the sql() method, which queries sample and feature metadata through a DuckDB SQL backend.

fetch_sample_ids()#

Execute a sample query.

Return type:

list[tuple[Sequence[str], list[str]]]

filter(**kwargs)#

Select samples based on metadata fields.

Parameters:

kwargs – key-value pairs used to select samples. Keys must be a SampleMetadata field. If a scalar value is passed, it is compared for equality with each sample metadata. If a list or tuple is passed, then the metadata field is checked for membership in the iterable. If multiple key-value pairs are provided, samples must pass checks for all pairs.

Return type:

Self
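The matching rules described for filter (equality for scalars, membership for lists and tuples, all pairs must pass) can be sketched in plain Python. Plain dictionaries stand in for sample metadata here, and `matches` and the example fields are hypothetical; the real method operates on SampleMetadata objects and returns the query object itself.

```python
def matches(metadata, **criteria):
    """Return True if the metadata passes every key-value check."""
    for field, expected in criteria.items():
        value = metadata.get(field)
        if isinstance(expected, (list, tuple)):
            if value not in expected:  # membership check for lists and tuples
                return False
        elif value != expected:        # equality check for scalars
            return False
    return True


# Hypothetical sample metadata keyed by sample id.
samples = {
    "s1": {"group": "case", "batch": 1},
    "s2": {"group": "case", "batch": 2},
    "s3": {"group": "control", "batch": 1},
    "s4": {"group": "qc", "batch": 1},
}

selected = [
    sample_id
    for sample_id, metadata in samples.items()
    if matches(metadata, group=["case", "control"], batch=1)
]
print(selected)
```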

group_by(*args)#

Group samples based on metadata fields.

Parameters:

args (str) – the list of SampleMetadata fields used to create groups.

Return type:

Self

sql(stmt)#

Query data matrix metadata using SQL syntax.

Parameters:

stmt (str) – the SQL statement to query data.

add_columns(*columns)#

Add columns to the data matrix.

Parameters:

columns – the list of columns to add

Return type:

None

check_status()#

Check and update the data matrix status.

Return type:

None

classmethod combine(*matrices)#

Combine multiple matrices into a single data matrix.

All matrices are assumed to have the same feature groups.

Return type:

Self

create_submatrix(sample_ids=None, feature_groups=None)#

Create a submatrix using a subset of samples and/or features.

Return type:

Self

property features: Sequence[FeatureGroup]#

The list of features in the matrix.

get_columns(*groups)#

Retrieve columns from the data matrix.

Parameters:

groups (int) – the feature groups associated with each column. If no groups are provided then all groups are retrieved.

Return type:

list[FeatureVector]

get_data(sample_ids=None, feature_groups=None)#

Retrieve the matrix data in numpy format.

Each row in the array is associated with a sample and each column is associated with a feature.

Parameters:
  • sample_ids (list[str] | None) – if provided, return a copy of the data array using the subset of samples provided.

  • feature_groups (list[int] | None) – if provided, return a copy of the data array using the subset of features provided.

Return type:

ndarray[tuple[Any, ...], dtype[TypeVar(FloatDtype, bound= floating)]]

get_feature(group)#

Retrieve a feature group from the data matrix.

Parameters:

group (int) – the group label of the feature to retrieve

Raises:

FeatureGroupNotFound – if the feature is not found in the data matrix.

Return type:

FeatureGroup

get_feature_index(*groups)#

Retrieve the list of indices in the data associated with feature groups.

Parameters:

groups (int) – the list of feature groups to search

Return type:

list[int]

get_n_features()#

Retrieve the number of feature groups in the data matrix.

Return type:

int

get_n_samples()#

Retrieve the number of samples in the data matrix.

Return type:

int

get_process_status()#

Retrieve the current data matrix status.

Return type:

DataMatrixProcessStatus

get_rows(*ids)#

Retrieve rows from the data matrix.

Parameters:

ids (str) – the sample ids associated with each row

Return type:

list[SampleVector]

get_sample(sample_id)#

Retrieve a sample from the data matrix.

Parameters:

sample_id (str) – the id of the sample to retrieve

Raises:

SampleNotFound – if no sample with the provided id exists in the matrix

Return type:

Sample

get_sample_index(*sample_ids)#

Retrieve the list of indices in the data associated with samples.

Parameters:

sample_ids (str) – the list of samples to search

Return type:

list[int]

has_feature(group)#

Check if a feature group is stored in the matrix.

Parameters:

group (int) – the feature group to check

Return type:

bool

has_sample(sample_id)#

Check if a sample is stored in the matrix.

Parameters:

sample_id (str) – the sample id to check

Return type:

bool

property io: IO#

Matrix IO methods getter.

list_features()#

List all features in the data matrix.

Return type:

Sequence[FeatureGroup]

list_sample_field(field)#

Retrieve the field value from all samples.

If a sample does not contain the queried field, it returns None.

Parameters:

field (str) – the field name to fetch

Return type:

list

list_samples()#

List all samples in the data matrix.

Return type:

Sequence[Sample]

property metrics: Metrics#

Matrix metrics method getter.

property query: Query#

Matrix query methods getter.

remove_features(*groups)#

Remove feature groups based on their group labels.

Parameters:

groups (int) – the group labels to remove

Return type:

None

remove_samples(*ids)#

Remove samples based on their ids.

Parameters:

ids (str) – the list of sample ids to remove

Return type:

None

property samples: Sequence[Sample]#

The list of samples in the matrix.

set_columns(*pairs)#

Set column values in the data matrix.

Parameters:

pairs (tuple[int, ndarray[tuple[Any, ...], dtype[TypeVar(FloatDtype, bound= floating)]]]) – a tuple consisting of a feature group and the corresponding column data.

Return type:

None

set_data(data)#

Set all values in the data matrix.

Return type:

None

set_rows(*pairs)#

Set row values in the data matrix.

Parameters:

pairs (tuple[str, ndarray[tuple[Any, ...], dtype[TypeVar(FloatDtype, bound= floating)]]]) – a tuple consisting of a sample id and the corresponding row data.

Return type:

None

property status: DataMatrixProcessStatus#

Data matrix status getter.

validate()#

Perform a sanity check and normalization of the data matrix.

Return type:

None

pydantic model tidyms2.core.matrix.FeatureVector#

Bases: BaseVector

Data matrix column view.

field feature: FeatureGroup [Required]#

The feature information associated with the matrix column.

pydantic model tidyms2.core.matrix.SampleVector#

Bases: BaseVector

Data matrix row.

field sample: Sample [Required]#

The sample associated with the row.