Data matrix#
Data matrix implementation.
- class tidyms2.core.matrix.DataMatrix(samples, features, data, validate=True, status=None)#
Bases: object
Storage class for matrix data.
- Parameters:
samples (Sequence[Sample]) – the list of samples in the data matrix. Each sample is associated with a matrix row.
features (Sequence[FeatureGroup]) – the list of features in the data matrix. Each feature is associated with a matrix column.
data (ndarray[tuple[Any,...],dtype[TypeVar(FloatDtype, bound=floating)]]) – A 2D numpy float array with matrix data. The number of rows and columns must match the samples and features length respectively.
validate (bool) – If set to False, will assume that the input data is sanitized. Otherwise, will validate and normalize data before creating the data matrix. Set to True by default.
- class IO(matrix)#
Bases: object
Manage export and import of a data matrix in a variety of formats.
- features_to_csv(path)#
Write feature metadata into a csv file.
- Return type:
None
- features_to_dict()#
Export feature metadata into a dataframe-friendly dictionary format.
- Return type:
dict
- classmethod from_csv(samples_csv, matrix_csv, features_csv)#
Create a data matrix instance from csv data.
- Return type:
- matrix_to_csv(path)#
Write data matrix into a csv file.
- Return type:
None
- matrix_to_dict()#
Export data matrix into a dataframe-friendly dictionary format.
- Return type:
dict
- samples_to_csv(path)#
Write sample metadata into a csv file.
- Return type:
None
- samples_to_dict()#
Export sample metadata into a dataframe-friendly dictionary format.
- Return type:
dict
- class Metrics(matrix)#
Bases: object
Define data matrix metrics computation.
- correlation(field, method=CorrelationMethod.PEARSON)#
Compute the correlation coefficient between features and a sample metadata field.
- Return type:
ndarray[tuple[Any,...],dtype[TypeVar(FloatDtype, bound=floating)]]
- cv(robust=False)#
Compute features coefficient of variation (CV).
\[\textrm{CV} = \frac{S}{\bar{X}}\]
where \(S\) is the sample standard deviation and \(\bar{X}\) is the sample mean.
- Parameters:
robust (bool) – If set to True, will use the sample median absolute deviation and median instead of the standard deviation and mean.
- Return type:
ndarray[tuple[Any,...],dtype[TypeVar(FloatDtype, bound=floating)]]
- Returns:
a dictionary where the keys are group values for each matrix partition and the values are the CV estimation for each feature in the group. NaN values will be obtained if all values in the column are zero or NaN. If robust is set to False and fewer than two values in the column are not NaN, a NaN value will also be obtained.
- Raises:
ValueError – if a sample does not contain a metadata field defined in groupby.
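The CV definition above can be illustrated with plain NumPy. This is a minimal sketch, not the library's implementation: the function name cv_sketch is hypothetical, and the robust branch assumes an unscaled median absolute deviation, which may differ from what tidyms2 uses.

```python
import numpy as np


def cv_sketch(X: np.ndarray, robust: bool = False) -> np.ndarray:
    """Column-wise coefficient of variation, ignoring NaN entries (illustrative only)."""
    X = np.asarray(X, dtype=float)
    # Suppress divide-by-zero / 0-over-0 warnings so constant-zero columns yield NaN quietly.
    with np.errstate(divide="ignore", invalid="ignore"):
        if robust:
            center = np.nanmedian(X, axis=0)
            # Assumption: unscaled median absolute deviation.
            spread = np.nanmedian(np.abs(X - center), axis=0)
        else:
            center = np.nanmean(X, axis=0)
            spread = np.nanstd(X, axis=0, ddof=1)  # sample standard deviation
        return spread / center


# Three samples (rows) and three features (columns).
X = np.array(
    [
        [1.0, 10.0, 0.0],
        [2.0, 10.0, 0.0],
        [3.0, 10.0, 0.0],
    ]
)
cv_values = cv_sketch(X)
# First column: S = 1, mean = 2, so CV = 0.5. The all-zero column gives NaN.
```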
- detection_rate(threshold=0.0)#
Compute the detection rate of features (DR).
- Return type:
ndarray[tuple[Any,...],dtype[TypeVar(FloatDtype, bound=floating)]]
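The entry above does not spell out the formula. A common reading, assumed here and not confirmed by the source, is that the detection rate of a feature is the fraction of samples whose value is strictly greater than threshold. The sketch below uses plain NumPy and a hypothetical function name.

```python
import numpy as np


def detection_rate_sketch(X: np.ndarray, threshold: float = 0.0) -> np.ndarray:
    """Fraction of rows whose value exceeds `threshold`, per column (assumed definition)."""
    X = np.asarray(X, dtype=float)
    with np.errstate(invalid="ignore"):
        # NaN compares as False, so missing values count as not detected.
        detected = X > threshold
    return detected.mean(axis=0)


X = np.array(
    [
        [0.0, 5.0, np.nan],
        [2.0, 5.0, 1.0],
        [0.0, 5.0, np.nan],
        [4.0, 5.0, 3.0],
    ]
)
rates = detection_rate_sketch(X)
```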
- dratio(sample_groups=None, qc_groups=None, robust=False)#
Compute the D-ratio metric for columns.
The D-ratio is defined as the quotient between the standard deviation of QC data (data that is expected to exhibit instrumental variation only) and the standard deviation of sample data (data that presents biological variation).
\[\textrm{D-Ratio} = \frac{S_{\textrm{QC}}}{S_{\textrm{sample}}}\]
where \(S_{\textrm{sample}}\) is the sample standard deviation and \(S_{\textrm{QC}}\) is the QC standard deviation.
A D-ratio of 0.0 means that the technical variance is zero, and all observed variance can be attributed to a biological cause. On the other hand, a D-ratio of 1.0 or larger means that the observed variation is mostly technical.
NaN values in the sample or QC data will be ignored in the computation of the standard deviation.
- Parameters:
sample_groups (list[str] | None) – a list of sample groups with biological variation. If not provided, uses all samples with sample type SampleType.SAMPLE.
qc_groups (list[str] | None) – a list of sample groups with instrumental variation only. If not provided, uses all samples with sample type SampleType.TECHNICAL_QC.
robust (bool) – if set to True, estimate the D-ratio using the median absolute deviation instead.
- Return type:
ndarray[tuple[Any,...],dtype[TypeVar(FloatDtype, bound=floating)]]
- Returns:
a 1D array with the D-ratio of each column. Columns with constant sample values will result in Inf. If both sample and QC columns are constant, the result will be NaN. If robust is set to True and there are fewer than two non-NaN values in either Xqc or Xs columns, NaN values will also be obtained.
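The D-ratio formula and its Inf and NaN edge cases can be reproduced with plain NumPy. This is a minimal sketch under stated assumptions: dratio_sketch is a hypothetical helper that takes the sample and QC rows as two separate arrays, and it covers only the non-robust case.

```python
import numpy as np


def dratio_sketch(X_sample: np.ndarray, X_qc: np.ndarray) -> np.ndarray:
    """Column-wise ratio of QC to sample standard deviation, ignoring NaN (illustrative only)."""
    with np.errstate(divide="ignore", invalid="ignore"):
        s_sample = np.nanstd(np.asarray(X_sample, dtype=float), axis=0, ddof=1)
        s_qc = np.nanstd(np.asarray(X_qc, dtype=float), axis=0, ddof=1)
        return s_qc / s_sample


# Rows are samples, columns are features.
X_sample = np.array(
    [
        [1.0, 5.0, 7.0],
        [3.0, 5.0, 7.0],
        [5.0, 5.0, 7.0],
    ]
)
X_qc = np.array(
    [
        [2.0, 1.0, 7.0],
        [3.0, 2.0, 7.0],
        [4.0, 3.0, 7.0],
    ]
)
d = dratio_sketch(X_sample, X_qc)
# Column 0: S_QC = 1, S_sample = 2, so the D-ratio is 0.5.
# Column 1: constant sample values give Inf.
# Column 2: constant sample and QC values give NaN.
```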
- lod()#
Compute the limit of detection (LOD) using blank samples.
The limit of detection is defined as:
\[\textrm{LOD} = \bar{X}_{\textrm{blank}} + 3 * \bar{S}_{\textrm{blank}}\]
where \(\bar{X}_{\textrm{blank}}\) is the feature mean in the blank samples and \(\bar{S}_{\textrm{blank}}\) is the sample standard deviation of blanks.
- Return type:
ndarray[tuple[Any,...],dtype[TypeVar(FloatDtype, bound=floating)]]
- Returns:
an array with the LOD of each feature. If the LOD cannot be estimated because there are no blank samples or the blank contains only missing values, it will return zero instead.
- loq()#
Compute the limit of quantification (LOQ) using blank samples.
The limit of quantification is defined as:
\[\textrm{LOQ} = \bar{X}_{\textrm{blank}} + 10 * \bar{S}_{\textrm{blank}}\]
where \(\bar{X}_{\textrm{blank}}\) is the feature mean in the blank samples and \(\bar{S}_{\textrm{blank}}\) is the sample standard deviation of blanks.
- Return type:
ndarray[tuple[Any,...],dtype[TypeVar(FloatDtype, bound=floating)]]
- Returns:
an array with the LOQ of each feature. If the LOQ cannot be estimated because there are no blank samples or the blank contains only missing values, it will return zero instead.
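LOD and LOQ share the same form and differ only in the multiplier of the blank standard deviation (3 and 10). The sketch below shows both with plain NumPy. The helper blank_limit_sketch is hypothetical, and mapping every non-estimable value to zero is an assumption based on the return descriptions above, not a copy of the library code.

```python
import warnings

import numpy as np


def blank_limit_sketch(X_blank: np.ndarray, k: float) -> np.ndarray:
    """Compute mean + k * std per column of the blank rows (illustrative only)."""
    X_blank = np.asarray(X_blank, dtype=float)
    if X_blank.shape[0] == 0:
        # No blank samples: fall back to zero for every feature.
        return np.zeros(X_blank.shape[1])
    with warnings.catch_warnings():
        # All-NaN columns make nanmean/nanstd emit RuntimeWarning; silence it here.
        warnings.simplefilter("ignore", RuntimeWarning)
        limit = np.nanmean(X_blank, axis=0) + k * np.nanstd(X_blank, axis=0, ddof=1)
    # Assumption: values that cannot be estimated are reported as zero.
    return np.where(np.isnan(limit), 0.0, limit)


# Three blank samples, three features; the last feature is missing in all blanks.
X_blank = np.array(
    [
        [1.0, 4.0, np.nan],
        [2.0, 4.0, np.nan],
        [3.0, 4.0, np.nan],
    ]
)
lod = blank_limit_sketch(X_blank, k=3)   # mean + 3 * std
loq = blank_limit_sketch(X_blank, k=10)  # mean + 10 * std
```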
- pca(*, n_components=2, normalization=None, scaling=None, return_loadings=False, return_variance=False)#
Compute the PCA scores and loadings of the data matrix.
- Parameters:
n_components (int) – the number of Principal Components to compute.
scaling (ScalingMethod | str | None) – the scaling method applied to X columns. Refer to ScalingMethod for the list of available scaling methods. If set to None, no scaling is applied.
normalization (NormalizationMethod | str | None) – One of the available normalization methods. Refer to NormalizationMethod for the list of available normalization methods. If set to None, do not perform row normalization.
return_loadings (bool) – whether to return the PCA feature loadings.
return_variance (bool) – whether to return the PC variance.
kwargs – params passed to the select_samples() method to perform a PCA analysis with a subset of samples.
- Returns:
A 2D array with PCA scores. If return_loadings is set to True, also include a 2D array with the PCA loadings. If return_variance is set to True, also include a 1D array with PC variances.
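The three outputs described above (scores, loadings and PC variances) can be sketched with a singular value decomposition in plain NumPy. This illustrates the technique only and is not the library's code path: pca_sketch is a hypothetical function, it applies column centering and no normalization or scaling, and the orientation of the loadings array (features in rows) is an assumption.

```python
import numpy as np


def pca_sketch(X: np.ndarray, n_components: int = 2):
    """PCA via SVD of the column-centered matrix (illustrative only)."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)  # center each feature
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]      # shape (n_samples, n_components)
    loadings = Vt[:n_components].T                        # shape (n_features, n_components)
    variance = S[:n_components] ** 2 / (X.shape[0] - 1)  # variance captured by each PC
    return scores, loadings, variance


rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))  # 6 samples, 4 features
scores, loadings, variance = pca_sketch(X, n_components=2)
```

Projecting the centered data onto the loadings reproduces the scores, which is a quick consistency check on any PCA output.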
- class Query(matrix)#
Bases: object
Query API for selecting sample subsets using sample metadata.
Note that the filter and group_by methods are implemented using pure Python. If performance is required, consider using the sql() method, which allows querying sample metadata and features using a DuckDB SQL backend.
- fetch_sample_ids()#
Execute a sample query.
- Return type:
list[tuple[Sequence[str],list[str]]]
- filter(**kwargs)#
Select samples based on metadata fields.
- Parameters:
kwargs – key-value pairs used to select samples. Keys must be a SampleMetadata field. If a scalar value is passed, it is compared for equality with each sample metadata. If a list or tuple is passed, then the metadata field is checked for membership in the iterable. If multiple key-value pairs are provided, samples must pass checks for all pairs.
- Return type:
Self
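The matching rules described for filter (equality for scalars, membership for lists or tuples, and all pairs required) can be sketched in pure Python. The example models sample metadata as plain dictionaries rather than SampleMetadata objects, and the matches helper and the group and batch field names are hypothetical.

```python
def matches(sample: dict, **kwargs) -> bool:
    """Return True if the sample passes every key-value check (illustrative only)."""
    for field, expected in kwargs.items():
        value = sample.get(field)
        if isinstance(expected, (list, tuple)):
            # A list or tuple is treated as a membership test.
            if value not in expected:
                return False
        elif value != expected:
            # A scalar is compared for equality.
            return False
    return True


# Hypothetical metadata records.
samples = [
    {"id": "s1", "group": "control", "batch": 1},
    {"id": "s2", "group": "treated", "batch": 1},
    {"id": "s3", "group": "treated", "batch": 2},
]

treated = [s["id"] for s in samples if matches(s, group="treated", batch=[1, 2])]
first_batch = [s["id"] for s in samples if matches(s, group=["control", "treated"], batch=1)]
```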
- group_by(*args)#
Group samples based on metadata fields.
- Parameters:
args (str) – the list of SampleMetadata fields used to create groups.
- Return type:
Self
- sql(stmt)#
Query data matrix metadata using SQL syntax.
- Parameters:
stmt (str) – the SQL statement to query data.
- add_columns(*columns)#
Add columns to the data matrix.
- Parameters:
columns – the list of columns to add
- Return type:
None
- check_status()#
Check and update the data matrix status.
- Return type:
None
- classmethod combine(*matrices)#
Combine multiple matrices into a single data matrix.
All matrices are assumed to have the same feature groups.
- Return type:
Self
- create_submatrix(sample_ids=None, feature_groups=None)#
Create a submatrix using a subset of samples and/or features.
- Return type:
Self
- property features: Sequence[FeatureGroup]#
The list of features in the matrix.
- get_columns(*groups)#
Retrieve columns from the data matrix.
- Parameters:
groups (int) – the feature groups associated with each column. If no groups are provided then all groups are retrieved.
- Return type:
list[FeatureVector]
- get_data(sample_ids=None, feature_groups=None)#
Retrieve the matrix data in numpy format.
Each row in the array is associated with a sample and each column is associated with a feature.
- Parameters:
sample_ids (list[str] | None) – if provided, return a copy of the data array using the subset of samples provided.
feature_groups (list[int] | None) – if provided, return a copy of the data array using the subset of features provided.
- Return type:
ndarray[tuple[Any,...],dtype[TypeVar(FloatDtype, bound=floating)]]
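The row and column selection described above can be illustrated with plain NumPy. This sketch is independent of tidyms2: select_sketch is a hypothetical helper, and returning rows and columns in the order requested is an assumption about the behavior.

```python
import numpy as np

# Hypothetical labels: one id per row and one feature group per column.
sample_ids = ["s1", "s2", "s3"]
feature_groups = [10, 20, 30, 40]
data = np.arange(12, dtype=float).reshape(3, 4)


def select_sketch(data, sample_ids, feature_groups, rows=None, cols=None):
    """Return a copy of `data` restricted to the requested ids and groups (illustrative only)."""
    row_idx = [sample_ids.index(r) for r in (sample_ids if rows is None else rows)]
    col_idx = [feature_groups.index(c) for c in (feature_groups if cols is None else cols)]
    # np.ix_ builds an open mesh, so the result has shape (len(row_idx), len(col_idx)).
    # Indexing with integer arrays always produces a copy, never a view.
    return data[np.ix_(row_idx, col_idx)]


sub = select_sketch(data, sample_ids, feature_groups, rows=["s3", "s1"], cols=[20, 40])
```

Because the result is a copy, modifying sub leaves the original array untouched.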
- get_feature(group)#
Retrieve a group from the data matrix.
- Parameters:
group (int) – the group label of the feature to retrieve
- Raises:
FeatureGroupNotFound – if the feature is not found in the data matrix.
- Return type:
- get_feature_index(*groups)#
Retrieve the list of indices in the data associated with feature groups.
- Parameters:
groups (int) – the list of feature groups to search
- Return type:
list[int]
- get_n_features()#
Retrieve the number of feature groups in the data matrix.
- Return type:
int
- get_n_samples()#
Retrieve the number of samples in the data matrix.
- Return type:
int
- get_process_status()#
Retrieve the current data matrix status.
- Return type:
- get_rows(*ids)#
Retrieve rows from the data matrix.
- Parameters:
ids (str) – the sample ids associated with each row
- Return type:
list[SampleVector]
- get_sample(sample_id)#
Retrieve a sample from the data matrix.
- Parameters:
sample_id (str) – the id of the sample to retrieve
- Raises:
SampleNotFound – if no sample with the provided id exists in the matrix
- Return type:
- get_sample_index(*sample_ids)#
Retrieve the list of indices in the data associated with samples.
- Parameters:
sample_ids (str) – the list of samples to search
- Return type:
list[int]
- has_feature(group)#
Check if a feature group is stored in the matrix.
- Parameters:
group (int) – the feature group to check
- Return type:
bool
- has_sample(sample_id)#
Check if a sample is stored in the matrix.
- Parameters:
sample_id (str) – the sample id to check
- Return type:
bool
- list_features()#
List all features in the data matrix.
- Return type:
Sequence[FeatureGroup]
- list_sample_field(field)#
Retrieve the field value from all samples.
If a sample does not contain the queried field, it returns None.
- Parameters:
field (str) – the field name to fetch
- Return type:
list
- remove_features(*groups)#
Remove feature groups based on their group labels.
- Parameters:
groups (int) – the group labels to remove
- Return type:
None
- remove_samples(*ids)#
Remove samples based on their ids.
- Parameters:
ids (str) – the list of sample ids to remove
- Return type:
None
- set_columns(*pairs)#
Set column values in the data matrix.
- Parameters:
pairs (tuple[int,ndarray[tuple[Any,...],dtype[TypeVar(FloatDtype, bound=floating)]]]) – a tuple consisting of a feature group and the corresponding column data.
- Return type:
None
- set_data(data)#
Set all values in the data matrix.
- Return type:
None
- set_rows(*pairs)#
Set row values in the data matrix.
- Parameters:
pairs (tuple[str,ndarray[tuple[Any,...],dtype[TypeVar(FloatDtype, bound=floating)]]]) – a tuple consisting of a sample id and the corresponding row data.
- Return type:
None
- property status: DataMatrixProcessStatus#
Data matrix status getter.
- validate()#
Perform a sanity check and normalization of the data matrix.
- Return type:
None