RNApysoforms.read_expression_matrix
- RNApysoforms.read_expression_matrix(expression_matrix_path: str, metadata_path: str | None = None, expression_measure_name: str = 'counts', cpm_normalization: bool = False, relative_abundance: bool = False, gene_id_column_name: str | None = 'gene_id', transcript_id_column_name: str = 'transcript_id', metadata_sample_id_column: str = 'sample_id') → DataFrame
Loads and processes an expression matrix, optionally merging with metadata, performing CPM normalization, and calculating relative transcript abundance.
This function reads an expression matrix file and, optionally, a metadata file, merging the two on a specified sample identifier column. It supports performing Counts Per Million (CPM) normalization and calculating relative transcript abundance based on gene counts. The resulting DataFrame is returned in long format, including the expression measures, optional CPM values, relative abundances, and metadata if provided.
Required Columns in Expression Matrix:
- transcript_id_column_name (default “transcript_id”): Identifier for each transcript.
- gene_id_column_name (default “gene_id”): If provided and not None, it must be a column in the expression matrix.
Supported File Formats:
- .csv, .tsv, .txt, .parquet, .xlsx for both expression matrix and metadata files.
- Parameters:
expression_matrix_path (str) – Path to the expression matrix file. Supported file formats include .csv, .tsv, .txt, .parquet, and .xlsx.
metadata_path (str, optional) – Path to the metadata file. If provided, the metadata will be merged with the expression data on the specified sample identifier column. Supported file formats are the same as for expression_matrix_path. Default is None.
expression_measure_name (str, optional) – The name to assign to the expression measure column after melting. This will be the name of the column containing the expression values in the long-format DataFrame. Default is “counts”.
cpm_normalization (bool, optional) – Whether to perform Counts Per Million (CPM) normalization on the expression data. If True, CPM values will be calculated for each sample. Default is False.
relative_abundance (bool, optional) – Whether to calculate relative transcript abundance based on gene counts. Requires gene_id_column_name to be provided and not None. Default is False.
gene_id_column_name (str, optional) – The name of the column in the expression DataFrame that contains gene identifiers. This column will remain fixed during data transformation. If provided and relative_abundance is True, relative transcript abundance will be calculated. Default is “gene_id”. If set to None, the gene identifier will not be used.
transcript_id_column_name (str) – The name of the column in the expression DataFrame that contains transcript identifiers. This parameter is required and cannot be None. Default is “transcript_id”.
metadata_sample_id_column (str, optional) – Column name in the metadata DataFrame that identifies samples. This column is used to merge the metadata and expression data. Default is “sample_id”.
- Returns:
A Polars DataFrame in long format containing the expression data, and optionally CPM values, relative abundances, and metadata.
- Return type:
pl.DataFrame
- Raises:
ValueError –
- If transcript_id_column_name is None.
- If required feature ID columns are missing in the expression matrix.
- If the expression columns are not numeric.
- If required columns are missing in the metadata DataFrame.
- If there are no overlapping sample IDs between expression data and metadata.
- If the file format is unsupported or the file cannot be read.
- Warns:
UserWarning –
- If relative_abundance is True but gene_id_column_name is None.
- If there is partial overlap of sample IDs between expression data and metadata.
Examples
Load an expression matrix, perform CPM normalization, calculate relative transcript abundance, and merge with metadata:
>>> from RNApysoforms import read_expression_matrix
>>> df = read_expression_matrix(
...     expression_matrix_path="counts.csv",
...     metadata_path="metadata.csv",
...     expression_measure_name="counts",
...     cpm_normalization=True,
...     relative_abundance=True
... )
>>> print(df.head())
Notes
The transcript_id_column_name is set to “transcript_id” by default. The parameter is required and cannot be None.
The function supports multiple file formats (.csv, .tsv, .txt, .parquet, .xlsx) for both expression and metadata files.
If CPM normalization is performed, the expression measures will be scaled to reflect Counts Per Million for each sample.
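As a rough illustration of the scaling described above, the sketch below applies the standard CPM definition (a count divided by its sample's total count, times one million) to a single sample in plain Python. The helper name `cpm` and the toy counts are hypothetical, and the handling of an all-zero sample is an assumption; this is not the library's internal implementation.

```python
def cpm(counts):
    """Scale one sample's raw counts to Counts Per Million (standard definition)."""
    total = sum(counts.values())
    if total == 0:
        # Assumption: an all-zero sample maps to zeros instead of raising.
        return {tid: 0.0 for tid in counts}
    return {tid: value / total * 1_000_000 for tid, value in counts.items()}


# Hypothetical raw counts for one sample, keyed by transcript ID.
sample_counts = {"tx1": 30, "tx2": 10, "tx3": 60}
sample_cpm = cpm(sample_counts)
print(sample_cpm)
```

Because each sample is divided by its own total, the CPM values within a sample always sum to one million, which is what makes samples of different sequencing depth comparable.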
If gene_id_column_name is provided and relative_abundance is True, relative transcript abundance is calculated as (transcript_expression / total_gene_expression) * 100. If the total gene counts are zero, the relative abundance is set to zero to avoid division by zero errors.
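The formula above, including the zero-total case, can be sketched in plain Python for a single sample. The function name `relative_abundance` and the example rows are hypothetical and only illustrate the arithmetic; the library itself operates on Polars DataFrames.

```python
from collections import defaultdict


def relative_abundance(rows):
    """Compute per-transcript relative abundance (percent of gene total).

    rows: list of (gene_id, transcript_id, expression) tuples for one sample.
    """
    # First pass: total expression per gene.
    gene_totals = defaultdict(float)
    for gene_id, _, value in rows:
        gene_totals[gene_id] += value

    # Second pass: (transcript_expression / total_gene_expression) * 100.
    result = {}
    for gene_id, transcript_id, value in rows:
        total = gene_totals[gene_id]
        # A zero gene total yields 0 rather than a division-by-zero error.
        result[transcript_id] = (value / total * 100) if total else 0.0
    return result


# Hypothetical data: geneB has no expression at all.
rows = [
    ("geneA", "tx1", 30),
    ("geneA", "tx2", 10),
    ("geneB", "tx3", 0),
    ("geneB", "tx4", 0),
]
abundance = relative_abundance(rows)
print(abundance)
```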
Warnings are raised if there is only partial sample overlap between expression data and metadata.
The resulting DataFrame is returned in long format, with one row per sample-feature combination holding the expression measure and, when requested, the CPM value and relative abundance.
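To make "long format" concrete, here is a minimal plain-Python sketch of melting a wide matrix (one column per sample) into one row per sample-feature combination. The helper `melt_wide_to_long`, the sample names, and the `sample_id` output column name are assumptions made for illustration, not details confirmed by this page.

```python
def melt_wide_to_long(wide_rows, id_columns, measure_name="counts"):
    """Turn rows with one column per sample into one row per (feature, sample)."""
    long_rows = []
    for row in wide_rows:
        # Feature ID columns stay fixed on every output row.
        ids = {col: row[col] for col in id_columns}
        for column, value in row.items():
            if column in id_columns:
                continue
            # Every remaining column is treated as a sample.
            long_rows.append({**ids, "sample_id": column, measure_name: value})
    return long_rows


# Hypothetical wide matrix: two transcripts, two samples.
wide = [
    {"gene_id": "geneA", "transcript_id": "tx1", "s1": 30, "s2": 5},
    {"gene_id": "geneA", "transcript_id": "tx2", "s1": 10, "s2": 15},
]
long_rows = melt_wide_to_long(wide, id_columns=("gene_id", "transcript_id"))
for r in long_rows:
    print(r)
```

Two transcripts across two samples produce four long-format rows, and the `measure_name` argument plays the same role as expression_measure_name in naming the value column.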
The expression_measure_name allows customization of the name of the expression values column in the long-format DataFrame.
If a metadata file is passed, the function expects that the values in the metadata_sample_id_column from the metadata file will be found as column names in the counts matrix file.
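The sample-matching behavior described in these notes (an error when no sample IDs overlap, a warning when the overlap is only partial) might look roughly like the following. `check_sample_overlap` is a hypothetical helper written for this illustration, and the messages are invented; the library's actual checks and wording may differ.

```python
import warnings


def check_sample_overlap(expression_columns, metadata_sample_ids):
    """Return the sample IDs shared by the matrix columns and the metadata."""
    expr = set(expression_columns)
    meta = set(metadata_sample_ids)
    shared = expr & meta
    if not shared:
        # No overlap at all: merging is impossible.
        raise ValueError("No overlapping sample IDs between expression data and metadata.")
    if shared != expr or shared != meta:
        # Partial overlap: continue, but tell the user which side has extras.
        warnings.warn(
            f"Partial sample overlap; unmatched IDs: {sorted(expr ^ meta)}",
            UserWarning,
        )
    return sorted(shared)


# Hypothetical sample names: s1 is missing from metadata, s4 from the matrix.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    shared = check_sample_overlap(["s1", "s2", "s3"], ["s2", "s3", "s4"])

print(shared)
print([str(w.message) for w in caught])
```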
Be cautious about setting cpm_normalization or relative_abundance to True when the input is not a raw counts matrix (i.e., it has already been normalized): the resulting values may be inaccurate and lead to misinterpretation.