RNApysoforms.process_expression_matrix

RNApysoforms.process_expression_matrix(expression_df: DataFrame, metadata_df: DataFrame | None = None, expression_measure_name: str = 'counts', cpm_normalization: bool = False, relative_abundance: bool = False, gene_id_column_name: str | None = 'gene_id', transcript_id_column_name: str = 'transcript_id', metadata_sample_id_column: str = 'sample_id') → DataFrame

Processes an expression matrix DataFrame, optionally merging with metadata, performing CPM normalization, and calculating relative transcript abundance.

This function takes an already-loaded Polars DataFrame containing expression data and processes it by:

- Validating required columns
- Optionally applying Counts Per Million (CPM) normalization
- Optionally calculating relative transcript abundance based on gene counts
- Converting from wide to long format
- Optionally merging with metadata
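The wide-to-long conversion listed above can be pictured with a small plain-Python sketch. It is an illustration only, not the library's implementation: the helper name `melt_wide` is hypothetical, rows are modeled as dictionaries rather than a Polars DataFrame, and the row order of the real output may differ.

```python
def melt_wide(rows, id_columns, variable_name="sample_id", value_name="counts"):
    """Turn one-row-per-transcript records into one row per transcript and sample.

    Hypothetical helper for illustration; not part of RNApysoforms.
    """
    long_rows = []
    for row in rows:
        # Identifier columns stay fixed and are repeated on every output row.
        ids = {col: row[col] for col in id_columns}
        for col, value in row.items():
            if col in id_columns:
                continue
            # Every remaining column is treated as a sample column.
            long_rows.append({**ids, variable_name: col, value_name: value})
    return long_rows


wide = [
    {"transcript_id": "ENST0001", "gene_id": "ENSG001", "sample1": 100, "sample2": 200},
    {"transcript_id": "ENST0002", "gene_id": "ENSG001", "sample1": 50, "sample2": 75},
]

long = melt_wide(wide, id_columns=["transcript_id", "gene_id"])
for record in long:
    print(record)
```

In this sketch, `variable_name` plays the role that metadata_sample_id_column plays in the function, and `value_name` plays the role of expression_measure_name.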

This function is useful when you have already loaded your expression data into a Polars DataFrame and want to process it without reading from a file. If you need to read from a file, use read_expression_matrix() instead.

Expression DataFrame Format Requirements:

- Must be in wide format with samples as columns and transcripts as rows.
- Must contain the column specified by transcript_id_column_name (default “transcript_id”).
- If gene_id_column_name is provided and not None, that column must be present.
- All other columns are assumed to be sample expression values and must contain numeric values.

Example expression DataFrame format:

transcript_id    gene_id    sample1    sample2    sample3
ENST0001         ENSG001    100        200        150
ENST0002         ENSG001    50         75         60
ENST0003         ENSG002    300        250        275
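The format requirements above amount to a few checks that can be sketched in plain Python. This is a hedged illustration, not the library's validation code: the helper name `sample_columns` is hypothetical, the table is modeled as a dictionary of column lists, and the error messages are invented for the example.

```python
from numbers import Real


def sample_columns(table, transcript_id_column_name="transcript_id",
                   gene_id_column_name="gene_id"):
    """Return the sample column names, raising ValueError if the layout is wrong.

    Hypothetical helper for illustration; not part of RNApysoforms.
    """
    id_columns = [transcript_id_column_name]
    if gene_id_column_name is not None:
        id_columns.append(gene_id_column_name)

    missing = [col for col in id_columns if col not in table]
    if missing:
        raise ValueError(f"Missing required column(s): {missing}")

    # Every column that is not an identifier is assumed to hold sample values.
    samples = [col for col in table if col not in id_columns]
    for name in samples:
        for value in table[name]:
            # bool is a subclass of int, so it is excluded explicitly.
            if not isinstance(value, Real) or isinstance(value, bool):
                raise ValueError(f"Column {name!r} is not numeric")
    return samples


table = {
    "transcript_id": ["ENST0001", "ENST0002", "ENST0003"],
    "gene_id": ["ENSG001", "ENSG001", "ENSG002"],
    "sample1": [100, 50, 300],
    "sample2": [200, 75, 250],
    "sample3": [150, 60, 275],
}

samples = sample_columns(table)
print(samples)
```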

Metadata DataFrame Format Requirements (if provided):

- Must contain the column specified by metadata_sample_id_column (default “sample_id”).
- Values in this column must exactly match the sample column names in the expression DataFrame.
- Can contain any number of additional metadata columns.

Example metadata DataFrame format:

sample_id    condition    batch
sample1      control      1
sample2      treated      1
sample3      control      2
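Because sample IDs must match the expression column names exactly, it can help to compare the two sets before calling the function. The sketch below mirrors the documented behavior (an error when nothing overlaps, a warning on partial overlap) in plain Python. The helper name `overlapping_samples` and the message texts are assumptions made for illustration, not the library's own.

```python
import warnings


def overlapping_samples(expression_samples, metadata_sample_ids):
    """Return the sample IDs present in both inputs, sorted.

    Hypothetical helper for illustration; not part of RNApysoforms.
    """
    expr = set(expression_samples)
    meta = set(metadata_sample_ids)
    shared = expr & meta

    if not shared:
        raise ValueError("No overlapping sample IDs between expression data and metadata")
    if shared != expr or shared != meta:
        # Partial overlap: report what is only on one side.
        warnings.warn(
            f"Partial sample overlap; only in expression: {sorted(expr - meta)}, "
            f"only in metadata: {sorted(meta - expr)}",
            UserWarning,
        )
    return sorted(shared)


# Exact match on both sides: no warning is emitted.
shared = overlapping_samples(
    ["sample1", "sample2", "sample3"],
    ["sample1", "sample2", "sample3"],
)
print(shared)
```

The comparison is case-sensitive, so in this sketch "Sample1" and "sample1" count as different samples.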

Parameters:

expression_df (pl.DataFrame) – A Polars DataFrame in wide format containing expression data. Must contain the transcript ID column named by transcript_id_column_name, plus the sample expression columns.
metadata_df (pl.DataFrame, optional) – A Polars DataFrame containing metadata. If provided, will be merged with the expression data on the specified sample identifier column. Default is None.
expression_measure_name (str, optional) – The name to assign to the expression measure column after melting. This will be the name of the column containing the expression values in the long-format DataFrame. Default is “counts”.
cpm_normalization (bool, optional) – Whether to perform Counts Per Million (CPM) normalization on the expression data. If True, CPM values will be calculated for each sample. Default is False.
relative_abundance (bool, optional) – Whether to calculate relative transcript abundance based on gene counts. Requires gene_id_column_name to be provided and not None. Default is False.
gene_id_column_name (str, optional) – The name of the column in the expression DataFrame that contains gene identifiers. This column will remain fixed during data transformation. If provided and relative_abundance is True, relative transcript abundance will be calculated. Default is “gene_id”. If set to None, the gene identifier will not be used.
transcript_id_column_name (str) – The name of the column in the expression DataFrame that contains transcript identifiers. This parameter is required and cannot be None. Default is “transcript_id”.
metadata_sample_id_column (str, optional) – Column name in the metadata DataFrame that identifies samples. This column is used to merge the metadata with the expression data. It is also used as the variable (sample) column name when the expression data is melted to long format. Default is “sample_id”.

Returns:

A Polars DataFrame in long format containing the expression data, and optionally CPM values, relative abundances, and metadata.

Return type:

pl.DataFrame

Raises:

ValueError – Raised in any of the following cases:

- transcript_id_column_name is None.
- Required feature ID columns are missing in the expression DataFrame.
- The expression columns are not numeric.
- Required columns are missing in the metadata DataFrame.
- There are no overlapping sample IDs between expression data and metadata.

Warns:

UserWarning – Issued in either of the following cases:

- relative_abundance is True but gene_id_column_name is None.
- There is only partial overlap of sample IDs between expression data and metadata.

Examples

Process an already-loaded expression matrix DataFrame:

>>> import polars as pl
>>> from RNApysoforms import process_expression_matrix
>>>
>>> # Create a sample expression DataFrame
>>> expr_df = pl.DataFrame({
...     "transcript_id": ["tx1", "tx2", "tx3"],
...     "gene_id": ["gene1", "gene1", "gene2"],
...     "sample1": [100, 150, 300],
...     "sample2": [200, 250, 400]
... })
>>>
>>> # Process the DataFrame
>>> df = process_expression_matrix(
...     expression_df=expr_df,
...     expression_measure_name="counts",
...     cpm_normalization=True,
...     relative_abundance=True
... )
>>> print(df.head())

Process with metadata:

>>> # Create sample metadata DataFrame
>>> metadata_df = pl.DataFrame({
...     "sample_id": ["sample1", "sample2"],
...     "condition": ["control", "treated"]
... })
>>>
>>> # Process with metadata
>>> df = process_expression_matrix(
...     expression_df=expr_df,
...     metadata_df=metadata_df,
...     cpm_normalization=True,
...     relative_abundance=True
... )
>>> print(df.head())

Notes

The transcript_id_column_name is set to “transcript_id” by default. The parameter is required and cannot be None.
If CPM normalization is performed, the expression measures will be scaled to reflect Counts Per Million for each sample.
If gene_id_column_name is provided and relative_abundance is True, relative transcript abundance is calculated as (transcript_expression / total_gene_expression) * 100. If the total gene counts are zero, the relative abundance is set to zero to avoid division by zero errors.
A UserWarning is issued if there is only partial sample overlap between expression data and metadata.
The resulting DataFrame is returned in long format, with one row per sample-feature combination containing the expression measure and, when requested, the CPM value and relative abundance.
Avoid setting cpm_normalization or relative_abundance to True when the input is a non-raw (i.e., already normalized) counts matrix: the results may be inaccurate and lead to misinterpretation.
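The arithmetic behind the two optional calculations can be written out in a few lines of plain Python. This is a minimal sketch, not the library's code. It uses the standard CPM definition (count divided by the sample's total count, times one million) and the relative-abundance formula stated in these notes. The helper names `cpm` and `relative_abundance` are hypothetical, and returning zeros for an all-zero sample in `cpm` is an assumption of this sketch.

```python
from collections import defaultdict


def cpm(counts):
    """Scale one sample's counts so that they sum to one million."""
    total = sum(counts)
    # Assumption of this sketch: an all-zero sample yields zeros.
    if total == 0:
        return [0.0 for _ in counts]
    return [count / total * 1_000_000 for count in counts]


def relative_abundance(gene_ids, counts):
    """Percentage of each gene's total contributed by each transcript (one sample)."""
    gene_totals = defaultdict(float)
    for gene, count in zip(gene_ids, counts):
        gene_totals[gene] += count
    # A gene with a zero total gets 0.0 instead of a division-by-zero error.
    return [
        count / gene_totals[gene] * 100 if gene_totals[gene] else 0.0
        for gene, count in zip(gene_ids, counts)
    ]


gene_ids = ["ENSG001", "ENSG001", "ENSG002"]
sample1 = [100, 50, 300]

sample1_cpm = cpm(sample1)
sample1_rel = relative_abundance(gene_ids, sample1)
print(sample1_cpm)
print(sample1_rel)
```

Both calculations divide by totals taken over the input values, which is why they are only meaningful on raw counts.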