RNApysoforms.gene_filtering

RNApysoforms.gene_filtering(target_gene: str, annotation: DataFrame, expression_matrix: DataFrame | None = None, transcript_id_column: str = 'transcript_id', gene_id_column: str = 'gene_name', order_by_expression_column: str = 'counts', order_by_expression: bool = True, keep_top_expressed_transcripts: str | int = 'all') → DataFrame | tuple

Filters genomic annotations and optionally an expression matrix for a specific gene, with options to order and select top expressed transcripts.

This function filters the provided annotation DataFrame to include only entries corresponding to the specified target_gene, identified using the column specified by gene_id_column. If an expression matrix is provided, it will also be filtered to retain only the entries corresponding to the filtered transcripts based on the transcript_id_column. Additionally, it provides options to order transcripts by their total expression levels and to keep only the top expressed transcripts, specified by keep_top_expressed_transcripts.

Required Columns in `annotation` DataFrame:
- gene_id_column (default “gene_name”): Column containing gene identifiers used for filtering.
- transcript_id_column (default “transcript_id”): Column containing transcript identifiers.

Required Columns in `expression_matrix` DataFrame (if provided):
- transcript_id_column (same as in annotation): Column containing transcript identifiers matching those in annotation.
- order_by_expression_column (default “counts”): Column containing expression values used for ordering and filtering.

Parameters:

target_gene (str) – The gene identifier to filter in the annotation DataFrame.
annotation (pl.DataFrame) – A Polars DataFrame containing genomic annotations. Must include the columns specified by gene_id_column and transcript_id_column.
expression_matrix (pl.DataFrame, optional) – A Polars DataFrame containing expression data. If provided, it will be filtered to match the filtered annotation based on transcript_id_column. Default is None.
transcript_id_column (str, optional) – The column name representing transcript identifiers in both the annotation and expression matrix. Default is ‘transcript_id’.
gene_id_column (str, optional) – The column name in the annotation DataFrame that contains gene identifiers used for filtering. Default is ‘gene_name’.
order_by_expression_column (str, optional) – The column name in the expression matrix that contains expression values used for ordering and filtering. Default is ‘counts’.
order_by_expression (bool, optional) – If True, transcripts will be ordered by their total expression levels in descending order. Default is True.
keep_top_expressed_transcripts (Union[str, int], optional) – Determines the number of top expressed transcripts to keep after ordering by expression levels. Can be ‘all’ to keep all transcripts or an integer to keep the top N transcripts. Default is ‘all’.

Returns:

If expression_matrix is provided, returns a tuple of (filtered_annotation, filtered_expression_matrix).
If expression_matrix is None, returns only the filtered_annotation.
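
Because the return shape depends on whether expression_matrix was passed, code that handles both cases may want to normalize the result first. The following is a minimal sketch: `unpack_result` is a hypothetical helper (not part of RNApysoforms), and plain strings stand in for Polars DataFrames so the snippet runs without any dependencies.

```python
def unpack_result(result):
    """Normalize a gene_filtering-style result to (annotation, expression_or_None).

    Hypothetical helper for illustration; not provided by RNApysoforms.
    """
    if isinstance(result, tuple):
        # An expression matrix was supplied: (filtered_annotation, filtered_expression_matrix)
        annotation, expression = result
        return annotation, expression
    # No expression matrix was supplied: only the filtered annotation comes back
    return result, None


# Stand-in objects (strings) in place of Polars DataFrames
both = unpack_result(("filtered_annotation", "filtered_expression_matrix"))
only_annotation = unpack_result("filtered_annotation")
```

With this pattern, downstream code can always unpack two values and check the second for None.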

Return type:

pl.DataFrame or tuple

Raises:

TypeError – If annotation or expression_matrix is not a Polars DataFrame.
ValueError – If required columns are missing in the annotation or expression_matrix DataFrames.
ValueError – If the expression matrix is empty after filtering.
ValueError – If keep_top_expressed_transcripts is not ‘all’ or a positive integer.
Warning – Issued as a warning (not raised as an exception) if transcripts present in the annotation are missing from the expression matrix.
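
The rule for keep_top_expressed_transcripts (either ‘all’ or a positive integer) can be illustrated with a small check. This is only a sketch of the documented constraint: `validate_keep_top` is a hypothetical helper, not a RNApysoforms function, and rejecting booleans is an assumption of the sketch rather than documented behavior.

```python
def validate_keep_top(value):
    """Return None for 'all' or the positive integer N; raise ValueError otherwise.

    Hypothetical helper illustrating the documented constraint.
    """
    if value == "all":
        return None  # keep every transcript
    # bool is a subclass of int in Python; rejecting it is this sketch's own choice
    if isinstance(value, int) and not isinstance(value, bool) and value > 0:
        return value  # keep the top N transcripts
    raise ValueError(
        f"keep_top_expressed_transcripts must be 'all' or a positive integer, got {value!r}"
    )
```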

Examples

Filter an annotation DataFrame and an expression matrix by a specific gene, ordering transcripts by expression:

>>> import polars as pl
>>> from RNApysoforms import gene_filtering
>>> annotation_df = pl.DataFrame({
...    "gene_name": ["APP", "APP", "APP"],
...    "transcript_id": ["tx1", "tx2", "tx3"]
... })
>>> expression_matrix_df = pl.DataFrame({
...    "transcript_id": ["tx1", "tx2", "tx3"],
...    "counts": [300, 100, 200]
... })
>>> target_gene = "APP"
>>> filtered_annotation, filtered_expression_matrix = gene_filtering(
...    target_gene,
...    annotation_df,
...    expression_matrix=expression_matrix_df,
...    order_by_expression=True
... )

Notes

The function filters the annotation DataFrame to include only entries where gene_id_column matches target_gene.
If an expression_matrix is provided, the function filters it to include only transcripts present in the filtered annotation.
The function checks for transcripts present in the annotation but missing in the expression matrix and issues a warning for such discrepancies.
If order_by_expression is True, transcripts are ordered by their total expression levels computed from the order_by_expression_column in the expression matrix.
If keep_top_expressed_transcripts is an integer, only the top N expressed transcripts are kept after ordering.
If keep_top_expressed_transcripts is ‘all’, all transcripts are kept.
Transcripts present in the expression matrix but absent from the annotation are dropped without a warning; only transcripts present in both are returned.
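
The steps in these notes can be sketched in plain Python, using lists of dicts in place of Polars DataFrames so the example has no dependencies. This is an illustration of the documented behavior, not the library's implementation. `sketch_gene_filtering` and the sample data are hypothetical, column names are fixed to the defaults, ordering is always applied, and restricting and reordering both outputs by the ranked transcripts is an assumption.

```python
import warnings
from collections import defaultdict


def sketch_gene_filtering(target_gene, annotation, expression=None, keep_top="all"):
    """Plain-Python illustration of the documented steps (not the library's code)."""
    # 1. Keep only annotation rows for the target gene.
    ann = [row for row in annotation if row["gene_name"] == target_gene]
    if expression is None:
        return ann

    # 2. Keep only expression rows whose transcript is in the filtered annotation.
    annotated_ids = {row["transcript_id"] for row in ann}
    expr = [row for row in expression if row["transcript_id"] in annotated_ids]
    if not expr:
        raise ValueError("Expression matrix is empty after filtering.")

    # 3. Warn about annotated transcripts that have no expression rows.
    missing = annotated_ids - {row["transcript_id"] for row in expr}
    if missing:
        warnings.warn(f"Transcripts missing from expression matrix: {sorted(missing)}")

    # 4. Rank transcripts by total expression, highest first.
    totals = defaultdict(float)
    for row in expr:
        totals[row["transcript_id"]] += row["counts"]
    ranked = sorted(totals, key=totals.get, reverse=True)

    # 5. Optionally keep only the top N transcripts.
    if keep_top != "all":
        ranked = ranked[:keep_top]

    # Assumption: both outputs are restricted to, and ordered by, the ranked transcripts.
    position = {tx: i for i, tx in enumerate(ranked)}
    ann = sorted((r for r in ann if r["transcript_id"] in position),
                 key=lambda r: position[r["transcript_id"]])
    expr = sorted((r for r in expr if r["transcript_id"] in position),
                  key=lambda r: position[r["transcript_id"]])
    return ann, expr


# Hypothetical sample data: two expression rows (samples) per transcript.
annotation = [
    {"gene_name": "APP", "transcript_id": "tx1"},
    {"gene_name": "APP", "transcript_id": "tx2"},
    {"gene_name": "APP", "transcript_id": "tx3"},
    {"gene_name": "APP", "transcript_id": "tx4"},   # annotated, but no expression rows
    {"gene_name": "MAPT", "transcript_id": "tx9"},  # different gene
]
expression = [
    {"transcript_id": "tx1", "counts": 300}, {"transcript_id": "tx1", "counts": 10},
    {"transcript_id": "tx2", "counts": 100}, {"transcript_id": "tx2", "counts": 5},
    {"transcript_id": "tx3", "counts": 200}, {"transcript_id": "tx3", "counts": 500},
    {"transcript_id": "tx9", "counts": 999},
]

# Record the warning about tx4 instead of printing it to stderr.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    top_annotation, top_expression = sketch_gene_filtering(
        "APP", annotation, expression, keep_top=2
    )

print([row["transcript_id"] for row in top_annotation])  # ['tx3', 'tx1']
```

Here tx3 ranks first because its counts sum to 700 across rows, ahead of tx1 (310) and tx2 (105). The tx9 rows are dropped because that transcript is not annotated for the target gene.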