RNApysoforms.gene_filtering

RNApysoforms.gene_filtering(target_gene: str, annotation: DataFrame, expression_matrix: DataFrame | None = None, transcript_id_column: str = 'transcript_id', gene_id_column: str = 'gene_name', order_by_expression_column: str = 'counts', order_by_expression: bool = True, keep_top_expressed_transcripts: str | int = 'all') → DataFrame | tuple

Filters genomic annotations and optionally an expression matrix for a specific gene, with options to order and select top expressed transcripts.

This function filters the provided annotation DataFrame to include only entries corresponding to the specified target_gene, identified using the column specified by gene_id_column. If an expression matrix is provided, it will also be filtered to retain only the entries corresponding to the filtered transcripts based on the transcript_id_column. Additionally, it provides options to order transcripts by their total expression levels and to keep only the top expressed transcripts, specified by keep_top_expressed_transcripts.
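The filter, order, and top-N steps described above can be sketched in plain Python. This is an illustrative stand-in that works on lists of dicts, not the library's Polars implementation; `filter_gene_sketch` and the `GENE_B` rows are hypothetical, and it assumes that only transcripts present in both inputs are returned.

```python
from collections import defaultdict


def filter_gene_sketch(annotation_rows, expression_rows, target_gene, keep_top="all"):
    """Plain-Python stand-in for the filter -> order -> top-N steps."""
    # 1. Keep annotation rows belonging to the target gene.
    ann = [r for r in annotation_rows if r["gene_name"] == target_gene]
    tx_ids = {r["transcript_id"] for r in ann}

    # 2. Keep expression rows whose transcript is in the filtered annotation.
    expr = [r for r in expression_rows if r["transcript_id"] in tx_ids]

    # 3. Sum expression per transcript and rank in descending order.
    totals = defaultdict(float)
    for r in expr:
        totals[r["transcript_id"]] += r["counts"]
    ranked = sorted(totals, key=totals.get, reverse=True)

    # 4. Optionally keep only the top N transcripts.
    if keep_top != "all":
        ranked = ranked[:keep_top]
    rank = {tx: i for i, tx in enumerate(ranked)}

    # 5. Restrict both tables to the kept transcripts, in ranked order.
    ann = sorted((r for r in ann if r["transcript_id"] in rank),
                 key=lambda r: rank[r["transcript_id"]])
    expr = sorted((r for r in expr if r["transcript_id"] in rank),
                  key=lambda r: rank[r["transcript_id"]])
    return ann, expr


annotation_rows = [
    {"gene_name": "APP", "transcript_id": "tx1"},
    {"gene_name": "APP", "transcript_id": "tx2"},
    {"gene_name": "APP", "transcript_id": "tx3"},
    {"gene_name": "GENE_B", "transcript_id": "tx9"},  # hypothetical second gene
]
expression_rows = [
    {"transcript_id": "tx1", "counts": 300},
    {"transcript_id": "tx2", "counts": 100},
    {"transcript_id": "tx3", "counts": 200},
    {"transcript_id": "tx9", "counts": 999},
]

top_ann, top_expr = filter_gene_sketch(annotation_rows, expression_rows, "APP", keep_top=2)
print([r["transcript_id"] for r in top_ann])  # ['tx1', 'tx3']
```

The ranking is computed once from the per-transcript totals and then applied to both tables, so the annotation and expression outputs stay in the same transcript order.
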

Required Columns in `annotation` DataFrame:
  • gene_id_column (default “gene_name”): Column containing gene identifiers used for filtering.

  • transcript_id_column (default “transcript_id”): Column containing transcript identifiers.

Required Columns in `expression_matrix` DataFrame (if provided):
  • transcript_id_column (same as in annotation): Column containing transcript identifiers matching those in annotation.

  • order_by_expression_column (default “counts”): Column containing expression values used for ordering and filtering.
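A pre-flight check of these column requirements might look like the sketch below. `check_required_columns` is a hypothetical helper, not part of RNApysoforms; it takes plain lists of column names (for a Polars DataFrame these are available as `df.columns`) and mirrors the defaults listed above.

```python
def check_required_columns(
    annotation_columns,
    expression_columns=None,
    transcript_id_column="transcript_id",
    gene_id_column="gene_name",
    order_by_expression_column="counts",
):
    """Raise ValueError naming every required column that is absent."""
    problems = {}

    # The annotation always needs the gene and transcript identifier columns.
    missing = [c for c in (gene_id_column, transcript_id_column)
               if c not in annotation_columns]
    if missing:
        problems["annotation"] = missing

    # The expression matrix is optional; check it only when it is supplied.
    if expression_columns is not None:
        missing = [c for c in (transcript_id_column, order_by_expression_column)
                   if c not in expression_columns]
        if missing:
            problems["expression_matrix"] = missing

    if problems:
        raise ValueError(f"Missing required columns: {problems}")


# Passes silently: both inputs carry the default column names.
check_required_columns(["gene_name", "transcript_id"], ["transcript_id", "counts"])
```

Running a check like this before calling the function gives one message that lists every missing column at once.
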

Parameters:
  • target_gene (str) – The gene identifier to filter in the annotation DataFrame.

  • annotation (pl.DataFrame) – A Polars DataFrame containing genomic annotations. Must include the columns specified by gene_id_column and transcript_id_column.

  • expression_matrix (pl.DataFrame, optional) – A Polars DataFrame containing expression data. If provided, it will be filtered to match the filtered annotation based on transcript_id_column. Default is None.

  • transcript_id_column (str, optional) – The column name representing transcript identifiers in both the annotation and expression matrix. Default is ‘transcript_id’.

  • gene_id_column (str, optional) – The column name in the annotation DataFrame that contains gene identifiers used for filtering. Default is ‘gene_name’.

  • order_by_expression_column (str, optional) – The column name in the expression matrix that contains expression values used for ordering and filtering. Default is ‘counts’.

  • order_by_expression (bool, optional) – If True, transcripts will be ordered by their total expression levels in descending order. Default is True.

  • keep_top_expressed_transcripts (Union[str, int], optional) – Determines the number of top expressed transcripts to keep after ordering by expression levels. Can be ‘all’ to keep all transcripts or an integer to keep the top N transcripts. Default is ‘all’.
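The accepted values for keep_top_expressed_transcripts (‘all’ or a positive integer) can be illustrated with a small validator. `validate_keep_top` is a hypothetical helper written for this page; rejecting booleans is a choice made in this sketch, not documented library behaviour.

```python
def validate_keep_top(value):
    """Return None for 'all', the integer for a positive int, else raise ValueError."""
    if isinstance(value, str) and value == "all":
        return None  # no limit: keep every transcript
    # bool is a subclass of int in Python, so exclude it explicitly (sketch's choice).
    if isinstance(value, int) and not isinstance(value, bool) and value > 0:
        return value
    raise ValueError(
        "keep_top_expressed_transcripts must be 'all' or a positive integer, "
        f"got {value!r}"
    )


limit = validate_keep_top(5)        # 5
no_limit = validate_keep_top("all")  # None
```
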

Returns:

  • If expression_matrix is provided, returns a tuple of (filtered_annotation, filtered_expression_matrix).

  • If expression_matrix is None, returns only the filtered_annotation.
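Because the return shape depends on whether expression_matrix was passed, calling code that handles both cases may want to normalise the result first. The sketch below uses a hypothetical helper, `as_pair`, and placeholder strings in place of real DataFrames.

```python
def as_pair(result):
    """Return (annotation, expression_or_None) for either return shape."""
    if isinstance(result, tuple):
        annotation, expression = result  # expression_matrix was provided
        return annotation, expression
    return result, None                  # annotation only


# Placeholder strings stand in for the DataFrames returned by gene_filtering.
pair_from_tuple = as_pair(("filtered_annotation", "filtered_expression_matrix"))
pair_from_single = as_pair("filtered_annotation")
```
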

Return type:

pl.DataFrame or tuple

Raises:
  • TypeError – If annotation or expression_matrix is not a Polars DataFrame.

  • ValueError – If required columns are missing in the annotation or expression_matrix DataFrames.

  • ValueError – If the expression matrix is empty after filtering.

  • ValueError – If keep_top_expressed_transcripts is not ‘all’ or a positive integer.

  • Warning – If there are transcripts present in the annotation but missing in the expression matrix.

Examples

Filter an annotation DataFrame and its expression matrix for a specific gene, ordering transcripts by expression:

>>> import polars as pl
>>> from RNApysoforms import gene_filtering
>>> annotation_df = pl.DataFrame({
...    "gene_name": ["APP", "APP", "APP"],
...    "transcript_id": ["tx1", "tx2", "tx3"]
... })
>>> expression_matrix_df = pl.DataFrame({
...    "transcript_id": ["tx1", "tx2", "tx3"],
...    "counts": [300, 100, 200]
... })
>>> target_gene = "APP"
>>> filtered_annotation, filtered_expression_matrix = gene_filtering(
...    target_gene,
...    annotation_df,
...    expression_matrix=expression_matrix_df,
...    order_by_expression=True
... )

Notes

  • The function filters the annotation DataFrame to include only entries where gene_id_column matches target_gene.

  • If an expression_matrix is provided, the function filters it to include only transcripts present in the filtered annotation.

  • The function checks for transcripts present in the annotation but missing in the expression matrix and issues a warning for such discrepancies.

  • If order_by_expression is True, transcripts are ordered by their total expression levels computed from the order_by_expression_column in the expression matrix.

  • If keep_top_expressed_transcripts is an integer, only the top N expressed transcripts are kept after ordering.

  • If keep_top_expressed_transcripts is ‘all’, all transcripts are kept.

  • Transcripts present in the expression matrix but not in the annotation are dropped without a warning; only transcripts found in both are returned.
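The asymmetric handling of mismatched transcripts described in these notes can be sketched with plain sets. `reconcile_transcripts` is a hypothetical helper; the warning text and the use of `UserWarning` are assumptions of this sketch, not the library's actual message or warning class.

```python
import warnings


def reconcile_transcripts(annotation_ids, expression_ids):
    """Return the sorted transcript IDs present in both inputs.

    Warns about IDs found only in the annotation; IDs found only in the
    expression matrix are dropped without any message.
    """
    annotation_ids = set(annotation_ids)
    expression_ids = set(expression_ids)

    missing_in_expression = annotation_ids - expression_ids
    if missing_in_expression:
        warnings.warn(
            f"{len(missing_in_expression)} transcript(s) in the annotation are "
            f"missing from the expression matrix: {sorted(missing_in_expression)}",
            UserWarning,
        )

    # Expression-only IDs fall out of the intersection silently.
    return sorted(annotation_ids & expression_ids)


# tx3 exists only in the expression data, so it is dropped and no warning is issued.
common = reconcile_transcripts(["tx1", "tx2"], ["tx1", "tx2", "tx3"])
print(common)  # ['tx1', 'tx2']
```
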