RNApysoforms.calculate_exon_number

RNApysoforms.calculate_exon_number(annotation: DataFrame, transcript_id_column: str = 'transcript_id') → DataFrame

Assigns exon numbers to exons, CDS, and introns within a genomic annotation dataset based on transcript structure and strand direction.

This function takes a genomic annotation DataFrame and assigns exon numbers to exons, CDS (coding sequence) features, and introns, based on their position within each transcript and the transcript's strand. Exons are numbered sequentially within each transcript in the 5’ to 3’ direction: on the positive strand, numbers increase with genomic start position; on the negative strand, numbers increase as the genomic end position decreases, so exon 1 is the exon with the largest end coordinate. CDS features inherit the number of the exon they overlap, and introns are numbered by their order within the transcript.

Required Columns in `annotation`:
- start: Start position of the feature.
- end: End position of the feature.
- strand: Strand direction of the feature (“+” or “-”).
- type: Type of the feature (must include “exon”; can also include “CDS” and/or “intron”).
- transcript_id_column (default “transcript_id”): Identifier for grouping features into transcripts.
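The column requirements above can be checked before calling the function. The following is a minimal sketch in plain Python, not part of RNApysoforms: the helper name `missing_columns` and the constant `REQUIRED` are hypothetical, and the check only looks at column names, not at their contents or data types.

```python
# Hypothetical pre-check; not the library's own validation code.
REQUIRED = ("start", "end", "strand", "type")

def missing_columns(columns, transcript_id_column="transcript_id"):
    """Return the required column names that are absent from `columns`."""
    required = [*REQUIRED, transcript_id_column]
    return [name for name in required if name not in columns]

# "strand" is absent from this list of column names.
print(missing_columns(["start", "end", "type", "transcript_id"]))
# ['strand']
```

With a Polars DataFrame `df`, the list `df.columns` can be passed as `columns`.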

Parameters:

annotation (pl.DataFrame) – A Polars DataFrame containing genomic annotation data. Must include columns for start and end positions, feature type, strand direction, and a grouping variable (default is ‘transcript_id’). If a different grouping variable is used, specify it using the transcript_id_column parameter.
transcript_id_column (str, optional) – The column name that identifies transcript groups within the DataFrame, by default “transcript_id”.

Returns:

A Polars DataFrame that includes all original annotation data along with a new ‘exon_number’ column. This column assigns exon numbers to exons, CDS, and introns based on their order and relationships within each transcript.

Return type:

pl.DataFrame

Raises:

TypeError – If the annotation parameter is not a Polars DataFrame.
ValueError – If required columns are missing from the annotation DataFrame based on the provided parameters.

Examples

Assign exon numbers to genomic features:

>>> import polars as pl
>>> from RNApysoforms import calculate_exon_number
>>> df = pl.DataFrame({
...    "transcript_id": ["tx1", "tx1", "tx2", "tx2"],
...    "start": [100, 200, 300, 400],
...    "end": [150, 250, 350, 450],
...    "type": ["exon", "exon", "exon", "exon"],
...    "strand": ["+", "+", "-", "-"]
... })

>>> result_df = calculate_exon_number(df)
>>> print(result_df)

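This prints the original DataFrame with an added ‘exon_number’ column. Following the numbering rules described in the Notes, the tx1 exons (positive strand) are numbered in ascending order of start position, and the tx2 exons (negative strand) are numbered in descending order of end position.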

Notes

Exon Numbering:
- For transcripts on the positive strand (“+”), exons are numbered in ascending order based on their start positions.
- For transcripts on the negative strand (“-”), exons are numbered in descending order based on their end positions.
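The two ordering rules above can be illustrated with a small plain-Python sketch. It is not the library's implementation (which operates on Polars DataFrames); the function name `number_exons` and the tuple layout are assumptions made for illustration only.

```python
# Illustrative sketch of the strand-aware ordering rule; not RNApysoforms code.
def number_exons(exons, strand):
    """Number the exons of one transcript.

    exons:  list of (start, end) tuples
    strand: "+" or "-"
    Returns a list of (start, end, exon_number) tuples.
    """
    if strand == "+":
        # Positive strand: ascending start position.
        ordered = sorted(exons, key=lambda e: e[0])
    elif strand == "-":
        # Negative strand: descending end position.
        ordered = sorted(exons, key=lambda e: e[1], reverse=True)
    else:
        raise ValueError(f"unexpected strand: {strand!r}")
    return [(s, e, n) for n, (s, e) in enumerate(ordered, start=1)]

print(number_exons([(200, 250), (100, 150)], "+"))
# [(100, 150, 1), (200, 250, 2)]
print(number_exons([(300, 350), (400, 450)], "-"))
# [(400, 450, 1), (300, 350, 2)]
```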
CDS Assignment:
- CDS regions inherit the exon number of the overlapping exon. Each CDS region must lie within the boundaries of a single exon; otherwise the function may return incorrect results.
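As a rough illustration of the CDS rule, the sketch below looks up the exon that contains each CDS region. It is a plain-Python approximation, not the library's code: the name `assign_cds_numbers` is hypothetical, and returning `None` for a CDS that does not fit inside a single exon is an assumption of this sketch, since the documentation only says results may be incorrect in that case.

```python
# Illustrative sketch of CDS-to-exon assignment; not RNApysoforms code.
def assign_cds_numbers(numbered_exons, cds_regions):
    """Give each CDS region the number of the exon that contains it.

    numbered_exons: list of (start, end, exon_number) tuples
    cds_regions:    list of (start, end) tuples
    Returns a list of (start, end, exon_number_or_None) tuples.
    """
    result = []
    for cds_start, cds_end in cds_regions:
        number = None  # assumption: no containing exon -> None
        for ex_start, ex_end, ex_number in numbered_exons:
            if ex_start <= cds_start and cds_end <= ex_end:
                number = ex_number
                break
        result.append((cds_start, cds_end, number))
    return result

exons = [(100, 150, 1), (200, 250, 2)]
print(assign_cds_numbers(exons, [(120, 150), (200, 230)]))
# [(120, 150, 1), (200, 230, 2)]
```

A CDS such as (140, 210), which spans two exons, matches neither exon in this sketch and illustrates why the input is expected to keep each CDS within one exon.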
Intron Assignment:
- Introns are assigned an exon number based on the order in which they appear in the transcript: the first intron is assigned exon_number 1, the second exon_number 2, and so on.
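The intron rule can be sketched in the same way. This plain-Python example is not the library's implementation; the name `number_introns` is hypothetical, and it assumes that "order in the transcript" means the same strand-aware 5’ to 3’ order used for exons, which the documentation does not state explicitly.

```python
# Illustrative sketch of intron numbering; not RNApysoforms code.
# Assumption: introns follow the same strand-aware order as exons.
def number_introns(introns, strand):
    """Number the introns of one transcript, starting at 1.

    introns: list of (start, end) tuples
    strand:  "+" or "-"
    Returns a list of (start, end, exon_number) tuples.
    """
    if strand == "-":
        ordered = sorted(introns, key=lambda i: i[1], reverse=True)
    else:
        ordered = sorted(introns, key=lambda i: i[0])
    return [(s, e, n) for n, (s, e) in enumerate(ordered, start=1)]

print(number_introns([(151, 199), (251, 299)], "+"))
# [(151, 199, 1), (251, 299, 2)]
print(number_introns([(151, 199), (251, 299)], "-"))
# [(251, 299, 1), (151, 199, 2)]
```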
Data Integrity:
- The function ensures that all required columns are present and correctly formatted before processing.
- If no CDS or introns are present in the data, the function handles these cases gracefully without errors.
Performance:
- Utilizes Polars’ efficient data manipulation capabilities to handle large genomic datasets effectively.