RNApysoforms.to_intron

RNApysoforms.to_intron(annotation: DataFrame, transcript_id_column: str = 'transcript_id') → DataFrame

Converts exon coordinates into corresponding intron coordinates within a genomic annotation dataset.

This function identifies introns by calculating the genomic intervals between consecutive exons of each transcript. It returns a DataFrame containing the calculated intron coordinates, with transcripts grouped by the column named in transcript_id_column (typically 'transcript_id'). If intron entries are already present in the input data, the function raises a ValueError.
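The interval calculation described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the library's implementation: the helper name `introns_between_exons` is hypothetical, exons are assumed to be ordered by `start`, and each intron is taken directly from the flanking exon boundaries (previous exon `end`, next exon `start`) with no ±1 adjustment, as the Notes state.

```python
from collections import defaultdict


def introns_between_exons(exons, transcript_id_column="transcript_id"):
    """Return intron records for the gaps between consecutive exons.

    `exons` is a list of dicts with "start", "end", and the grouping column.
    Hypothetical helper for illustration only.
    """
    # Group exon records by transcript.
    by_transcript = defaultdict(list)
    for exon in exons:
        by_transcript[exon[transcript_id_column]].append(exon)

    introns = []
    for transcript_id, group in by_transcript.items():
        # Assumption: exons are ordered by genomic start position.
        ordered = sorted(group, key=lambda e: e["start"])
        # Pair each exon with the one that follows it.
        for previous, following in zip(ordered, ordered[1:]):
            introns.append({
                transcript_id_column: transcript_id,
                "start": previous["end"],    # taken directly from the exon boundary
                "end": following["start"],   # no +1 / -1 adjustment
                "type": "intron",
            })
    return introns


exons = [
    {"transcript_id": "tx1", "start": 100, "end": 150},
    {"transcript_id": "tx1", "start": 200, "end": 250},
    {"transcript_id": "tx1", "start": 300, "end": 350},
]
intron_coords = [(i["start"], i["end"]) for i in introns_between_exons(exons)]
print(intron_coords)  # [(150, 200), (250, 300)]
```

Three exons on one transcript yield two introns, one for each gap between neighbouring exons.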

Parameters:

annotation (pl.DataFrame) – A Polars DataFrame containing genomic annotations, including exon coordinates, with the following required columns:
- seqnames: Chromosome or sequence name.
- start: Start position of the exon.
- end: End position of the exon.
- type: Feature type, expected to include “exon”.
- exon_number: Numerical identifier for exons.
- transcript_id_column: Column used to group transcripts, typically “transcript_id”.
transcript_id_column (str, optional) – Name of the column used to group rows by transcript. Default is 'transcript_id'.

Returns:

A Polars DataFrame containing both exon and intron coordinates, including other genomic features such as CDS if present. The DataFrame includes the following columns:
- seqnames
- start
- end
- type (“exon” or “intron”)
- exon_number
- Additional columns from the input DataFrame.

Return type:

pl.DataFrame

Raises:

TypeError – If annotation is not a Polars DataFrame.
ValueError – If the input DataFrame does not contain the required columns (seqnames, start, end, type, exon_number, and transcript_id_column), or if the input DataFrame already contains introns.
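The two documented ValueError conditions can be illustrated with a small stand-alone check. This is only a sketch: `validate_annotation_columns`, `error_message`, and the error messages are hypothetical and are not taken from the library. With a real Polars DataFrame `df`, the inputs would correspond to `df.columns` and `df["type"].to_list()`.

```python
REQUIRED_COLUMNS = ("seqnames", "start", "end", "type", "exon_number")


def validate_annotation_columns(columns, feature_types,
                                transcript_id_column="transcript_id"):
    """Hypothetical check mirroring the documented ValueError conditions."""
    required = list(REQUIRED_COLUMNS) + [transcript_id_column]
    missing = [name for name in required if name not in columns]
    if missing:
        # Illustrative message; the library's wording may differ.
        raise ValueError(f"Missing required columns: {missing}")
    if "intron" in feature_types:
        raise ValueError("Input already contains intron entries")


def error_message(columns, feature_types):
    """Return the ValueError message, or None if validation passes."""
    try:
        validate_annotation_columns(columns, feature_types)
    except ValueError as exc:
        return str(exc)
    return None


full_columns = ["seqnames", "start", "end", "type", "exon_number",
                "transcript_id", "strand"]

ok = error_message(full_columns, ["exon", "exon", "exon"])
missing_column = error_message(
    ["seqnames", "start", "end", "type", "transcript_id"], ["exon"]
)
has_introns = error_message(full_columns, ["exon", "intron", "exon"])
```

Here `ok` is None, while `missing_column` and `has_introns` each hold the message of the raised ValueError.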

Examples

Convert exons into introns:

>>> import polars as pl
>>> from RNApysoforms import to_intron
>>> df = pl.DataFrame({
...    "seqnames": ["chr1", "chr1", "chr1"],
...    "start": [100, 200, 300],
...    "end": [150, 250, 350],
...    "type": ["exon", "exon", "exon"],
...    "transcript_id": ["tx1", "tx1", "tx1"],
...    "strand": ["+", "+", "+"],
...    "exon_number": [1, 2, 3]
... })

>>> df_with_introns = to_intron(df, transcript_id_column="transcript_id")
>>> print(df_with_introns.head())

This will return a DataFrame with calculated intron positions between the provided exon coordinates.

Notes

The function filters out invalid introns where start or end is null, and introns with length ≤ 1 are discarded.
The input DataFrame must contain the required columns listed above.
The input DataFrame must not already contain intron entries; as stated under Raises, a ValueError is raised if it does. The function generates the intron entries from the exon entries.
Additional genomic features (e.g., CDS) present in the input DataFrame are retained and merged with intron entries.
The function does not adjust intron positions by adding or subtracting 1; intron positions are directly taken from exon boundaries.
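The filtering rule in the first note can be sketched as follows. This is a minimal illustration, not the library's code: the helper name `keep_valid_introns` is hypothetical, and intron length is assumed to be measured as `end - start`.

```python
def keep_valid_introns(introns):
    """Drop introns with a null boundary or a length of 1 or less.

    Assumption: length is measured as end - start.
    """
    valid = []
    for intron in introns:
        start, end = intron["start"], intron["end"]
        if start is None or end is None:
            continue  # invalid: a boundary is missing
        if end - start <= 1:
            continue  # discarded: length <= 1
        valid.append(intron)
    return valid


candidates = [
    {"start": 150, "end": 200},   # kept: length 50
    {"start": 250, "end": 251},   # dropped: length 1
    {"start": 350, "end": None},  # dropped: null end
]
kept = keep_valid_introns(candidates)
print(kept)  # [{'start': 150, 'end': 200}]
```

Under this assumption, two exons that directly abut (for example, one ending at 250 and the next starting at 251) produce no intron entry.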