RNApysoforms.to_intron

RNApysoforms.to_intron(annotation: DataFrame, transcript_id_column: str = 'transcript_id') → DataFrame

Converts exon coordinates into corresponding intron coordinates within a genomic annotation dataset.

This function identifies introns by calculating the genomic intervals between consecutive exons for each transcript. It returns a DataFrame with the calculated intron coordinates and retains relevant grouping based on the specified transcript_id_column, typically ‘transcript_id’. If intron entries are already present in the input data, the function exits with an error.

Parameters:
  • annotation (pl.DataFrame) – A Polars DataFrame containing genomic annotations, including exon coordinates, with the following required columns:

      - seqnames: Chromosome or sequence name.
      - start: Start position of the exon.
      - end: End position of the exon.
      - type: Feature type, expected to include “exon”.
      - exon_number: Numerical identifier for exons.
      - transcript_id_column: Column used to group transcripts, typically “transcript_id”.

  • transcript_id_column (str, optional) – Name of the column used to group exons by transcript. Defaults to ‘transcript_id’.

Returns:

A Polars DataFrame containing both exon and intron coordinates, including other genomic features such as CDS if present. The DataFrame includes the following columns:

  - seqnames
  - start
  - end
  - type (“exon” or “intron”)
  - exon_number
  - Additional columns from the input DataFrame.

Return type:

pl.DataFrame

Raises:
  • TypeError – If annotation is not a Polars DataFrame.

  • ValueError – If the input DataFrame does not contain the required columns (seqnames, start, end, type, exon_number, and transcript_id_column).

  • ValueError – If the input DataFrame already contains introns.

Examples

Convert exons into introns:

>>> import polars as pl
>>> from RNApysoforms import to_intron
>>> df = pl.DataFrame({
...    "seqnames": ["chr1", "chr1", "chr1"],
...    "start": [100, 200, 300],
...    "end": [150, 250, 350],
...    "type": ["exon", "exon", "exon"],
...    "transcript_id": ["tx1", "tx1", "tx1"],
...    "strand": ["+", "+", "+"],
...    "exon_number": [1, 2, 3]
... })
>>> df_with_introns = to_intron(df, transcript_id_column="transcript_id")
>>> print(df_with_introns.head())

This returns a DataFrame containing the original exon rows together with the intron rows calculated between consecutive exons.

Notes

  • The function filters out invalid introns where start or end is null, and introns with length ≤ 1 are discarded.

  • The input DataFrame must contain the required columns listed above.

  • The input DataFrame must not already contain intron entries; if it does, the function raises a ValueError. Otherwise, the function generates the introns from the exon coordinates.

  • Additional genomic features (e.g., CDS) present in the input DataFrame are retained and merged with intron entries.

  • The function does not adjust intron positions by adding or subtracting 1; intron positions are directly taken from exon boundaries.
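The boundary and filtering rules in these notes can be illustrated with a small pure-Python example. It is a sketch under assumptions, not the library's code: `introns_from_exons` is a hypothetical helper, and intron length is assumed here to mean `end - start`.

```python
def introns_from_exons(exons):
    """Return (start, end) introns for one transcript's (start, end) exons."""
    ordered = sorted(exons)
    introns = []
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        # Boundaries are taken as-is from the flanking exons (no +1 / -1 shift).
        length = next_start - prev_end  # assumed definition of intron length
        if length > 1:  # introns with length <= 1 are discarded
            introns.append((prev_end, next_start))
    return introns

# The second and third exons are adjacent, so no intron is kept between them.
result = introns_from_exons([(100, 150), (200, 250), (251, 300)])
print(result)  # [(150, 200)]
```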