RNApysoforms.shorten_gaps

RNApysoforms.shorten_gaps(annotation: DataFrame, transcript_id_column: str = 'transcript_id', target_gap_width: int = 100) → DataFrame

Shortens intron and transcript start gaps between exons in genomic annotations to enhance visualization.

This function processes genomic annotations by shortening intron gaps and gaps at the start of transcripts to a specified target width, while preserving exon and CDS regions. The goal is to improve the clarity of transcript visualizations by reducing the visual space occupied by long intron regions. Transcripts are aligned so that rescaling is consistent across them and their relative structure is maintained.
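To illustrate the idea of capping gaps while preserving exon widths, here is a minimal pure-Python sketch for a single transcript. It is not the library's implementation: the helper name `shorten_gaps_sketch`, the `(start, end)` tuple representation, and the choice to start rescaled coordinates at 0 are all assumptions made for illustration.

```python
def shorten_gaps_sketch(exons, target_gap_width=100):
    """Cap the gaps between exons at target_gap_width (illustrative only).

    exons: list of non-overlapping (start, end) tuples for one transcript.
    Returns a list of (rescaled_start, rescaled_end) tuples, sorted by start.
    """
    rescaled = []
    cursor = 0        # assumption: rescaled coordinates begin at 0
    prev_end = None
    for start, end in sorted(exons):
        if prev_end is not None:
            gap = start - prev_end
            # Gaps wider than the target shrink to it; narrower gaps are kept.
            cursor += min(gap, target_gap_width)
        width = end - start  # exon width is never changed
        rescaled.append((cursor, cursor + width))
        cursor += width
        prev_end = end
    return rescaled


exons = [(100, 150), (200, 250), (500, 600)]
print(shorten_gaps_sketch(exons, target_gap_width=50))
# [(0, 50), (100, 150), (200, 300)]
```

In this sketch the 250-wide gap between the second and third exons shrinks to 50, while the 50-wide gap and all exon widths are unchanged.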

Parameters:

annotation (pl.DataFrame) – A Polars DataFrame containing genomic annotations, including exons and optionally CDS and intron data. Required columns include:
    - ‘start’: Start position of the feature.
    - ‘end’: End position of the feature.
    - ‘type’: Feature type, expected to include ‘exon’ and optionally ‘intron’ and ‘CDS’.
    - ‘strand’: Strand information (‘+’ or ‘-’).
    - ‘seqnames’: Chromosome or sequence name.
    - The column named by transcript_id_column (typically ‘transcript_id’): used to group features by transcript.
transcript_id_column (str, optional) – The column used to group transcripts, by default “transcript_id”. This identifies individual transcripts within the annotation data.
target_gap_width (int, optional) – The maximum width for intron gaps and transcript start gaps after shortening. Gaps wider than this will be reduced to this size. Default is 100.

Returns:

A Polars DataFrame with shortened intron and transcript start gaps and rescaled coordinates for exons, introns, and CDS regions. The DataFrame includes:
    - Original columns from the input DataFrame.
    - ‘rescaled_start’: The rescaled start position after shortening gaps.
    - ‘rescaled_end’: The rescaled end position after shortening gaps.

Return type:

pl.DataFrame

Raises:

TypeError – If ‘annotation’ is not a Polars DataFrame.
ValueError – If required columns are missing in the input DataFrame; if exons are not from a single chromosome and strand when calculating gaps; or if there are no common columns to join on between CDS and exons when processing CDS regions.
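A caller who wants to catch a missing-column problem before invoking the function could run a small pre-check like the sketch below. The helper `missing_columns` is hypothetical and not part of RNApysoforms; it only encodes the required column names listed under Parameters.

```python
# Required column names as listed under Parameters (the transcript ID
# column is configurable, so it is added separately).
REQUIRED_COLUMNS = ("start", "end", "type", "strand", "seqnames")


def missing_columns(columns, transcript_id_column="transcript_id"):
    """Return the required column names absent from `columns`.

    Hypothetical helper for illustration; `columns` is any iterable of
    column names, for example the `columns` attribute of a Polars DataFrame.
    """
    required = list(REQUIRED_COLUMNS) + [transcript_id_column]
    present = set(columns)
    return [name for name in required if name not in present]


print(missing_columns(["start", "end", "type"]))
# ['strand', 'seqnames', 'transcript_id']
```

With a Polars DataFrame `df`, the check would be `missing_columns(df.columns)`; an empty list means all required columns are present.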

Examples

Shorten intron and transcript start gaps in a genomic annotation DataFrame:

>>> import polars as pl
>>> from RNApysoforms import shorten_gaps
>>> df = pl.DataFrame({
...    "transcript_id": ["tx1", "tx1", "tx1"],
...    "start": [100, 200, 500],
...    "end": [150, 250, 600],
...    "type": ["exon", "exon", "exon"],
...    "strand": ["+", "+", "+"],
...    "seqnames": ["chr1", "chr1", "chr1"],
...    "exon_number": [1, 2, 3]
... })

>>> shortened_df = shorten_gaps(df, transcript_id_column="transcript_id", target_gap_width=50)
>>> print(shortened_df.head())

This returns a DataFrame in which intron and transcript start gaps have been shortened to a maximum width of 50, with ‘rescaled_start’ and ‘rescaled_end’ columns added for visualization.

Notes

The function ensures that exon and CDS regions maintain their original lengths, while intron gaps and transcript start gaps are shortened.
If intron entries are not present in the input DataFrame, the function generates them using the ‘to_intron’ function.
The input DataFrame must contain the required columns listed above.
The function processes gaps at the start of transcripts to align transcripts for consistent rescaling.
After shortening gaps, the coordinates are rescaled to maintain the relative positions of features within and across transcripts.
The function returns the rescaled DataFrame with original columns plus ‘rescaled_start’ and ‘rescaled_end’.
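The note about generating missing intron entries can be pictured with the sketch below, which derives introns as the gaps between consecutive exons of one transcript. This is an illustration only, not the behavior of ‘to_intron’: the helper name `introns_from_exons` and the coordinate convention (an intron spans from the previous exon's end to the next exon's start) are assumptions.

```python
def introns_from_exons(exons):
    """Derive intron intervals from one transcript's exons (illustrative only).

    exons: list of (start, end) tuples.
    Returns (intron_start, intron_end) tuples, one per gap between
    consecutive exons. Assumed convention: the intron runs from the
    previous exon's end to the next exon's start.
    """
    ordered = sorted(exons)
    return [
        (prev_end, next_start)
        for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:])
        if next_start > prev_end  # skip touching or overlapping exons
    ]


print(introns_from_exons([(100, 150), (200, 250), (500, 600)]))
# [(150, 200), (250, 500)]
```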