RNApysoforms.read_ensembl_gtf

RNApysoforms.read_ensembl_gtf(path: str) → DataFrame

Reads a GTF (Gene Transfer Format) file and returns the data as a Polars DataFrame.

This function parses an ENSEMBL GTF file to extract genomic features, specifically focusing on ‘exon’ and ‘CDS’ (Coding DNA Sequence) feature types. It extracts key attributes from the ‘attributes’ column, such as gene_id, gene_name, transcript_id, transcript_name, transcript_biotype, and exon_number. The function performs several validation checks on the file path and file content, and handles missing values by substituting default values where necessary. The resulting DataFrame includes detailed information about each transcript feature, suitable for downstream genomic analyses.

Expected Columns in GTF File: The file is expected to be an ENSEMBL GTF file with no header row and the following nine tab-separated columns:

  1. seqnames (chromosome or sequence name)

  2. source (annotation source)

  3. type (feature type, e.g., ‘exon’, ‘CDS’)

  4. start (start position of the feature)

  5. end (end position of the feature)

  6. score (score value, often ‘.’)

  7. strand (strand information, ‘+’ or ‘-’)

  8. phase (reading frame phase)

  9. attributes (semicolon-separated key-value pairs)
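The column layout above can be illustrated with a minimal plain-Python sketch. It is not the library's implementation (which, per the Notes, works on a lazily evaluated Polars frame); the helper names `parse_gtf_line`, `parse_attributes`, and `keep_feature`, the regular expression, and the example line with its placeholder identifiers are all assumptions made for illustration.

```python
import re

# Column names as listed above; everything else here is an illustrative sketch.
GTF_COLUMNS = [
    "seqnames", "source", "type", "start", "end",
    "score", "strand", "phase", "attributes",
]

# Feature types the function is documented to keep.
KEPT_TYPES = {"exon", "CDS"}

# Attributes are assumed to look like: key "value"; key "value";
ATTRIBUTE_PATTERN = re.compile(r'(\w+) "([^"]*)"')


def parse_gtf_line(line):
    """Split one GTF data line into a dict keyed by column name (hypothetical helper)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(GTF_COLUMNS):
        raise ValueError(f"expected 9 tab-separated columns, got {len(fields)}")
    record = dict(zip(GTF_COLUMNS, fields))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    return record


def parse_attributes(attributes):
    """Turn the attributes column into a dict; a repeated key keeps its last value."""
    return dict(ATTRIBUTE_PATTERN.findall(attributes))


def keep_feature(record):
    """Return True for the feature types retained in the output."""
    return record["type"] in KEPT_TYPES


# Placeholder line for illustration only; the identifiers are not real annotations.
example_line = (
    '1\thavana\texon\t11869\t12227\t.\t+\t.\t'
    'gene_id "GENE0001"; transcript_id "TX0001"; '
    'exon_number "1"; gene_name "DEMO1";'
)

record = parse_gtf_line(example_line)
attrs = parse_attributes(record["attributes"])
print(record["type"], record["start"], record["end"], attrs["gene_id"])
```

A line with any other number of tab-separated fields raises ValueError in this sketch, which loosely mirrors the "required columns are missing" condition described under Raises.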

Parameters:

path (str) – The file path to the ENSEMBL GTF file to be read. The file must have a ‘.gtf’ extension.

Returns:

A Polars DataFrame containing extracted gene and transcript features. The DataFrame includes the following columns:

  • gene_id: Identifier for the gene.

  • gene_name: Name of the gene. If missing, filled with gene_id.

  • transcript_id: Identifier for the transcript.

  • transcript_name: Name of the transcript. If missing, filled with transcript_id.

  • transcript_biotype: Biotype classification of the transcript.

  • seqnames: Chromosome or sequence name.

  • strand: Strand information (‘+’ or ‘-’).

  • type: Feature type (‘exon’ or ‘CDS’).

  • start: Start position of the feature.

  • end: End position of the feature.

  • exon_number: Exon number within the transcript, cast to Int64.

Return type:

pl.DataFrame

Raises:

ValueError – Raised if the file path does not exist, if the path is not a file, if the file does not have a ‘.gtf’ extension, or if required columns are missing or the file cannot be read properly.
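The path checks listed above could look roughly like the following sketch. The helper names `check_gtf_path` and `error_for`, the order of the checks, and the error messages are assumptions for illustration, not the library's actual code.

```python
import os
import tempfile


def check_gtf_path(path):
    """Hypothetical helper mirroring the documented path checks."""
    if not os.path.exists(path):
        raise ValueError(f"File '{path}' does not exist.")
    if not os.path.isfile(path):
        raise ValueError(f"'{path}' is not a file.")
    if not path.endswith(".gtf"):
        raise ValueError(f"File '{path}' must have a '.gtf' extension.")


def error_for(path):
    """Return the validation error message for path, or None if it passes."""
    try:
        check_gtf_path(path)
    except ValueError as err:
        return str(err)
    return None


# Exercise each check against files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "annotation.gtf")
    wrong_ext = os.path.join(tmp, "annotation.txt")
    for name in (good, wrong_ext):
        open(name, "w").close()

    for candidate in (good, wrong_ext, tmp, os.path.join(tmp, "missing.gtf")):
        print(candidate, "->", error_for(candidate))
```

Callers that cannot guarantee a valid path can wrap the call to read_ensembl_gtf in a try/except ValueError block in the same way `error_for` does here.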

Examples

Read a GTF file and display the first few rows:

>>> from RNApysoforms import read_ensembl_gtf
>>> df = read_ensembl_gtf("/path/to/file.gtf")
>>> print(df.head())

Notes

  • The function uses lazy evaluation for reading and processing the file, which is efficient for large GTF files.

  • It filters out feature types other than ‘exon’ and ‘CDS’ to focus on relevant transcript features.

  • Regular expressions are used to extract specific attributes from the ‘attributes’ column.

  • Missing gene_name and transcript_name values are filled with gene_id and transcript_id, respectively.

  • The ‘exon_number’ field is cast to Int64 without strict type enforcement, so missing values are kept as nulls rather than causing an error.

  • The function returns a collected Polars DataFrame after all lazy operations are executed.
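The fallback and casting rules in these notes can be sketched in plain Python as follows. The library applies them as Polars expressions, so this is only an illustration of the logic; the helper names `to_int_or_none` and `fill_defaults` and the placeholder identifiers are assumptions.

```python
def to_int_or_none(value):
    """Convert value to int, returning None instead of raising (non-strict cast)."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return None


def fill_defaults(attrs):
    """Apply the documented fallbacks to one row of extracted attributes."""
    filled = dict(attrs)

    # gene_name falls back to gene_id when it is absent.
    if filled.get("gene_name") is None:
        filled["gene_name"] = filled.get("gene_id")

    # transcript_name falls back to transcript_id when it is absent.
    if filled.get("transcript_name") is None:
        filled["transcript_name"] = filled.get("transcript_id")

    # exon_number becomes an int, or None when missing or not convertible.
    filled["exon_number"] = to_int_or_none(filled.get("exon_number"))
    return filled


# Placeholder rows for illustration only; the identifiers are not real annotations.
with_names = fill_defaults({
    "gene_id": "GENE0001", "gene_name": "DEMO1",
    "transcript_id": "TX0001", "transcript_name": "DEMO1-201",
    "exon_number": "2",
})
without_names = fill_defaults({"gene_id": "GENE0002", "transcript_id": "TX0002"})

print(with_names)
print(without_names)
```

Returning None rather than raising is what allows a row without a usable exon_number to stay in the result instead of aborting the whole read.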