Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

scikit-bio extract genomic features from gff3 file

Is it possible in scikit-bio to extract genomic features stored in a gff3 formatted file from a genome fasta file?

Example:


genome.fasta

>sequence1
ATGGAGAGAGAGAGAGAGAGGGGGCAGCATACGCATCGACATACGACATACATCAGATACGACATACTACTACTATGA

annotation.gff3

#gff-version 3
sequence1   source  gene    1   78  .   +   .   ID=gene1
sequence1   source  mRNA    1   78  .   +   .   ID=transcript1;parent=gene1
sequence1   source  CDS 1   6   .   +   0   ID=CDS1;parent=transcript1
sequence1   source  CDS 73  78  .   +   0   ID=CDS2;parent=transcript1

The desired sequence for the mRNA feature (transcript1) would be the concatination of the two child CDS features. So in this case this would be 'ATGGAGCTATGA'.

like image 383
holmrenser Avatar asked Jul 11 '16 07:07

holmrenser


1 Answers

This feature has been added to scikit-bio, however the version available in bioconda does not support it yet (2017-12-15). The format file for gff3 is present in the Github repository.

You can clone the repo and install it locally using:

$ git clone https://github.com/biocore/scikit-bio.git
$ cd scikit-bio
$ python setup.py install

Following the example given in the file the following code should work:

import io
from skbio.metadata import IntervalMetadata
from skbio.io import read

gff = io.StringIO(open("annotations.gff3", "r").read())
im = read(gff, format='gff3', into=IntervalMetadata, seq_id="sequence1")

print(im)

For me this this raises a FormatIdentificationWarning, but the entries are reported correctly:

4 interval features
-------------------
Interval(interval_metadata=<140154121000104>, bounds=[(0, 78)], fuzzy=[(False, False)], metadata={'source': 'source', 'type': 'gene', 'score': '.', 'strand': '+', 'ID': 'gene1'})
Interval(interval_metadata=<140154121000104>, bounds=[(0, 78)], fuzzy=[(False, False)], metadata={'source': 'source', 'type': 'mRNA', 'score': '.', 'strand': '+', 'ID': 'transcript1', 'parent': 'gene1'})
Interval(interval_metadata=<140154121000104>, bounds=[(0, 6)], fuzzy=[(False, False)], metadata={'source': 'source', 'type': 'CDS', 'score': '.', 'strand': '+', 'phase': 0, 'ID': 'CDS1', 'parent': 'transcript1'})
Interval(interval_metadata=<140154121000104>, bounds=[(72, 78)], fuzzy=[(False, False)], metadata={'source': 'source', 'type': 'CDS', 'score': '.', 'strand': '+', 'phase': 0, 'ID': 'CDS2', 'parent': 'transcript1'})

In the example in the code, the GFF3 and the FASTA file are concatenated in the input string used for the read function. Maybe that can fix this issue. Also I am not 100% sure how you can use the intervals returned to extract the feature.

like image 114
m00am Avatar answered Oct 22 '22 18:10

m00am