Biopython SeqIO to Pandas Dataframe

I have a FASTA file that can easily be parsed by SeqIO.parse.

I am interested in extracting sequence IDs and sequence lengths. I used these lines to do it, but I feel it's way too heavy (two iterations, conversions, etc.)

from Bio import SeqIO
import pandas as pd


# parse sequence fasta file
identifiers = [seq_record.id for seq_record in SeqIO.parse("sequence.fasta",
                                                           "fasta")]
lengths = [len(seq_record.seq) for seq_record in SeqIO.parse("sequence.fasta",
                                                             "fasta")]
# Convert the lists to pandas Series
s1 = pd.Series(identifiers, name='ID')
s2 = pd.Series(lengths, name='length')
# Gather the Series into a DataFrame and use the ID column as the index
Qfasta = pd.DataFrame(dict(ID=s1, length=s2)).set_index(['ID'])

I could do it with only one iteration, but I get a dict:

records = SeqIO.parse(fastaFile, 'fasta')

and I somehow can't get DataFrame.from_dict to work...

My goal is to iterate over the FASTA file once and get the IDs and sequence lengths into a DataFrame during that single pass.

Here is a short FASTA file for those who want to help.

asked Oct 17 '13 20:10

Sara

1 Answer

You're spot on - you definitely shouldn't be parsing the file twice, and storing the data in a dictionary is a waste of computing resources when you'll just be converting it to numpy arrays later.

SeqIO.parse() returns a generator, so you can iterate record by record in a single pass, building both lists as you go:

from Bio import SeqIO

with open('sequence.fasta') as fasta_file:  # Will close handle cleanly
    identifiers = []
    lengths = []
    for seq_record in SeqIO.parse(fasta_file, 'fasta'):  # (generator)
        identifiers.append(seq_record.id)
        lengths.append(len(seq_record.seq))

See Peter Cock's answer for a more efficient way of parsing just IDs and sequences from a FASTA file.
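Putting the single pass and the DataFrame construction together might look like the sketch below. The helper name records_to_frame and the stand-in records are assumptions made for illustration; the function only relies on each record having .id and .seq attributes, which is what the loop above uses.

```python
from collections import namedtuple

import pandas as pd


def records_to_frame(records):
    """Build an ID-indexed DataFrame of sequence lengths in one pass.

    `records` is any iterable of objects exposing `.id` and `.seq`
    (hypothetical helper, not part of Biopython or pandas).
    """
    # One (id, length) tuple per record; the iterable is consumed only once.
    rows = [(rec.id, len(rec.seq)) for rec in records]
    return pd.DataFrame(rows, columns=["ID", "length"]).set_index("ID")


# Stand-in records (made-up data) so the sketch runs without a FASTA file.
Record = namedtuple("Record", ["id", "seq"])
demo_records = [Record("seq1", "ACGT"), Record("seq2", "ACGTACGT")]

Qfasta = records_to_frame(demo_records)
print(Qfasta)
```

With Biopython, the same function could be called as records_to_frame(SeqIO.parse(fasta_file, 'fasta')) inside the with block, so the file is still read only once.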

The rest of your code looks pretty good to me. However, if you really want to optimize for use with pandas, you can read below:

On minimizing memory usage

Consulting the source of pandas.Series (from an older pandas release, where Series subclassed ndarray directly), we can see that data is stored internally as a numpy ndarray:

class Series(np.ndarray, Picklable, Groupable):
    """Generic indexed series (time series or otherwise) object.

    Parameters
    ----------
    data:  array-like
        Underlying values of Series, preferably as numpy ndarray

If you make identifiers an ndarray, it can be used directly in Series without constructing a new array: the copy parameter (default False) prevents a new ndarray from being created when one is not needed. By storing your IDs and lengths in lists, you force Series to coerce each list to an ndarray.
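As a rough illustration of the difference, the sketch below builds one Series from an existing ndarray and one from a list. It uses numeric lengths with made-up values, and passes copy=False explicitly as an assumption about how you would request buffer reuse; whether the buffer is actually shared can depend on your pandas version, so the script prints the check rather than assuming the result.

```python
import numpy as np
import pandas as pd

# Hypothetical sequence lengths already held in an ndarray.
lengths = np.array([4, 8, 15], dtype=np.int64)

# copy=False asks pandas to reuse the existing buffer where it can.
s2 = pd.Series(lengths, name="length", copy=False)

# A plain list has to be converted to a new ndarray first.
s2_from_list = pd.Series([4, 8, 15], name="length")

# Report whether the Series values and the original array share memory.
print(np.shares_memory(lengths, s2.to_numpy()))
```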

Avoid initializing lists

If you know in advance exactly how many sequences you have (and how long the longest ID will be), you could initialize an empty ndarray to hold identifiers like so:

import numpy
num_seqs = 50
max_id_len = 60
# 1-D array of fixed-width text strings, one slot per sequence ID
identifiers = numpy.empty(num_seqs, dtype='U{:d}'.format(max_id_len))

Of course, it's pretty hard to know exactly how many sequences you'll have, or how long the longest ID is, so it's easiest to just let numpy convert from an existing list. However, preallocating is technically the fastest way to store your data for use in pandas.
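If you do know the counts up front, filling preallocated arrays could look like the sketch below. The function name fill_arrays, the stand-in records, and the choice of a Unicode ('U') dtype for the IDs are assumptions for illustration; the record count must match num_seqs exactly, since extra records raise an IndexError and missing ones leave uninitialized slots.

```python
from collections import namedtuple

import numpy as np
import pandas as pd


def fill_arrays(records, num_seqs, max_id_len):
    """Fill preallocated arrays with IDs and lengths in a single pass.

    Hypothetical helper: assumes exactly `num_seqs` records.
    """
    identifiers = np.empty(num_seqs, dtype="U{:d}".format(max_id_len))
    lengths = np.empty(num_seqs, dtype=np.int64)
    for i, rec in enumerate(records):
        identifiers[i] = rec.id  # truncated if longer than max_id_len
        lengths[i] = len(rec.seq)
    return identifiers, lengths


# Stand-in records (made-up data) so the sketch runs without a FASTA file.
Record = namedtuple("Record", ["id", "seq"])
demo_records = [Record("seq1", "ACGT"), Record("seq2", "ACGTACGT")]

ids, lens = fill_arrays(demo_records, num_seqs=2, max_id_len=60)
Qfasta = pd.DataFrame({"length": lens}, index=pd.Index(ids, name="ID"))
print(Qfasta)
```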

answered Sep 19 '22 02:09

David Cain


Tags: python, pandas, biopython, fasta