How to use Biopython to translate a series of DNA sequences in a FASTA file and extract the protein sequences into a separate file?

Question

I am new to Biopython (and coding in general) and am trying to translate a series of DNA sequences (more than 80) into protein sequences and write them to a separate FASTA file. I also want to pick the translation in the correct reading frame for each sequence.

Here's what I have so far:

from Bio import SeqIO
from Bio.SeqRecord import SeqRecord

for record in SeqIO.parse("dnaseq.fasta", "fasta"):
    protein_id = record.id
    protein1 = record.seq.translate(to_stop=True)
    protein2 = record.seq[1:].translate(to_stop=True)
    protein3 = record.seq[2:].translate(to_stop=True)

if len(protein1) > len(protein2) and len(protein1) > len(protein3):
    protein = protein1
elif len(protein2) > len(protein1) and len(protein2) > len(protein3):
    protein = protein2
else:
    protein = protein3

def prot_record(record):
    return SeqRecord(seq = protein, \
             id = ">" + protein_id, \
             description = "translated sequence")

records = map(prot_record, SeqIO.parse("dnaseq.fasta", "fasta"))
SeqIO.write(records, "AAseq.fasta", "fasta")

The problem with my current code is that, while it seems to work, it only gives the last sequence of the input file. Can anyone help me figure out how to write all of the sequences?

Thank you!

merv · Accepted Answer

As mentioned by others, your loop runs through the entire input before anything is written, so only the values from the final iteration are still around when the output is produced. I wanted to suggest how one might do this with a streaming approach:

from Bio import SeqIO
from Bio.SeqRecord import SeqRecord

with open("AAseq.fasta", 'w') as aa_fa:
    for dna_record in SeqIO.parse("dnaseq.fasta", 'fasta'):
        # use both fwd and rev sequences
        dna_seqs = [dna_record.seq, dna_record.seq.reverse_complement()]

        # generate all translation frames
        aa_seqs = (s[i:].translate(to_stop=True) for i in range(3) for s in dna_seqs)

        # select the longest one
        max_aa = max(aa_seqs, key=len)

        # write new record
        aa_record = SeqRecord(max_aa, id=dna_record.id, description="translated sequence")
        SeqIO.write(aa_record, aa_fa, 'fasta')

The main improvements here are:

Each record is translated and written within its own iteration, which keeps memory usage low.
Reverse complements are translated as well, so all six reading frames are considered.
The translated frames come from a generator expression, and only the longest one is kept.
max with key=len replaces the if...elif...else chain.
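
To make the difference between the two patterns concrete, here is a minimal sketch that needs no Biopython at all. Plain strings stand in for the records, and translate_frames is a hypothetical placeholder for the real translation step (it just slices at offsets 0, 1 and 2), so treat it as an illustration of the control flow rather than of the biology.

```python
# Plain strings stand in for SeqRecord objects. `translate_frames` is a
# hypothetical stand-in for translation: it only slices at each offset.
def translate_frames(seq):
    """Return one candidate per reading-frame offset (0, 1, 2)."""
    return [seq[i:] for i in range(3)]

records = ["AAAA", "CCCCC", "GG"]

# Pattern from the question: the loop only rebinds `chosen`, so once the
# loop has finished it holds the candidate for the last record only.
for rec in records:
    chosen = max(translate_frames(rec), key=len)
buggy_output = [chosen for _ in records]

# Streaming pattern: select and emit the result inside the loop body.
def stream(recs):
    for rec in recs:
        yield max(translate_frames(rec), key=len)

fixed_output = list(stream(records))

print(buggy_output)
print(fixed_output)
```

The first list repeats the final record's result for every input, which matches the symptom described in the question; the second has one result per record because the selection happens before the loop moves on.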

Tags: python, parsing, bioinformatics, biopython, fasta

macrosage
