Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to use Biopython to translate a series of DNA sequences in a FASTA file and extract the Protein sequences into a separate field?

I am new to Biopython (and coding in general) and am trying to code a way to translate a series of DNA sequences (more than 80) into protein sequences, in a separate FASTA file. I want to also find the sequence in the correct reading frame.

Here's what I have so far:

from Bio import SeqIO
from Bio.SeqRecord import SeqRecord

for record in SeqIO.parse("dnaseq.fasta", "fasta"):
    protein_id = record.id
    protein1 = record.seq.translate(to_stop=True)
    protein2 = record.seq[1:].translate(to_stop=True)
    protein3 = record.seq[2:].translate(to_stop=True)

if len(protein1) > len(protein2) and len(protein1) > len(protein3):
    protein = protein1
elif len(protein2) > len(protein1) and len(protein2) > len(protein3):
    protein = protein2
else:
    protein = protein3

def prot_record(record):
    return SeqRecord(seq = protein, \
             id = ">" + protein_id, \
             description = "translated sequence")

records = map(prot_record, SeqIO.parse("dnaseq.fasta", "fasta"))
SeqIO.write(records, "AAseq.fasta", "fasta")

The problem with my current code is that while it seems to work, it only give the last sequence of the input file. Can anyone help me figure out how to write all of the sequences?

Thank you!

like image 883
macrosage Avatar asked Oct 16 '25 15:10

macrosage


1 Answers

As mentioned by others, your code is iterating through the entire input before attempting to write the result. I wanted to suggest how one might do this with a streaming approach:

from Bio import SeqIO
from Bio.SeqRecord import SeqRecord

with open("AAseq.fasta", 'w') as aa_fa:
    for dna_record in SeqIO.parse("dnaseq.fasta", 'fasta'):
        # use both fwd and rev sequences
        dna_seqs = [dna_record.seq, dna_record.seq.reverse_complement()]

        # generate all translation frames
        aa_seqs = (s[i:].translate(to_stop=True) for i in range(3) for s in dna_seqs)

        # select the longest one
        max_aa = max(aa_seqs, key=len)

        # write new record
        aa_record = SeqRecord(max_aa, id=dna_record.id, description="translated sequence")
        SeqIO.write(aa_record, aa_fa, 'fasta')

The main improvements here are:

  1. Individual records are translated and outputted in each iteration, minimizing memory usage.
  2. Adds support for reverse complements.
  3. Translated frames are created through a generator comprehension, and only the longest length one is stored.
  4. Avoids if...elif...else structures by instead using max with a key.
like image 66
merv Avatar answered Oct 18 '25 14:10

merv



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!