Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Trouble parsing FASTA files in Python

dict = {}
tag = ""
with open('/storage/emulated/0/Download/sequence.fasta.txt','r') as sequence:
    seq = sequence.readlines()
    for line in seq:
        if line.startswith(">"):
            tag = line.replace("\n", "")
        else:
            seq = "".join(seq[1:])
            dict[tag] = seq.replace("\n", "")   
    print(dict)

Background for those who arn't familiar with FASTA files. This format contains one or multiple DNA, RNA, or protein sequences with a one-line descriptive tag of the sequence that starts with a ">" and then the sequence in the following lines(Ex. For DNA it would be a lot of repeating of A, T, G, and C). It also comes with many unnecessary line breaks. So far this code works when I only have one sequence per file but it seems to ignore the if condition if there are multiple. For example it should add each new tag: sequence pair into the dictionary everytime it notices a ">" but instead it only runs once and puts the first description as the key in the dictionary and joins the rest of the file regardless of ">" characters and uses that as the value. How can I get this loop to notice a new ">" after the first occurrence?

I am purposefully steering away from the biopython module.

like image 328
Ethan Hetrick Avatar asked Apr 02 '26 04:04

Ethan Hetrick


1 Answers

UPDATE: the code below now works for multiple-line sequences.

The following code works fine for me:

import re
from collections import defaultdict

sequences = defaultdict(str)

with open('fasta.txt') as f:
    lines = f.readlines()

current_tag = None
for line in lines:
    m = re.match('^>(.+)', line)

    if m:
        current_tag = m.group(1)
    else:
        sequences[current_tag] += line.strip()

for k, v in sequences.items():
    print(f"{k}: {v}")

It uses a number of features you may be unfamiliar with, such as regular expressions (which are probably very useful in bioinformatics) and f-string formatting. If anything confuses you, ask away. One thing I should add is that you don't want to define a variable as dict because that will clobber something Python has defined at startup. I chose sequences, which doesn't do this and is more informative.

For reference, this is the content of the example FASTA file fasta.txt I used in this instance:

>seq0
FQTWEEFSRAAEKLYLADPMKVRVVLKYRHVDGNLCIKVTDDLVCLVYRTDQAQDVKKIEKF
>seq1
KYRTWEEFTRAAEKLYQADPMKVRVVLKYRHCDGNLCIKVTDDVVCLLYRTDQAQDVKKIEKFHSQLMRLME LKVTDNKECLKFKTDQAQEAKKMEKLNNIFFTLM
>seq2
EEYQTWEEFARAAEKLYLTDPMKVRVVLKYRHCDGNLCMKVTDDAVCLQYKTDQAQDVKKVEKLHGK
>seq3
MYQVWEEFSRAVEKLYLTDPMKVRVVLKYRHCDGNLCIKVTDNSVCLQYKTDQAQDVK
>seq4
EEFSRAVEKLYLTDPMKVRVVLKYRHCDGNLCIKVTDNSVVSYEMRLFGVQKDNFALEHSLL
>seq5
SWEEFAKAAEVLYLEDPMKCRMCTKYRHVDHKLVVKLTDNHTVLKYVTDMAQDVKKIEKLTTLLMR
>seq6
FTNWEEFAKAAERLHSANPEKCRFVTKYNHTKGELVLKLTDDVVCLQYSTNQLQDVKKLEKLSSTLLRSI
>seq7
SWEEFVERSVQLFRGDPNATRYVMKYRHCEGKLVLKVTDDRECLKFKTDQAQDAKKMEKLNNIFF
>seq8
SWDEFVDRSVQLFRADPESTRYVMKYRHCDGKLVLKVTDNKECLKFKTDQAQEAKKMEKLNNIFFTLM
>seq9
KNWEDFEIAAENMYMANPQNCRYTMKYVHSKGHILLKMSDNVKCVQYRAENMPDLKK
>seq10
FDSWDEFVSKSVELFRNHPDTTRYVVKYRHCEGKLVLKVTDNHECLKFKTDQAQDAKKMEK
like image 82
readyready15728 Avatar answered Apr 03 '26 18:04

readyready15728



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!