Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Reading a fasta file format into python dictionary

Tags:

python

I have a file with the following format:

>seq1
ATGGGTGTGTGTGTG
>seq2
ATGTGTTTGTGTGCTCCTCCTC
>seq3
AACGTCGTGACGGGTGCGTGGTGTGTGTCCAA

I want to read this file as a dictionary in Python. I am aware of BIO-python functions but I want to learn scripting in python in addition to getting my job done. I have tried this code so far:

import sys
sequence = ' '
fasta = {}
with open(sys.argv[1]) as file_one:
    file_one_content = file_one.read()
    for line in file_one_content.split("\n"):
        if not line.strip():
            continue
        if line.startswith(">"):
            sequence_name = line.rstrip('\n').replace(">", "")
        else:
            sequence = line.rstrip('\n')
        if sequence_name not in fasta:
            fasta[sequence_name] = []
        fasta[sequence_name].append(sequence)
print fasta

I get the following output:

{'seq3': ['ATGTGTTTGTGTGCTCCTCCTC', 'AACGTCGTGACGGGTGCGTGGTGTGTGTCCAA'], 'seq2': ['ATGGGTGTGTGTGTG', 'ATGTGTTTGTGTGCTCCTCCTC'], 'seq1': [' ', 'ATGGGTGTGTGTGTG']}

My expected output file is:

{'seq3': ['AACGTCGTGACGGGTGCGTGGTGTGTGTCCAA'], 'seq2': ['ATGTGTTTGTGTGCTCCTCCTC'], 'seq1': [ 'ATGGGTGTGTGTGTG']}

I have been trying to understand why the dictionary is printed in a wrong way but I can't find the mistake. As I want to learn, it would be great if you could let me know how I could correct the mistake in my code. Thanks.

like image 389
Homap Avatar asked Dec 11 '22 00:12

Homap


2 Answers

Using biopython SeqIO and a dict comprehension:

from Bio import SeqIO
seq_dict = {rec.id : rec.seq for rec in SeqIO.parse("myfile.fasta", "fasta")}
like image 84
Wouter De Coster Avatar answered Jan 16 '23 15:01

Wouter De Coster


Fixed. There was a problem with your if/else logic.

import sys
fasta = {}
with open(sys.argv[1]) as file_one:
    for line in file_one:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            active_sequence_name = line[1:]
            if active_sequence_name not in fasta:
                fasta[active_sequence_name] = []
            continue
        sequence = line
        fasta[active_sequence_name].append(sequence)

print fasta


python fasta.py input.txt {'seq3': ['AACGTCGTGACGGGTGCGTGGTGTGTGTCCAA'], 'seq2' ['ATGTGTTTGTGTGCTCCTCCTC'], 'seq1': ['ATGGGTGTGTGTGTG']}
like image 22
Łukasz Rogalski Avatar answered Jan 16 '23 16:01

Łukasz Rogalski