I have a file with the following format:
>seq1
ATGGGTGTGTGTGTG
>seq2
ATGTGTTTGTGTGCTCCTCCTC
>seq3
AACGTCGTGACGGGTGCGTGGTGTGTGTCCAA
I want to read this file as a dictionary in Python. I am aware of BIO-python functions but I want to learn scripting in python in addition to getting my job done. I have tried this code so far:
import sys
sequence = ' '
fasta = {}
with open(sys.argv[1]) as file_one:
file_one_content = file_one.read()
for line in file_one_content.split("\n"):
if not line.strip():
continue
if line.startswith(">"):
sequence_name = line.rstrip('\n').replace(">", "")
else:
sequence = line.rstrip('\n')
if sequence_name not in fasta:
fasta[sequence_name] = []
fasta[sequence_name].append(sequence)
print fasta
I get the following output:
{'seq3': ['ATGTGTTTGTGTGCTCCTCCTC', 'AACGTCGTGACGGGTGCGTGGTGTGTGTCCAA'], 'seq2': ['ATGGGTGTGTGTGTG', 'ATGTGTTTGTGTGCTCCTCCTC'], 'seq1': [' ', 'ATGGGTGTGTGTGTG']}
My expected output file is:
{'seq3': ['AACGTCGTGACGGGTGCGTGGTGTGTGTCCAA'], 'seq2': ['ATGTGTTTGTGTGCTCCTCCTC'], 'seq1': [ 'ATGGGTGTGTGTGTG']}
I have been trying to understand why the dictionary is printed in a wrong way but I can't find the mistake. As I want to learn, it would be great if you could let me know how I could correct the mistake in my code. Thanks.
Using biopython SeqIO and a dict comprehension:
from Bio import SeqIO
seq_dict = {rec.id : rec.seq for rec in SeqIO.parse("myfile.fasta", "fasta")}
Fixed. There was a problem with your if/else logic.
import sys
fasta = {}
with open(sys.argv[1]) as file_one:
for line in file_one:
line = line.strip()
if not line:
continue
if line.startswith(">"):
active_sequence_name = line[1:]
if active_sequence_name not in fasta:
fasta[active_sequence_name] = []
continue
sequence = line
fasta[active_sequence_name].append(sequence)
print fasta
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With