I found the code that parses the fasta frmated file. I need to count how many A, T, G and so on is in each sequence, for example:
>gi|7290019|gb|AAF45486.1| (AE003417) EG:BACR37P7.1 gene product [Drosophila melanogaster]
MRMRGRRLLPIIL
in this sequence theres:
M - 2
R - 4
G - 1
L - 3
I - 2
P - 1
The code is very simple:
def FASTA(filename):
try:
f = file(filename)
except IOError:
print "The file, %s, does not exist" % filename
return
order = []
sequences = {}
for line in f:
if line.startswith('>'):
name = line[1:].rstrip('\n')
name = name.replace('_', ' ')
order.append(name)
sequences[name] = ''
else:
sequences[name] += line.rstrip('\n').rstrip('*')
print "%d sequences found" % len(order)
return order, sequences
x, y = FASTA("drosoph_b.fasta")
but how can I count those amino acids? I dont want to use BioPython, I would like to know how to do this with, for example count...
As katrielalex points out, collections.Counter is a great fit for the task:
In [1]: from collections import Counter
In [2]: Counter('MRMRGRRLLPIIL')
Out[2]: Counter({'R': 4, 'L': 3, 'M': 2, 'I': 2, 'G': 1, 'P': 1})
You can apply it to the values of sequences dict in your code.
However, I'd recommend against using this code in real life. Libraries such as BioPython do it better. For example, the code you show will generate quite bulky data structures. As FASTA files are sometimes quite large, you can possibly run out of memory. Also, it doesn't handle possible exceptions in the best way possible.
Finally, using libraries just saves you time.
Example code with BioPython:
In [1]: from Bio import SeqIO
In [2]: from collections import Counter
In [3]: for entry in SeqIO.parse('1.fasta', 'fasta'):
...: print Counter(entry.seq)
...:
Counter({'R': 4, 'L': 3, 'I': 2, 'M': 2, 'G': 1, 'P': 1})
This can be obtained using very simple bash commands, my answer is below
cat input.fasta #my input file
>gi|7290019|gb|AAF45486.1| (AE003417) EG:BACR37P7.1 gene product [Drosophila melanogaster]
MRMRGRRLLPIIL
cat input.fasta | grep -v ">" | fold -w1 | sort | uniq -c
Output:
1 G
2 I
3 L
2 M
1 P
4 R
fold -w1 splits at every character, you sort them and count the unique ones
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With