I am writing a function that is supposed to go through a .fasta file of DNA sequences and create a dictionary of nucleotide (nt) and dinucleotide (dnt) frequencies for each sequence in the file. I am then storing each dictionary in a list called "frequency". This is the piece of code that is acting strange:
for fasta in seq_file:
freq = {}
dna = str(fasta.seq)
for base1 in ['A', 'T', 'G', 'C']:
onefreq = float(dna.count(base1)) / len(dna)
freq[base1] = onefreq
for base2 in ['A', 'T', 'G', 'C']:
dinucleotide = base1 + base2
twofreq = float(dna.count(dinucleotide)) / (len(dna) - 1)
freq[dinucleotide] = twofreq
frequency.append(freq)
(I am using biopython, by the way, so that I do not have to commit the entire fasta file to memory. Also this is for ssDNA, so I don't have to account for anti-sense dnt)
The frequencies that are recorded for the single nt add to 1.0, but the frequencies for the dnt do not add to 1.0. Which is od since the method of calculating the two types of frequencies are identical in my eyes.
I left the diagnostic print-statements and the "check" variables in:
for fasta in seq_file:
freq = {}
dna = str(fasta.seq)
check = 0.0
check2 = 0.0
for base1 in ['A', 'T', 'G', 'C']:
onefreq = float(dna.count(base1)) / len(dna)
freq[base1] = onefreq
check2 += onefreq
for base2 in ['A', 'T', 'G', 'C']:
dinucleotide = base1 + base2
twofreq = float(dna.count(dinucleotide)) / (len(dna) - 1)
check += twofreq
print(twofreq)
freq[dinucleotide] = twofreq
print("\n")
print(check, check2)
print(len(dna))
print("\n")
frequency.append(freq)
to get this output: (for only one sequence)
0.0894168466523
0.0760259179266
0.0946004319654
0.0561555075594
0.0431965442765
0.0423326133909
0.0747300215983
0.0488120950324
0.0976241900648
0.0483801295896
0.0539956803456
0.0423326133909
0.0863930885529
0.0419006479482
0.0190064794816
0.031101511879
(0.9460043196544274, 1.0)
2316
Here we can see the frequency of each of the 16 different dnt possible, the sum of all dnt frequencies (0.946) and the sum of all nt frequencies (1.0) and the length of the sequence.
Why does the dnt frequency not add up to 1.0?
Thanks for your help. I'm very new to python, and this is my first question, so I hope that this submissions is acceptable.
your problem, try with following fasta:
>test AAAAAA
"AAAAAA".count("AA")
you get:
3
It should be
5
reason
from documentation: count return the number of (non-overlapping) occurrences of substring sub in string s[start:end]
solution using Counter and chunk function
from Bio import SeqIO
def chunks(l, n):
for i in xrange(0, len(l)-(n-1)):
yield l[i:i+n]
from collections import Counter
frequency = []
input_file = "test.fasta"
for fasta in SeqIO.parse(open(input_file), "fasta"):
dna = str(fasta.seq)
freq = Counter(dna) #get counter of single bases
freq.update(Counter(chunks(dna,2))) #update with counter of dinucleotide
frequency.append(freq)
for "AAAAAA" you get:
Counter({'A': 6, 'AA': 5})
str.count() doesnt count overlapping motif it find.
Exemple:
If you have 'AAAA' in your sequence and you look for the dinucleotide 'AA', you expect than 'AAAA'.count('AA') return you 3, but it will return 2. So:
print float('AAAA'.count('AA')) / (len('AAAA') - 1)
0.666666
instead of 1
You can just change the line where you count the frequency by:
twofreq = len([i for i in range(len(dna)-1) if dna[i:i+2] == dinucleotide]) / float((len(dna) - 1))
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With