Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Frequencies not adding up to one

I am writing a function that is supposed to go through a .fasta file of DNA sequences and create a dictionary of nucleotide (nt) and dinucleotide (dnt) frequencies for each sequence in the file. I am then storing each dictionary in a list called "frequency". This is the piece of code that is acting strange:

for fasta in seq_file:
    freq = {}
    dna = str(fasta.seq)
    for base1 in ['A', 'T', 'G', 'C']:
        onefreq = float(dna.count(base1)) / len(dna)
        freq[base1] = onefreq
        for base2 in ['A', 'T', 'G', 'C']:
            dinucleotide = base1 + base2
            twofreq = float(dna.count(dinucleotide)) / (len(dna) - 1) 
            freq[dinucleotide] = twofreq
    frequency.append(freq)

(I am using biopython, by the way, so that I do not have to commit the entire fasta file to memory. Also this is for ssDNA, so I don't have to account for anti-sense dnt)

The frequencies that are recorded for the single nt add to 1.0, but the frequencies for the dnt do not add to 1.0. Which is od since the method of calculating the two types of frequencies are identical in my eyes.

I left the diagnostic print-statements and the "check" variables in:

for fasta in seq_file:
    freq = {}
    dna = str(fasta.seq)
    check = 0.0
    check2 = 0.0
    for base1 in ['A', 'T', 'G', 'C']:
        onefreq = float(dna.count(base1)) / len(dna)
        freq[base1] = onefreq
        check2 += onefreq
        for base2 in ['A', 'T', 'G', 'C']:
            dinucleotide = base1 + base2
            twofreq = float(dna.count(dinucleotide)) / (len(dna) - 1) 
            check += twofreq
            print(twofreq)
            freq[dinucleotide] = twofreq
    print("\n")
    print(check, check2)
    print(len(dna))
    print("\n")
    frequency.append(freq)

to get this output: (for only one sequence)

0.0894168466523 
0.0760259179266
0.0946004319654
0.0561555075594
0.0431965442765
0.0423326133909
0.0747300215983
0.0488120950324
0.0976241900648
0.0483801295896
0.0539956803456
0.0423326133909
0.0863930885529
0.0419006479482
0.0190064794816
0.031101511879


(0.9460043196544274, 1.0)
2316

Here we can see the frequency of each of the 16 different dnt possible, the sum of all dnt frequencies (0.946) and the sum of all nt frequencies (1.0) and the length of the sequence.

Why does the dnt frequency not add up to 1.0?

Thanks for your help. I'm very new to python, and this is my first question, so I hope that this submissions is acceptable.

like image 594
Bantha Avatar asked May 27 '15 16:05

Bantha


Video Answer


2 Answers

your problem, try with following fasta:

>test
AAAAAA
"AAAAAA".count("AA")

you get:

3

It should be

5

reason

from documentation: count return the number of (non-overlapping) occurrences of substring sub in string s[start:end]

solution using Counter and chunk function

from Bio import SeqIO

def chunks(l, n):
  for i in xrange(0, len(l)-(n-1)):
    yield l[i:i+n]

from collections import Counter

frequency = []
input_file = "test.fasta"
for fasta in SeqIO.parse(open(input_file), "fasta"):
  dna = str(fasta.seq)
  freq = Counter(dna)   #get counter of single bases
  freq.update(Counter(chunks(dna,2))) #update with counter of dinucleotide
  frequency.append(freq)

for "AAAAAA" you get:

Counter({'A': 6, 'AA': 5})
like image 69
Jose Ricardo Bustos M. Avatar answered Sep 28 '22 07:09

Jose Ricardo Bustos M.


str.count() doesnt count overlapping motif it find.

Exemple:

If you have 'AAAA' in your sequence and you look for the dinucleotide 'AA', you expect than 'AAAA'.count('AA') return you 3, but it will return 2. So:

print float('AAAA'.count('AA')) / (len('AAAA') - 1)
0.666666

instead of 1

You can just change the line where you count the frequency by:

twofreq = len([i for i in range(len(dna)-1) if dna[i:i+2] == dinucleotide]) / float((len(dna) - 1))
like image 33
Data_addict Avatar answered Sep 28 '22 09:09

Data_addict