Let's say I have a string of DNA 'GAAGGAGCGGCGCCCAAGCTGAGATAGCGGCTAGAGGCGGGTAACCGGCA'
Consider the first 5 letters: GAAGG
I want to replace each overlapping bi-gram 'GA','AA','AG','GG' with a number that corresponds to its likelihood of occurrence, and then sum those numbers. For example, 'GA' = 1, 'AA' = 2, 'AG' = .7, 'GG' = .5. So for GAAGG I would have sumAnswer = 1 + 2 + .7 + .5 = 4.2.
So in pseudocode, I want to:
- iterate over each overlapping bi-gram in my DNA string
- find the value corresponding to each bi-gram
- sum the values as I go
I'm not entirely sure how to iterate over each pair. I thought a for loop would work, but it doesn't account for the overlap: it prints every non-overlapping 2-pair (GAGC = GA,GC), not every overlapping 2-pair (GAGC = GA,AG,GC).
for i in range(0, len(input), 2):
    print(input[i:i+2])
Any tips?
Forget playing with range and index arithmetic. Iterating over pairs is exactly what zip is for:
>>> dna = 'GAAGG'
>>> for bigram in zip(dna, dna[1:]):
...     print(bigram)
...
('G', 'A')
('A', 'A')
('A', 'G')
('G', 'G')
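For comparison, the loop from the question can also be repaired directly: the step of 2 is what skips the overlap, so stepping by 1 and stopping one character before the end gives every overlapping pair. This is only a sketch; the function name overlapping_bigrams is made up for illustration.

```python
def overlapping_bigrams(dna):
    """Return every overlapping 2-character slice of dna, in order.

    overlapping_bigrams is a hypothetical helper name, not part of any library.
    """
    # Step by 1 (the default) instead of 2, and stop at len(dna) - 1
    # so that the last slice is still two characters long.
    return [dna[i:i + 2] for i in range(len(dna) - 1)]


print(overlapping_bigrams('GAGC'))
```

The zip version does the same thing without any index arithmetic, which is why it is usually the easier one to get right.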
If you have the corresponding likelihoods stored in a dictionary, like so:
likelihood = {
    'GA': 1,
    'AA': 2,
    'AG': .7,
    'GG': .5
}
then you can sum them quite easily with the unsurprisingly named sum:
>>> sum(likelihood[''.join(bigram)] for bigram in zip(dna,dna[1:]))
4.2
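One caveat: likelihood[...] raises a KeyError for any bigram that is not in the dictionary, and the full DNA string from the question contains bigrams such as 'GC' that the four-entry example table does not cover. A minimal sketch that tolerates missing entries could look like the following. The function name bigram_score and the choice of 0.0 as the fallback value are assumptions made for illustration; a complete table of all 16 bigrams would make the fallback unnecessary.

```python
def bigram_score(dna, likelihood, default=0.0):
    """Sum the likelihoods of every overlapping bigram in dna.

    bigram_score and the default of 0.0 are illustrative assumptions:
    bigrams missing from the table contribute `default` to the total
    instead of raising a KeyError.
    """
    return sum(likelihood.get(a + b, default) for a, b in zip(dna, dna[1:]))


likelihood = {'GA': 1, 'AA': 2, 'AG': .7, 'GG': .5}

print(bigram_score('GAAGG', likelihood))  # same sum as above
print(bigram_score('GAGC', likelihood))   # 'GC' is missing, so it adds 0.0
```

Unpacking each pair as a, b and concatenating with a + b does the same job as ''.join(bigram) in the one-liner above.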