How to iterate over every [:2] overlapping characters in a string of DNA code?

Let's say I have a string of DNA 'GAAGGAGCGGCGCCCAAGCTGAGATAGCGGCTAGAGGCGGGTAACCGGCA'

Consider the first 5 letters: GAAGG

I want to map each overlapping bi-gram 'GA', 'AA', 'AG', 'GG' to a number that corresponds to its likelihood of occurrence, and sum those numbers. For example, 'GA' = 1, 'AA' = 2, 'AG' = .7, 'GG' = .5. So for GAAGG I would have sumAnswer = 1 + 2 + .7 + .5 = 4.2.

So in pseudocode, I want to:
- iterate over each overlapping bi-gram in my DNA string
- find the value corresponding to each bi-gram
- add each value to a running sum

I'm not entirely sure how to iterate over each pair. I thought a for loop would work, but it doesn't account for the overlap: it prints every non-overlapping pair (GAGC = GA, GC), not every overlapping pair (GAGC = GA, AG, GC).

for i in range(0, len(input), 2):
    # the step of 2 skips every other starting position, so pairs never overlap
    print(input[i:i+2])

Any tips?

bambo222 asked Dec 04 '22 08:12

1 Answer

Forget playing with range and index arithmetic; iterating over pairs is exactly what zip is for:

>>> dna = 'GAAGG'
>>> for bigram in zip(dna, dna[1:]):
...    print(bigram)
... 
('G', 'A')
('A', 'A')
('A', 'G')
('G', 'G')
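
For comparison, the loop from the question can also be repaired by stepping one character at a time and stopping one position short of the end. A minimal sketch, where the helper name overlapping_bigrams is made up for illustration and is not part of the question:

```python
def overlapping_bigrams(dna):
    """Return every overlapping 2-character slice of dna, in order."""
    # Stop at len(dna) - 1 so the final slice is still two characters long.
    return [dna[i:i + 2] for i in range(len(dna) - 1)]

print(overlapping_bigrams('GAGC'))  # ['GA', 'AG', 'GC']
```

This yields strings rather than tuples, at the cost of the index arithmetic that zip avoids.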

If you have the corresponding likelihoods stored in a dictionary, like so:

likelihood = {
   'GA': 1, 
   'AA': 2,
   'AG': .7, 
   'GG': .5
}

then you can sum them quite easily with the unsurprisingly named sum:

>>> sum(likelihood[''.join(bigram)] for bigram in zip(dna,dna[1:]))
4.2
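
Note that the dictionary lookup above raises KeyError as soon as the sequence contains a bigram that has no entry. One possible way to package the sum as a function, sketched under the assumption that unknown bigrams should contribute 0 (the name bigram_score and the default parameter are illustrative, not from the question):

```python
def bigram_score(dna, likelihood, default=0.0):
    """Sum the likelihood of every overlapping bigram in dna.

    Bigrams missing from the likelihood dict contribute `default`
    (an assumption; raise an error instead if that suits your data better).
    """
    return sum(likelihood.get(a + b, default) for a, b in zip(dna, dna[1:]))

likelihood = {'GA': 1, 'AA': 2, 'AG': .7, 'GG': .5}
print(bigram_score('GAAGG', likelihood))  # 4.2
print(bigram_score('GAAGC', likelihood))  # 3.7, because 'GC' has no entry
```

Unpacking each pair as a, b and concatenating with a + b does the same job as ''.join(bigram).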
lvc answered Dec 29 '22 18:12