I am trying to work through this bioinformatics problem: https://stepic.org/lesson/An-Explosion-of-Hidden-Messages-4/step/1?course=Bioinformatics-Algorithms-2&unit=8
The specific question is in the 5th window of the link above, and the question is: How many different 9-mers form (500,3)-clumps in the E. coli genome? (In other words, do not count a 9-mer more than once.)
My code is below. It is wrong, and I would love an explanation for why, and how I can improve it (obviously the O efficiency is terrible, but I started coding Python a few days ago...) Thanks so much!
genome = '' #insert e. Coli genome here
k = 4 #length of k-mer
L = 50 #size of sliding window
t = 3 #k-mer appears t times
counter = 0
Count = []
for i in range(0,len(genome)-L): #slide window down the genome
pattern = genome[i:i+k] #given this k-mer
for j in range(i,i+L): #calculate k-mer frequency in window of len(L)
if genome[j:j+k] == pattern:
counter = counter + 1
Count.append(counter)
counter = 0 #IMPORTANT: reset counter after each i
Clump = []
for i in range(0,len(Count)):
if Count[i] == t: #figure out the window that has k-mers of frequency t
Clump.append(i)
Output = []
for i in range(0,len(Clump)):
Output.append(genome[Clump[i]:Clump[i]+k])
print " ".join(list(set(Output))) #remove duplicates if a particular k-mer is found more than once
print len(Output)
print len(list(set(Output))) #total number of Clump(k,L,t)
Interesting problem. I've put up an implementation with a few tests on github here. Read on for some explanation.
ben@nixbox:~/bin$ time python kmers.py ../E-coli.txt 9 500 3
(500, 3)-clumps of 9-mers found in that file: 1904
real 0m15.510s
user 0m14.241s
sys 0m0.956s
This issue here (as is common in big data) really comes down to choosing the right data structure, and making some time/space tradeoffs. If you choose correctly, you can do this in time linear to the length of your genome and space linear to the length of your sliding window. But I'm getting ahead of myself. Let's explain the problem visually (mostly so I can understand it :-)).
There is a (20,3)-clump of 3-mers in this window: "CAT". There are some other ones too ("AAA" for one), but this example illustrates what k, L, and t are doing.
Now, on to the algorithm. Let's reduce the problem even further so we can visualize how we'd parse and store it: let's look at a simple (5,3)-clump of 3-mers.
Brackets denote our sliding window of width 5 here. We can see that in our window our 3-mers break down to ATA
, TAA
, and AAA
. When we slide the window one to the right, ATA
drops out and we gain a second AAA
. When we slide the window to the right another time, now TAA
drops out and we gain a third AAA
- and we've found a (5,3) clump of AAA
s.
Trivial, obviously, but useful for figuring out how we'd treat larger clumps - importantly, we don't discard the entire previous window's data when we shift our window; we just discard the first k-mer and append the new one at the end of our window. The next insight is that we can use a hash-backed structure (in python, dict
s) to do counting of k-mers within our window. This removes the need to linearly search through our data structure to figure out how many of a particular k-mer are in there.
So together those two requirements - remembering order of insertion and a hash-backed data structure - means that we should make a custom class that maintains a list
- or better, deque
- of each kmer that you have in your window, and a dict
- or better, Counter
- to keep track of the frequency of each kmer in your deque. Note that an OrderedDict
comes close to doing all of this for you, but not quite; popping off the oldest kmer would be wrong if its count is greater than 1.
Another thing that you really should be using to simplify your code here is a proper sliding window iterator.
Putting it all together:
def get_clumps(genome, k, L, t):
kmers = KmerSequence(L-k, t)
for kmer in sliding_window(genome, k):
kmers.add(kmer)
return kmers.clumps
class KmerSequence(object):
__slots__ = ['order', 'counts', 'limit', 'clumps', 't']
def __init__(self, limit, threshold):
self.order = deque()
self.counts = Counter()
self.limit = limit
self.clumps = set()
self.t = threshold
def add(self, kmer):
if len(self.order) > self.limit:
self._remove_oldest()
self._add_one(kmer)
def _add_one(self,kmer):
self.order.append(kmer)
new_count = self.counts[kmer] + 1
self.counts[kmer] = new_count
if new_count == self.t:
self.clumps.add(kmer)
def _remove_oldest(self):
self.counts[self.order.popleft()] -= 1
Usage:
with open(genomefile) as f:
genome = f.read()
k = 9
L = 500
t = 3
clumps = get_clumps(genome, k,L,t)
As mentioned at the top, the full code - which includes some accessory functions and handling for when the script is run directly as __main__
- is on github here. Feel free to fork, etc.
just another implementation
def findClumps(genome, k, L, t):
length = len(genome) - k + 1
indexes = defaultdict(list)
result = set()
for i in range(length):
kterm = genome[i:i + k]
# remove positions with distance > L
while indexes[kterm] and i + k - indexes[kterm][0] > L:
indexes[kterm].pop(0)
# add current kterm position
indexes[kterm].append(i)
if len(indexes[kterm]) == t:
result.add(kterm)
return result
filename = 'E-coli.txt'
file = open(filename)
genome = file.read()
k=9
L=500
t=3
def test():
print findClumps(genome, k, L, t)
import cProfile
cProfile.run('test()')
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With