I have been working on a program which requires the counting of sub-strings (up to 4000 sub-strings of 2-6 characters located in a list) inside a main string (~400,000 characters). I understand this is similar to the question asked at Counting substrings in a string, however, this solution does not work for me. Since my sub-strings are DNA sequences, many of my sub-strings are repetitive instances of a single character (e.g. 'AA'); therefore, 'AAA' will be interpreted as a single instance of 'AA' rather than two instances if i split the string by 'AA'. My current solution is using nested loops, but I'm hoping there is a faster way as this code takes 5+ minutes for a single main string. Thanks in advance!
def getKmers(self, kmer):
self.kmer_dict = {}
kmer_tuples = list(product(['A', 'C', 'G', 'T'], repeat = kmer))
kmer_list = []
for x in range(len(kmer_tuples)):
new_kmer = ''
for y in range(kmer):
new_kmer += kmer_tuples[x][y]
kmer_list.append(new_kmer)
for x in range(len(kmer_list)):
self.kmer_dict[kmer_list[x]] = 0
for x in range(len(self.sequence)-kmer):
for substr in kmer_list:
if self.sequence[x:x+kmer] == substr:
self.kmer_dict[substr] += 1
break
return self.kmer_dict
For counting overlapping substrings of DNA, you can use Biopython:
>>> from Bio.Seq import Seq
>>> Seq('AAA').count_overlap('AA')
2
Disclaimer: I wrote this method, see commit 97709cc
.
However, if you're looking for really high performance, Python probably isn't the right language choice (although an extension like Cython could help).
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With