Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Fastest way to count instances of substrings in string Python3.6

I have been working on a program which requires the counting of sub-strings (up to 4000 sub-strings of 2-6 characters located in a list) inside a main string (~400,000 characters). I understand this is similar to the question asked at Counting substrings in a string, however, this solution does not work for me. Since my sub-strings are DNA sequences, many of my sub-strings are repetitive instances of a single character (e.g. 'AA'); therefore, 'AAA' will be interpreted as a single instance of 'AA' rather than two instances if i split the string by 'AA'. My current solution is using nested loops, but I'm hoping there is a faster way as this code takes 5+ minutes for a single main string. Thanks in advance!

def getKmers(self, kmer):
    self.kmer_dict = {}
    kmer_tuples = list(product(['A', 'C', 'G', 'T'], repeat = kmer))
    kmer_list = []
    for x in range(len(kmer_tuples)):
        new_kmer = ''
        for y in range(kmer):
            new_kmer += kmer_tuples[x][y]
        kmer_list.append(new_kmer)
    for x in range(len(kmer_list)):
        self.kmer_dict[kmer_list[x]] = 0
    for x in range(len(self.sequence)-kmer):
        for substr in kmer_list:
            if self.sequence[x:x+kmer] == substr:
                self.kmer_dict[substr] += 1
                break
    return self.kmer_dict
like image 600
DanStu Avatar asked Jan 02 '23 09:01

DanStu


1 Answers

For counting overlapping substrings of DNA, you can use Biopython:

>>> from Bio.Seq import Seq
>>> Seq('AAA').count_overlap('AA')
2

Disclaimer: I wrote this method, see commit 97709cc.

However, if you're looking for really high performance, Python probably isn't the right language choice (although an extension like Cython could help).

like image 196
Chris_Rands Avatar answered Jan 04 '23 23:01

Chris_Rands