Fastest way to count instances of substrings in string Python3.6

Question

I have been working on a program which requires the counting of sub-strings (up to 4000 sub-strings of 2-6 characters located in a list) inside a main string (~400,000 characters). I understand this is similar to the question asked at Counting substrings in a string, however, this solution does not work for me. Since my sub-strings are DNA sequences, many of my sub-strings are repetitive instances of a single character (e.g. 'AA'); therefore, 'AAA' will be interpreted as a single instance of 'AA' rather than two instances if i split the string by 'AA'. My current solution is using nested loops, but I'm hoping there is a faster way as this code takes 5+ minutes for a single main string. Thanks in advance!

def getKmers(self, kmer):
    self.kmer_dict = {}
    kmer_tuples = list(product(['A', 'C', 'G', 'T'], repeat = kmer))
    kmer_list = []
    for x in range(len(kmer_tuples)):
        new_kmer = ''
        for y in range(kmer):
            new_kmer += kmer_tuples[x][y]
        kmer_list.append(new_kmer)
    for x in range(len(kmer_list)):
        self.kmer_dict[kmer_list[x]] = 0
    for x in range(len(self.sequence)-kmer):
        for substr in kmer_list:
            if self.sequence[x:x+kmer] == substr:
                self.kmer_dict[substr] += 1
                break
    return self.kmer_dict

Chris_Rands · Accepted Answer

For counting overlapping substrings of DNA, you can use Biopython:

>>> from Bio.Seq import Seq
>>> Seq('AAA').count_overlap('AA')
2

Disclaimer: I wrote this method, see commit 97709cc.

However, if you're looking for really high performance, Python probably isn't the right language choice (although an extension like Cython could help).

Fastest way to count instances of substrings in string Python3.6

Tags:

performance

python

string

bioinformatics

biopython

DanStu

1 Answers

Chris_Rands

Recent Activity

Donate For Us

Fastest way to count instances of substrings in string Python3.6

Tags:

performance

python

string

bioinformatics

biopython

DanStu

1 Answers

Chris_Rands

Related questions

Recent Activity

Donate For Us