Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

String manipulation appears to be inefficient

Tags:

python

I think my code is too inefficient. I'm guessing it has something to do with using strings, though I'm unsure. Here is the code:

genome = FASTAdata[1]
genomeLength = len(genome);

# Hash table holding all the k-mers we will come across
kmers = dict()

# We go through all the possible k-mers by index
for outer in range (0, genomeLength-1):
    for inner in range (outer+2, outer+22):
        substring = genome[outer:inner]
        if substring in kmers:              # if we already have this substring on record, increase its value (count of num of appearances) by 1
            kmers[substring] += 1
        else:
            kmers[substring] = 1            # otherwise record that it's here once

This is to search through all substrings of length at most 20. Now this code seems to take pretty forever and never terminate, so something has to be wrong here. Is using [:] on strings causing the huge overhead? And if so, what can I replace it with?

And for clarity the file in question is nearly 200mb, so pretty big.

like image 898
user2964780 Avatar asked Nov 07 '13 12:11

user2964780


1 Answers

I would recommend using a dynamic programming algorithm. The problem is that for all inner strings that are not found, you are re-searching those again with extra characters appended onto them, so of course those will also not be found. I do not have a specific algorithm in mind, but this is certainly a case for dynamic programming where you remember what you have already searched for. As a really crummy example, remember all substrings of length 1,2,3,... that are not found, and never extend those bases in the next iteration where the strings are only longer.

like image 92
Tommy Avatar answered Oct 17 '22 03:10

Tommy