I think my code is too inefficient. I'm guessing it has something to do with using strings, though I'm unsure. Here is the code:
genome = FASTAdata[1]
genomeLength = len(genome)
# Hash table holding all the k-mers we will come across
kmers = dict()
# We go through all the possible k-mers by index
for outer in range(0, genomeLength - 1):
    for inner in range(outer + 2, outer + 22):
        substring = genome[outer:inner]
        if substring in kmers:
            # if we already have this substring on record, increase its count of appearances by 1
            kmers[substring] += 1
        else:
            # otherwise record that it's here once
            kmers[substring] = 1
This is meant to search through all substrings of length at most 20. The code seems to take forever and never terminates, so something has to be wrong here. Is using [:] slicing on strings causing the huge overhead? And if so, what can I replace it with?
For context, the file in question is nearly 200 MB, so pretty big.
I would recommend a dynamic programming approach. The problem is that for every inner string that is not found, you go on to search for it again with extra characters appended, and of course those longer strings will not be found either. I do not have a specific algorithm in mind, but this is certainly a case for dynamic programming, where you remember what you have already searched for. As a rough example: remember all substrings of length 1, 2, 3, ... that are not found, and never extend those positions in later iterations, where the strings only get longer.
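Here is a minimal sketch of that length-by-length pruning idea (my own illustration, not something from the original question): assume you only care about k-mers that occur at least some threshold min_count times, a parameter I am introducing here. Since a longer substring can never occur more often than the shorter substring it starts with, any position whose current k-mer is already below the threshold never needs to be extended.

from collections import Counter

def count_kmers_pruned(genome, max_k=20, min_count=2):
    # Count k-mers length by length, pruning start positions whose current
    # k-mer is already too rare: extending it cannot make it more frequent.
    counts_by_length = {}
    active_starts = range(len(genome))  # every start position is a candidate at first
    for k in range(1, max_k + 1):
        counts = Counter(genome[i:i + k] for i in active_starts if i + k <= len(genome))
        counts_by_length[k] = counts
        # Only keep positions whose k-mer met the threshold; the rest are never extended.
        active_starts = [i for i in active_starts
                         if i + k <= len(genome) and counts[genome[i:i + k]] >= min_count]
    return counts_by_length

Calling count_kmers_pruned(genome) gives you one Counter per length, e.g. counts_by_length[20] for the 20-mers that survived the pruning. For a 200 MB genome the counts for the shorter lengths will still use a lot of memory, so treat this as a sketch of the pruning idea rather than a drop-in solution.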