Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Efficient way to find if a short string is present in a longer string in python

I have a file of short strings, which I have loaded in a list short (there are 1.5 million short strings of length 150). I want to find the number of these short strings that are present in a longer string (of length ~ 5 million) which is seq in the code. I use the following obvious implementation. However, this seems to take a long time (around a day) to run.

count1=count2=0
for line in short:
    count1+=1
    if line in seq:
            count2+=1
print str(count2) + ' of ' + str(count1) + ' strings are in long string.'

Is there a way I can do this more efficiently?

like image 485
Devil Avatar asked Jun 10 '14 16:06

Devil


2 Answers

If the short strings are a constant length (you indicated they were 150 long), you can preprocess the long string to extract all the short strings, then just do set lookups (which are constant time in expectation):

shortlen = 150
shortset = set()
for i in xrange(len(seq)-shortlen+1):
    shortset.add(seq[i:i+shortlen])

for line in short:
    count1 += 1
    if line in shortset:
        count2 += 1

The running time for this is probably going to be dominated by the preprocessing step (because it inserts nearly 5M strings of length 150 each), but that should still be faster than 1.5M searches in a 5M character string.

like image 194
nneonneo Avatar answered Oct 16 '22 07:10

nneonneo


Do profiling, and try different options. You will not get around iterating through your sequence of "test" strings, so for line in short is something you will most probably keep. The test if line in seq is I think quite efficiently implemented in CPython, but I think this is not optimized for searching a small needle in a laaaaarge haystack. Your requirements are a bit extreme, and I guess that it is exactly this test takes that quite a while and is the bottleneck of your code. You might want to try, just as a comparison, the regex module for searching the needle in the haystack.

Edit:

A rudimentary benchmark (no repetitions, no scaling behavior investigated, no profile module used), for comparing the methods discussed in this thread:

import string
import random
import time


def genstring(N):
    return ''.join(random.choice(string.ascii_uppercase) for _ in xrange(N))


t0 = time.time()
length_longstring = 10**6
length_shortstring = 7
nbr_shortstrings = 3*10**6
shortstrings = [genstring(length_shortstring) for _ in xrange(nbr_shortstrings)]
longstring = genstring(length_longstring)
duration = time.time() - t0
print "Setup duration: %.1f s" % duration


def method_1():
    count1 = 0
    count2 = 0
    for ss in shortstrings:
        count1 += 1
        if ss in longstring:
            count2 += 1
    print str(count2) + ' of ' + str(count1) + ' strings are in long string.'


#t0 = time.time()
#method_1()
#duration = time.time() - t0
#print "M1 duration: %.1f s" % duration


def method_2():
    shortset = set()
    for i in xrange(len(longstring)-length_shortstring+1):
        shortset.add(longstring[i:i+length_shortstring])
    count1 = 0
    count2 = 0
    for ss in shortstrings:
        count1 += 1
        if ss in shortset:
            count2 += 1
    print str(count2) + ' of ' + str(count1) + ' strings are in long string.'


t0 = time.time()
method_2()
duration = time.time() - t0
print "M2 duration: %.1f s" % duration


def method_3():
    shortset = set(
        longstring[i:i+length_shortstring] for i in xrange(
            len(longstring)-length_shortstring+1))
    count1 = len(shortstrings)
    count2 = sum(1 for ss in shortstrings if ss in shortset)
    print str(count2) + ' of ' + str(count1) + ' strings are in long string.'


t0 = time.time()
method_3()
duration = time.time() - t0
print "M3 duration: %.1f s" % duration

Output:

$ python test.py 
Setup duration: 23.3 s
364 of 3000000 strings are in long string.
M2 duration: 1.4 s
364 of 3000000 strings are in long string.
M3 duration: 1.2 s

(This is Python 2.7.3 on Linux, on a E5-2650 0 @ 2.00GHz)

There is a slight difference between the method proposed by nneonneo and the improvements suggested by chepner. Under these conditions, it is already no fun to execute the original code. Under a little less extreme conditions we can make a comparison among all three methods:

length_longstring = 10**6
length_shortstring = 5
nbr_shortstrings = 10**5

->

$ python test1.py 
Setup duration: 1.4 s
8121 of 100000 strings are in long string.
M1 duration: 95.0 s
8121 of 100000 strings are in long string.
M2 duration: 0.4 s
8121 of 100000 strings are in long string.
M3 duration: 0.4 s
like image 33
Dr. Jan-Philip Gehrcke Avatar answered Oct 16 '22 08:10

Dr. Jan-Philip Gehrcke