I have a file of short strings, which I have loaded in a list short
(there are 1.5 million short strings of length 150). I want to find the number of these short strings that are present in a longer string (of length ~ 5 million) which is seq
in the code. I use the following obvious implementation. However, this seems to take a long time (around a day) to run.
count1=count2=0
for line in short:
count1+=1
if line in seq:
count2+=1
print str(count2) + ' of ' + str(count1) + ' strings are in long string.'
Is there a way I can do this more efficiently?
If the short
strings are a constant length (you indicated they were 150 long), you can preprocess the long string to extract all the short strings, then just do set lookups (which are constant time in expectation):
shortlen = 150
shortset = set()
for i in xrange(len(seq)-shortlen+1):
shortset.add(seq[i:i+shortlen])
for line in short:
count1 += 1
if line in shortset:
count2 += 1
The running time for this is probably going to be dominated by the preprocessing step (because it inserts nearly 5M strings of length 150 each), but that should still be faster than 1.5M searches in a 5M character string.
Do profiling, and try different options. You will not get around iterating through your sequence of "test" strings, so for line in short
is something you will most probably keep. The test if line in seq
is I think quite efficiently implemented in CPython, but I think this is not optimized for searching a small needle in a laaaaarge haystack. Your requirements are a bit extreme, and I guess that it is exactly this test takes that quite a while and is the bottleneck of your code. You might want to try, just as a comparison, the regex
module for searching the needle in the haystack.
Edit:
A rudimentary benchmark (no repetitions, no scaling behavior investigated, no profile module used), for comparing the methods discussed in this thread:
import string
import random
import time
def genstring(N):
return ''.join(random.choice(string.ascii_uppercase) for _ in xrange(N))
t0 = time.time()
length_longstring = 10**6
length_shortstring = 7
nbr_shortstrings = 3*10**6
shortstrings = [genstring(length_shortstring) for _ in xrange(nbr_shortstrings)]
longstring = genstring(length_longstring)
duration = time.time() - t0
print "Setup duration: %.1f s" % duration
def method_1():
count1 = 0
count2 = 0
for ss in shortstrings:
count1 += 1
if ss in longstring:
count2 += 1
print str(count2) + ' of ' + str(count1) + ' strings are in long string.'
#t0 = time.time()
#method_1()
#duration = time.time() - t0
#print "M1 duration: %.1f s" % duration
def method_2():
shortset = set()
for i in xrange(len(longstring)-length_shortstring+1):
shortset.add(longstring[i:i+length_shortstring])
count1 = 0
count2 = 0
for ss in shortstrings:
count1 += 1
if ss in shortset:
count2 += 1
print str(count2) + ' of ' + str(count1) + ' strings are in long string.'
t0 = time.time()
method_2()
duration = time.time() - t0
print "M2 duration: %.1f s" % duration
def method_3():
shortset = set(
longstring[i:i+length_shortstring] for i in xrange(
len(longstring)-length_shortstring+1))
count1 = len(shortstrings)
count2 = sum(1 for ss in shortstrings if ss in shortset)
print str(count2) + ' of ' + str(count1) + ' strings are in long string.'
t0 = time.time()
method_3()
duration = time.time() - t0
print "M3 duration: %.1f s" % duration
Output:
$ python test.py
Setup duration: 23.3 s
364 of 3000000 strings are in long string.
M2 duration: 1.4 s
364 of 3000000 strings are in long string.
M3 duration: 1.2 s
(This is Python 2.7.3 on Linux, on a E5-2650 0 @ 2.00GHz)
There is a slight difference between the method proposed by nneonneo and the improvements suggested by chepner. Under these conditions, it is already no fun to execute the original code. Under a little less extreme conditions we can make a comparison among all three methods:
length_longstring = 10**6
length_shortstring = 5
nbr_shortstrings = 10**5
->
$ python test1.py
Setup duration: 1.4 s
8121 of 100000 strings are in long string.
M1 duration: 95.0 s
8121 of 100000 strings are in long string.
M2 duration: 0.4 s
8121 of 100000 strings are in long string.
M3 duration: 0.4 s
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With