I have two sets of strings (A and B), and I want to know all pairs of strings a in A and b in B where a is a substring of b.
The first step of coding this was the following:
for a in A:
    for b in B:
        if a in b:
            print (a, b)
However, I wanted to know: is there a more efficient way to do this with regular expressions (e.g. instead of checking if a in b, checking whether the regexp '.*' + a + '.*' matches b)? I thought that maybe using something like this would let me cache the Knuth-Morris-Pratt failure function for all a. Also, using a list comprehension for the inner for b in B: loop will likely give a pretty big speedup (and a nested list comprehension may be even better).
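Roughly the kind of thing I have in mind (a minimal sketch; the names compiled and pairs are just made up here, and I escape a rather than wrapping it in '.*' so regex metacharacters in a don't break the pattern):

import re

# Precompile one escaped pattern per a so the engine's preprocessing is
# reused across all b's, and use a list comprehension for the inner loop.
compiled = [(a, re.compile(re.escape(a))) for a in A]
pairs = [(a, b) for a, pat in compiled for b in B if pat.search(b)]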
I'm not very interested in making a giant leap in the asymptotic runtime of the algorithm (e.g. using a suffix tree or anything else complex and clever). I'm more concerned with the constant (I just need to do this for several pairs of A and B sets, and I don't want it to run all week).
Do you know any tricks or have any generic advice to do this more quickly? Thanks a lot for any insight you can share!
Edit:
Using the advice of @ninjagecko and @Sven Marnach, I built a quick prefix table of 10-mers:
import collections
prefix_table = collections.defaultdict(set)
for k, b in enumerate(B):
    # index every 10-mer of b under the index k of the string it came from
    for i in xrange(len(b) - 9):
        j = i + 10
        prefix_table[b[i:j]].add(k)

for a in A:
    if len(a) >= 10:
        for k in prefix_table[a[:10]]:
            # check if a is in B[k]
            # (the 10-mer match is necessary, but not sufficient)
            if a in B[k]:
                print (a, B[k])
    else:
        for k in xrange(len(B)):
            # a is too small to use the table; check if
            # a is in any b
            if a in B[k]:
                print (a, B[k])
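For example, wrapping the same idea in a small function and running it on made-up toy strings (everything here is purely illustrative, not my real data or names):

import collections

def substring_pairs(A, B, k=10):
    table = collections.defaultdict(set)
    for idx, b in enumerate(B):
        # index every k-mer of b by the position idx of the string it came from
        for i in range(len(b) - k + 1):
            table[b[i:i + k]].add(idx)
    pairs = []
    for a in A:
        # candidate b's share a's first k characters; short a's scan everything
        candidates = table[a[:k]] if len(a) >= k else range(len(B))
        pairs.extend((a, B[idx]) for idx in candidates if a in B[idx])
    return pairs

print(substring_pairs(["ACGTACGTACGT", "CAT"],
                      ["TTTACGTACGTACGTTTT", "GGCATGG"]))
# [('ACGTACGTACGT', 'TTTACGTACGTACGTTTT'), ('CAT', 'GGCATGG')]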
Of course you can easily write your original nested loop as a list comprehension:
[(a, b) for a in A for b in B if a in b]
This might slightly speed up the loop, but don't expect too much. I doubt using regular expressions will help in any way with this one.
Edit: Here are some timings:
import itertools
import timeit
import re
import collections

with open("/usr/share/dict/british-english") as f:
    A = [s.strip() for s in itertools.islice(f, 28000, 30000)]
    B = [s.strip() for s in itertools.islice(f, 23000, 25000)]

def f():
    result = []
    for a in A:
        for b in B:
            if a in b:
                result.append((a, b))
    return result

def g():
    return [(a, b) for a in A for b in B if a in b]

def h():
    res = [re.compile(re.escape(a)) for a in A]
    return [(a, b) for a in res for b in B if a.search(b)]

def ninjagecko():
    d = collections.defaultdict(set)
    for k, b in enumerate(B):
        for i, j in itertools.combinations(range(len(b) + 1), 2):
            d[b[i:j]].add(k)
    return [(a, B[k]) for a in A for k in d[a]]

print "Nested loop", timeit.repeat(f, number=1)
print "List comprehension", timeit.repeat(g, number=1)
print "Regular expressions", timeit.repeat(h, number=1)
print "ninjagecko", timeit.repeat(ninjagecko, number=1)
Results:
Nested loop [0.3641810417175293, 0.36279606819152832, 0.36295199394226074]
List comprehension [0.362030029296875, 0.36148500442504883, 0.36158299446105957]
Regular expressions [1.6498990058898926, 1.6494300365447998, 1.6480278968811035]
ninjagecko [0.06402897834777832, 0.063711881637573242, 0.06389307975769043]
Edit 2: Added a variant of the algorithm suggested by ninjagecko to the timings. You can see it is much better than all the brute force approaches.
Edit 3: Used sets instead of lists to eliminate the duplicates. (I did not update the timings -- they remained essentially unchanged.)
Let's assume your words are bounded at a reasonable size (let's say 10 letters). Do the following to achieve linear(!) time complexity, that is, O(A+B):

- Initialize a hashtable or trie
- For each string b in B: insert all its substrings of length at most 10 into the hashtable (that is at most 55 substrings per word, so 55*O(B) = O(B)), with metadata of which string it belonged to
- For each string a in A: do an O(1) query to your hashtable/trie to find all B-strings it is in, yield those (a rough sketch follows below)

(As of writing this answer, no response yet if OP's "words" are bounded. If they are unbounded, this solution still applies, but there is a dependency of O(maxwordsize^2), though actually it's nicer in practice since not all words are the same size, so it might be as nice as O(averagewordsize^2) with the right distribution. For example if all the words were of size 20, the problem size would grow by a factor of 4 more than if they were of size 10. But if sufficiently few words were increased from size 10->20, then the complexity wouldn't change much.)
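Here is a rough Python sketch of those steps (illustrative only; the function name and the choice of a plain dict-of-sets rather than a trie are mine, not part of the original answer):

import collections

def bounded_substring_pairs(A, B, maxlen=10):
    # Index every substring of length <= maxlen of each b, remembering
    # which b (by index k) it came from.
    index = collections.defaultdict(set)
    for k, b in enumerate(B):
        for i in range(len(b)):
            for j in range(i + 1, min(i + maxlen, len(b)) + 1):
                index[b[i:j]].add(k)
    # Each a is answered by one dictionary lookup. This assumes every a
    # is at most maxlen characters; a longer a would simply never be found.
    return [(a, B[k]) for a in A for k in index[a]]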
edit: https://stackoverflow.com/q/8289199/711085 is actually a theoretically better answer. I was looking at the linked Wikipedia page before that answer was posted, and was thinking "linear in the string size is not what you want", and only later realized it's exactly what you want. Your intuition to build a regexp (Aword1|Aword2|Aword3|...) is correct, since the finite automaton which is generated behind the scenes will perform matching quickly IF it supports simultaneous overlapping matches, which not all regexp engines do. Ultimately what you should use depends on whether you plan to reuse the As or Bs, or if this is just a one-time thing. The above technique is much easier to implement but only works if your words are bounded (and introduces a DoS vulnerability if you don't reject words above a certain size limit), but it may be what you are looking for if you don't want the Aho-Corasick string matching finite automaton or similar, or it is unavailable as a library.
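For completeness, a hedged sketch of that alternation idea with Python's re module (illustrative only; as noted above, re reports non-overlapping matches and only one alternative per starting position, so A-words that overlap inside the same b can be missed, which is exactly why a real Aho-Corasick implementation is preferable):

import re

# Longest alternatives first, so a shorter A-word cannot shadow a longer
# one that starts at the same position.
pattern = re.compile("|".join(re.escape(a) for a in sorted(A, key=len, reverse=True)))

for b in B:
    hits = {m.group(0) for m in pattern.finditer(b)}  # may miss overlapping A-words
    for a in hits:
        print (a, b)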