Assume I have a set of strings S and a query string q. I want to know if any member of S is a substring of q. (For the purpose of this question substring includes equality, e.g. "foo" is a substring of "foo".) For example assume the function that does what I want is called anySubstring
:
S = ["foo", "baz"]
q = "foobar"
assert anySubstring(S, q) # "foo" is a substring of "foobar"
S = ["waldo", "baz"]
assert not anySubstring(S, q)
Is there any easy-to-implement algorithm to do this with time complexity sublinear in len(S)
? It's ok if S has to be processed into some clever data structure first because I will be querying each S with a lot of q strings, so the amortized cost of this preprocessing might be reasonable.
EDIT: To clarify, I don't care which member of S is a substring of q, only whether at least one is. In other words, I only care about a boolean answer.
I think Aho-Corasick algorithm does what you want. I think there is another solution which is very simple to implement, it's Karp-Rabin algorithm.
So if the length of S is way less then the sum of the lengths of the potential substrings your best option would be to build a suffix tree from S and then do a search in it. This is linear with respect to the length of S plus the summar length of the candidate substrings. Of course there can not be an algorithm with better complexity as you have to pass through all the input at least. If the case is opposite i.e. the length of s is more then the summar length of the substrings your best option would be aho-corasick.
Hope this helps.
Create a regular expression .*(S1|S2|...|Sn).*
and construct its minimal DFA.
Run your query string through the DFA.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With