Do you know of a method for quickly filtering a list of strings to obtain the subset that contain a specified string? The obvious implementation is to just iterate through the list, checking each string for whether it contains the search string. Is there a way to index the string list so that the search can be done faster?
Wikipedia article lists a few ways to index substrings. You've got:
- Suffix tree
- Suffix array
- N-gram index, an inverted file for all N-grams of the text
- Compressed suffix array1
- FM-index
- LZ-index
Yes, you could for example create an index for all character combinations in the strings. A string like "hello" would be added in the indexes for "he", "el", "ll" and "lo". To search for the string "hell" you would get the index of all strings that exist in all of the "he", "el" and "ll" indexes, then loop through those to check for the actual content in the strings.
If you can preprocess the collection then you can do a lot of different things.
For example, you could build a trie including all your strings' suffixes, then use that to do very fast matching.
If you're going to be repeatedly searching the same text, then a suffix tree is probably worthwhile. If carefully applied, you can achieve linear time processing for most string problems. If not, then in practice you won't be able to do much better than Rabin-Karp, which is based on hashing, and is linear in expected time.
There are many freely available implementations of suffix trees. See for example, this C implementation, or for Java, check out the Biojava framework.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With