Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Fast filtering of a string collection by substring?

Do you know of a method for quickly filtering a list of strings to obtain the subset that contain a specified string? The obvious implementation is to just iterate through the list, checking each string for whether it contains the search string. Is there a way to index the string list so that the search can be done faster?

like image 472
mackenir Avatar asked Aug 19 '09 11:08

mackenir


4 Answers

Wikipedia article lists a few ways to index substrings. You've got:

  • Suffix tree
  • Suffix array
  • N-gram index, an inverted file for all N-grams of the text
  • Compressed suffix array1
  • FM-index
  • LZ-index
like image 183
Mark Rushakoff Avatar answered Nov 04 '22 06:11

Mark Rushakoff


Yes, you could for example create an index for all character combinations in the strings. A string like "hello" would be added in the indexes for "he", "el", "ll" and "lo". To search for the string "hell" you would get the index of all strings that exist in all of the "he", "el" and "ll" indexes, then loop through those to check for the actual content in the strings.

like image 22
Guffa Avatar answered Nov 04 '22 04:11

Guffa


If you can preprocess the collection then you can do a lot of different things.

For example, you could build a trie including all your strings' suffixes, then use that to do very fast matching.

like image 33
Craig Gidney Avatar answered Nov 04 '22 05:11

Craig Gidney


If you're going to be repeatedly searching the same text, then a suffix tree is probably worthwhile. If carefully applied, you can achieve linear time processing for most string problems. If not, then in practice you won't be able to do much better than Rabin-Karp, which is based on hashing, and is linear in expected time.

There are many freely available implementations of suffix trees. See for example, this C implementation, or for Java, check out the Biojava framework.

like image 1
ire_and_curses Avatar answered Nov 04 '22 06:11

ire_and_curses