Fast filtering of a string collection by substring?

Question

Do you know of a method for quickly filtering a list of strings to obtain the subset that contain a specified string? The obvious implementation is to just iterate through the list, checking each string for whether it contains the search string. Is there a way to index the string list so that the search can be done faster?

Mark Rushakoff · Accepted Answer

Wikipedia article lists a few ways to index substrings. You've got:

Suffix tree

Suffix array

N-gram index, an inverted file for all N-grams of the text

Compressed suffix array1

FM-index

LZ-index

Guffa · Answer

Yes, you could for example create an index for all character combinations in the strings. A string like "hello" would be added in the indexes for "he", "el", "ll" and "lo". To search for the string "hell" you would get the index of all strings that exist in all of the "he", "el" and "ll" indexes, then loop through those to check for the actual content in the strings.

Craig Gidney · Answer

If you can preprocess the collection then you can do a lot of different things.

For example, you could build a trie including all your strings' suffixes, then use that to do very fast matching.

ire_and_curses · Answer

If you're going to be repeatedly searching the same text, then a suffix tree is probably worthwhile. If carefully applied, you can achieve linear time processing for most string problems. If not, then in practice you won't be able to do much better than Rabin-Karp, which is based on hashing, and is linear in expected time.

There are many freely available implementations of suffix trees. See for example, this C implementation, or for Java, check out the Biojava framework.

Fast filtering of a string collection by substring?

Tags:

string

substring

algorithm

search

mackenir

4 Answers

Mark Rushakoff

Guffa

Craig Gidney

ire_and_curses

Recent Activity

Donate For Us

Fast filtering of a string collection by substring?

Tags:

string

substring

algorithm

search

mackenir

4 Answers

Mark Rushakoff

Guffa

Craig Gidney

ire_and_curses

Related questions

Recent Activity

Donate For Us