Substring search algorithms (very large haystack, small needle)

I know there are already several similar questions here, but I need some recommendations for my case (I couldn't find anything that quite matches it).

I have to search a very large amount of data for a substring about a billion times smaller than the haystack (10 bytes in 10 billion bytes). The haystack doesn't change, so I can tolerate a large precomputation step if needed. I just need the searching part to be as fast as possible.

I found algorithms such as Knuth–Morris–Pratt that take O(n + m) time (n = haystack size, m = needle size), while the naive search takes O(n·m) in the worst case. As m in this particular case is very small, is there any other algorithm I could look into?
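For concreteness, here is a minimal Python sketch of one small-needle algorithm I've looked at, Boyer–Moore–Horspool. Its only precomputation is a 256-entry skip table, and with a 10-byte needle it can shift up to 10 bytes per step on random data (a pure-Python loop would of course be far slower in practice than a C memmem; this is just to show the skipping idea):

```python
def horspool_search(haystack: bytes, needle: bytes) -> int:
    """Boyer-Moore-Horspool scan; returns index of first match, or -1."""
    n, m = len(haystack), len(needle)
    if m == 0:
        return 0
    if m > n:
        return -1
    # Bad-character table: how far the window may shift when its last
    # byte is c. Bytes absent from the needle allow a full m-byte skip.
    skip = [m] * 256
    for j in range(m - 1):
        skip[needle[j]] = m - 1 - j
    i = 0
    while i <= n - m:
        if haystack[i:i + m] == needle:
            return i
        i += skip[haystack[i + m - 1]]
    return -1
```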

Edit: Thanks everyone for your suggestions! Some more info: the data can be considered random bits, so I don't think any kind of indexing / sorting would be possible. The data to be searched can be anything, not English words or anything else predictable.

asked Aug 07 '10 by Dogbert

2 Answers

You are looking for the data structure called a trie, or "prefix tree". In short, this data structure encodes all the string prefixes that can be found in your corpus.

Here is a paper that searches DNA sequences for a small substring using a prefix tree. I imagine it might help you, since your case sounds similar.

If you know a definite limit on the length of the input search string, you can limit the growth of your trie so that it never stores prefixes longer than that maximum. This way, you may be able to fit a trie representing all 10 GB into less than 10 GB of memory. Especially for highly repetitive data, any sort of trie is a compressed data representation (or should be, if implemented sanely), and limiting the trie depth to the maximum needle length cuts memory consumption further still.
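To make the idea concrete, here is a minimal (and deliberately memory-naive) Python sketch of such a depth-limited trie. MAX_NEEDLE is an assumed bound on the needle length, and nested dicts stand in for whatever compact node representation a real 10 GB build would need:

```python
MAX_NEEDLE = 10  # assumed upper bound on needle length

def build_trie(haystack: bytes, max_len: int = MAX_NEEDLE) -> dict:
    """Insert every window haystack[i:i+max_len] into a trie of nested dicts.
    Build time is O(n * max_len); nodes shared between windows are reused."""
    root: dict = {}
    for i in range(len(haystack)):
        node = root
        for b in haystack[i:i + max_len]:
            node = node.setdefault(b, {})
    return root

def trie_contains(root: dict, needle: bytes) -> bool:
    """Walk the trie; O(len(needle)) regardless of haystack size."""
    node = root
    for b in needle:
        if b not in node:
            return False
        node = node[b]
    return True

# Usage sketch:
# trie = build_trie(huge_corpus)      # slow, one-time precomputation
# trie_contains(trie, b"0123456789")  # fast per-query lookup
```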

answered by Heath Hunnicutt


It's worth looking at suffix arrays and suffix trees. Both require precomputation and significant memory, but they beat inverted indexes in that they let you search for arbitrary substrings in O(m) time (for suffix trees) or O(m + log n) (for suffix arrays with longest-common-prefix information).
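As an illustration (not production code), here is a Python sketch of the suffix-array approach. The construction below is a naive O(n² log n) sort, so a real 10 GB haystack would need a linear-time builder such as SA-IS plus disk-backed storage, and the search shown is the simpler O(m log n) binary search without the LCP refinement:

```python
def build_suffix_array(s: bytes) -> list[int]:
    # Naive construction for illustration only: sort all suffix start
    # positions lexicographically by the suffix they begin.
    return sorted(range(len(s)), key=lambda i: s[i:])

def sa_search(s: bytes, sa: list[int], needle: bytes) -> int:
    """Binary-search the sorted suffixes; returns a match index or -1."""
    m = len(needle)
    lo, hi = 0, len(sa)
    while lo < hi:  # find leftmost suffix whose m-byte prefix >= needle
        mid = (lo + hi) // 2
        if s[sa[mid]:sa[mid] + m] < needle:
            lo = mid + 1
        else:
            hi = mid
    if lo < len(sa) and s[sa[lo]:sa[lo] + m] == needle:
        return sa[lo]
    return -1
```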

If you have a lot of time on your hands, you can look into compressed suffix arrays and succinct CSAs: compressed versions of your data that are also self-indexing (i.e. the data is the index; there is no external index). This is really the best of all worlds, because not only do you have a compressed version of your data (you can throw the original away), it's also indexed! The hard part is understanding the research papers and translating them into code :)

If you do not need exact substring matching, but rather general searching capabilities, check out Lucene.

answered by Chris Schmich