Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

whats the fastest string collection structure/algorithm for startswith and/or contains searches

I have the following situation: I have a big collection of strings (lets say 250.000+) of average length of maybe 30. What I have to do is to do many searches within these .. mostly those will be of StartsWith and Contains kind.

The collection is static at runtime. Which means the initial reading and filling of the collection of choice is done only once .. therefore the performance of building the datastructure is absolutely not important. Memory is also not a problem: which also means that I don't mind having two collections with the same data in each if needed (like one for the startswith and another for contains). Only thing that matters is performance of the searches which should return all elements which match the searchcondition.

For startswith I came upon a Trie or Radix-tree .. but maybe there are even better choices?

For contains .. I have no good idea yet at all (beside running a linq query on a list which wont be very fast with that amount of data).

Thanks in advance everyone!

update: I forgot an important part: with Contains i mean no exact matches in the collection .. but i want to find all strings in the collection which contain the given searchstring

like image 347
Mikk Avatar asked Mar 03 '13 22:03

Mikk


1 Answers

Building a suffix tree will allow you to do a substring search on all your strings in parallel in O(1). The pedantic in me can't help but note that it's really O(n + m) where n is the number of strings that match your substring and m is the size of the substring being queried.

So what's a suffix tree you ask? In its most basic implementation, it's a trie with a fancier insert method: in addition to adding a string, it also adds every possible suffix of that string to the trie. On this data structure, a substring search becomes a prefix search of all possible suffixes. Since you also want to do prefix searches, you'll want to add a special character in front of each inserted string and the query substrings. The special character will allow you to differentiate between a suffix and a full string.

While this implementation of a suffix tree is remarkably simple, it's also very inefficient (O(n^2) space and build time). Fortunately, there are other more efficient implementations which can greatly reduce the space and time bounds. One of these, Ukkonen's algorithm, is very well explained in this SO answer and brings the space bound down to O(n). You may also want to look into suffix arrays which are an equivalent but more efficient representation of suffix trees.

While I know there are many many more implementations of suffix trees out there (one of which would probably hit the sweet spot for your use case) I just don't know them. I recommend you do some research on the subject before you settle on an implementation.

like image 182
Ze Blob Avatar answered Oct 20 '22 03:10

Ze Blob