I have a problem that I am looking for guidance on solving in the most efficient way. I have 200 million strings of data ranging in size from 3 to 70 characters. The strings consist of letters, numbers, and several special characters such as dashes and underscores. I need to be able to quickly search for the entire string or for any substring within a string (minimum substring size is 3). "Quickly" is defined here as less than 1 second.
As my first cut at this I did the following:
Created 38 index files. An index contains all the substrings that start with a particular letter. The first 4mb contains 1 million hash buckets (start of the hash chains). The rest of the index contains the linked list chains from the hash buckets. My hashing is very evenly distributed. The 1 million hash buckets are kept in RAM and mirrored to disk.
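For concreteness, the layout described above might look roughly like this in C++ (the field names and record shape are my guesses, not the actual implementation):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Sketch of the index layout as described: a 4 MB header of 1 million
// 32-bit bucket heads (each a file offset to the first chain record,
// 0 = empty bucket), followed by the linked-list chain records.
constexpr std::size_t kNumBuckets = 1000000;  // 1M * 4 bytes = ~4 MB header

struct ChainRecord {     // one node in a hash chain (guessed shape)
    uint32_t next;       // file offset of the next record, 0 = end of chain
    uint32_t key_id;     // id of the original string this substring came from
    // the substring bytes would follow the record in the file
};

struct Index {
    // bucket heads are kept in RAM and mirrored to the first 4 MB on disk
    std::vector<uint32_t> heads = std::vector<uint32_t>(kNumBuckets, 0);
};
```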
When a string is added to the index it is broken down into its non-duplicate (within itself) substrings of 3 to n characters (where n is the length of the string minus 1). So, for example, "apples" is stored in the "A" index as app, appl, and apple (its other substrings are stored in the "P" and "L" indexes).
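The decomposition step, as I understand it, can be sketched like this (a minimal illustration, not the poster's actual code):

```cpp
#include <cassert>
#include <set>
#include <string>

// Enumerate the non-duplicate substrings of s with lengths 3 through
// s.size()-1, matching the scheme described above; a std::set removes
// duplicates within the string itself.
std::set<std::string> proper_substrings(const std::string& s) {
    std::set<std::string> out;
    for (std::size_t len = 3; len < s.size(); ++len)
        for (std::size_t pos = 0; pos + len <= s.size(); ++pos)
            out.insert(s.substr(pos, len));
    return out;
}
```

For "apples" this yields nine unique substrings (app, appl, apple, ppl, pple, pples, ple, ples, les), each routed to the index named after its first letter.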
The search/add server runs as a daemon (in C++) and works like a champ. Typical search times are less than 1/2 second.
The problem is on the front end of the process. I typically add 30,000 keys at a time. This part of the process takes forever. By way of benchmark, the load time into an empty index of 180,000 variable length keys is approximately 3 1/2 hours.
This scheme works except for the very long load times.
Before I go nuts optimizing (or trying to), I'm wondering whether there is a better way to solve this problem. Front and back wildcard searches (i.e. WHERE string LIKE '%ppl%' in a DBMS) are amazingly slow (on the order of hours in MySQL, for example) for datasets this large. So it would seem that DBMS solutions are out of the question. I can't use full-text search because we are not dealing with normal words, but with strings that may or may not be composed of real words.
From your description, the loading of data takes all that time because you're dealing with I/O, mirroring the inflated strings to hard disk. This will definitely be a bottleneck, mainly depending on the way you read and write data to the disk.
A possible improvement in execution time may be achieved using mmap with some LRU policy. I'm quite sure the idea of replicating data is to make the search faster, but since you're using -- as it seems -- only one machine, your bottleneck will shift from memory searching to I/O requests.
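A minimal sketch of what I mean, assuming POSIX mmap and an illustrative file path; the real code would need error handling and an eviction policy on top:

```cpp
#include <cassert>
#include <cstdint>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

// Map a region of an index file instead of issuing read()/write() calls,
// letting the kernel's page cache absorb the mirroring I/O.
uint32_t* map_bucket_heads(const char* path, std::size_t bytes) {
    int fd = open(path, O_RDWR);
    if (fd < 0) return nullptr;
    void* p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                       // the mapping stays valid after close
    if (p == MAP_FAILED) return nullptr;
    madvise(p, bytes, MADV_RANDOM);  // hash probes are random-access
    return static_cast<uint32_t*>(p);
}
// Stores through the returned pointer land in the file; the kernel flushes
// dirty pages lazily, and msync() can force a flush when durability matters.
```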
Another solution, which you may not be interested in -- it's funny and disturbing at the same time (: -- is to split the data among multiple machines. Considering the way you've structured the data, the implementation itself may take a bit of time, but it would be very straightforward. You'd have:
hash_id(bucket) % num_machines
Another good point is that, as you said, the data is evenly distributed -- ALREADY \o/ -- and this is usually one of the trickiest parts of a distributed implementation. Besides, this would be highly scalable, as you can add another machine whenever the data grows in size.
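The routing itself is a one-liner on top of the existing hash; here is a sketch with FNV-1a standing in for whatever hash the index already uses (hash_id and shard_for are illustrative names, not the poster's code):

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// FNV-1a as a stand-in for the poster's existing, evenly distributed hash.
uint64_t hash_id(const std::string& key) {
    uint64_t h = 1469598103934665603ull;
    for (unsigned char c : key) { h ^= c; h *= 1099511628211ull; }
    return h;
}

// Route a key to one of num_machines shards: hash_id(bucket) % num_machines.
int shard_for(const std::string& key, int num_machines) {
    return static_cast<int>(hash_id(key) % num_machines);
}
```

One caveat: with a plain modulus, adding a machine remaps most keys; consistent hashing would reduce that churn, at the cost of a slightly more involved implementation.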
Instead of doing everything in one pass, solve the problem in 38 passes.
Read each of the 180,000 strings. Find the "A"s in each string, and write entries only to the "A" hash table. When you are done, write the entire finished "A" hash table out to disk. (This assumes you have enough RAM to hold the entire "A" table in memory; if you don't, make smaller tables. E.g., key the tables on pairs of starting letters, giving 38^2 = 1444 different tables. You could even vary how many letters a table is keyed on based on how common the prefix is, so the tables are all of modest size; keeping track of the lengths of such prefixes isn't expensive.)
Then read each of the 180,000 strings, looking for "B". Etc.
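A sketch of one such pass, under the assumption that a single pass fits in RAM (the names are illustrative; the real on-disk record format would be the chain layout described in the question):

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

// One pass of the 38-pass load: scan every input string, keep only the
// substrings starting with this pass's letter, and build that single table
// entirely in RAM before it is written out.
using Table = std::unordered_map<std::string, std::vector<int>>;

Table build_pass(const std::vector<std::string>& keys, char letter) {
    Table t;
    for (int id = 0; id < static_cast<int>(keys.size()); ++id) {
        const std::string& s = keys[id];
        for (std::size_t len = 3; len < s.size(); ++len)
            for (std::size_t pos = 0; pos + len <= s.size(); ++pos)
                if (s[pos] == letter)   // only this pass's index
                    t[s.substr(pos, len)].push_back(id);
    }
    return t;  // caller writes the table to disk, frees it, starts next pass
}
```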
My theory is that you are going slower than you could because your massive tables are thrashing your cache.
The next thing that might help is to limit the length of the strings you hash, in order to shrink the size of your tables.
Instead of generating all 2346 substrings of length 3 to 70 from a string of length 70, if you capped the hashed length at 10 characters you would have only 516 substrings of length 3 to 10. And there may not be many collisions on strings longer than 10 characters. You could, again, make the hash length dynamic -- the length-X hash might carry a flag saying "if your string is longer than X, this entry is too common; try a length X+Y hash", and otherwise simply stop hashing. That could reduce the amount of data in your tables, at the cost of slower lookups in some cases.
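Capping the indexed length might look like this (a sketch; 10 is just the example cap, and a query longer than the cap would need a verification step against the candidate strings to confirm the full match):

```cpp
#include <algorithm>
#include <cassert>
#include <set>
#include <string>

// Generate substrings of length 3 up to max_len instead of up to the full
// string length; longer queries probe their first max_len characters and
// then verify candidates against the original strings.
std::set<std::string> capped_substrings(const std::string& s,
                                        std::size_t max_len = 10) {
    std::set<std::string> out;
    const std::size_t top = std::min(max_len, s.size());
    for (std::size_t len = 3; len <= top; ++len)
        for (std::size_t pos = 0; pos + len <= s.size(); ++pos)
            out.insert(s.substr(pos, len));
    return out;
}
```

For a 70-character string with no repeated substrings this is 516 entries instead of 2346, roughly a 4.5x reduction in table size.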