I have a set of 100+ million strings, each up to 63 characters long. I have lots of disk space and very little memory (512 MB). I need to query for existence alone, and store no additional metadata.
My de facto solution is a BDB B-tree. Are there any preferable alternatives? I'm aware of LevelDB and Kyoto Cabinet, but I'm not familiar enough with either to weigh their advantages.
If false positives are acceptable, one possible solution is a Bloom filter. Bloom filters are similar to hash tables, but instead of using one hash value to index a table of buckets, they use several hashes to index a bit array, and the bits at those indices are set. To test whether a string is in the filter, the string is hashed the same way, and if all of the corresponding bits are set, the string is "in" the filter.
A Bloom filter doesn't store any information about the strings themselves, so it uses very little memory -- but if two strings collide, no collision resolution is possible. This means there can be false positives (a string not in the filter may hash to the same indices as a string that is). However, there can be no false negatives: any string that really is in the set will be found in the filter.
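To get a feel for the memory cost, the standard sizing formula is m = -n * ln(p) / (ln 2)^2 bits with k = (m/n) * ln 2 hash functions. For n = 100 million strings and a 1% false-positive rate, that works out to roughly 958 million bits (about 120 MB) and about 7 hash functions, which fits comfortably in 512 MB of RAM.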
There are a few Python implementations. It's also not hard to roll your own; I recall coding up a quick-and-dirty Bloom filter myself once using bitarrays, and it performed pretty well.
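Here is a minimal sketch of that kind of bitarray-backed filter, assuming the third-party bitarray package; the BloomFilter class name, the SHA-256 double-hashing scheme, and the sizing numbers are illustrative choices, not any particular library's API:

    import hashlib
    from bitarray import bitarray  # third-party package: pip install bitarray

    class BloomFilter:
        def __init__(self, num_bits, num_hashes):
            self.num_bits = num_bits
            self.num_hashes = num_hashes
            self.bits = bitarray(num_bits)
            self.bits.setall(False)

        def _indices(self, s):
            # Derive num_hashes indices from one SHA-256 digest via double hashing.
            digest = hashlib.sha256(s.encode("utf-8")).digest()
            h1 = int.from_bytes(digest[:8], "big")
            h2 = int.from_bytes(digest[8:16], "big") | 1  # keep it odd
            return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

        def add(self, s):
            for i in self._indices(s):
                self.bits[i] = True

        def __contains__(self, s):
            return all(self.bits[i] for i in self._indices(s))

    # ~958 million bits (~120 MB) and 7 hashes target roughly a 1% false-positive
    # rate for 100 million strings.
    bf = BloomFilter(num_bits=958_000_000, num_hashes=7)
    bf.add("example string")
    print("example string" in bf)    # True
    print("something else" in bf)    # almost certainly False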
You said you have lots of disk, huh? One option would be to store the strings as filenames in nested subdirectories. You could either use the strings directly:
d/r/e/w/ sears
or take a hash of the string and follow a similar process:
f0/10/fe/6e/20d12ed895c10b93b2f81c6e
Think of it as an OS-optimized, hash-table-based NoSQL database.
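As a rough sketch of the hashed-path variant (the ROOT location and the MD5 / two-hex-character split are assumptions, chosen to mirror the layout above):

    import hashlib
    import os

    ROOT = "/mnt/stringset"  # assumption: a directory on the big disk

    def path_for(s):
        # Split the hex digest into 2-character directories, as in
        # f0/10/fe/6e/20d12ed895c10b93b2f81c6e above.
        h = hashlib.md5(s.encode("utf-8")).hexdigest()
        return os.path.join(ROOT, h[0:2], h[2:4], h[4:6], h[6:8], h[8:])

    def add(s):
        p = path_for(s)
        os.makedirs(os.path.dirname(p), exist_ok=True)
        open(p, "a").close()  # an empty file; its presence is the record

    def contains(s):
        return os.path.exists(path_for(s))

Each lookup is then a single os.path.exists() call, and the filesystem's own directory caching does the indexing work.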
Side benefits: