Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Efficient way to check existence in a large set of strings

I have a set of 100+ million strings, each up to 63 characters long. I have lots of disk space and very little memory (512 MB). I need to query for existence alone, and store no additional metadata.

My de facto solution is a BDB btree. Are there any preferable alternatives? I'm aware of leveldb and Kyoto Cabinet, but not familiar enough to identify advantages.

like image 880
Drew Sears Avatar asked Nov 15 '12 20:11

Drew Sears


2 Answers

If false positives are acceptable, then one possible solution would be to use a bloom filter. Bloom filters are similar to hash tables, but instead of using one hash value to index a table of buckets, it uses multiple hashes to index a bit array. The bits corresponding to those indices are set. Then, to test if a string is in the filter, the string is hashed again, and if the corresponding indices are set, then the string is "in" the filter.

It doesn't store any information about the strings, so it uses very little memory -- but if there's a collision between two strings, no collision resolution is possible. This means there may be false positives (because a string not in the filter may hash to the same indices as a string that is in the filter). However, there can be no false negatives; any string that really is in the set will be found in the bloom filter.

There are a few Python implementations. It's also not hard to roll your own; I recall coding up a quick-and-dirty bloom filter myself once using bitarrays that performed pretty well.

like image 152
senderle Avatar answered Sep 20 '22 03:09

senderle


You said you have lots of disk, huh? One option would be to store the strings as filenames in nested subdirectories. You could either use the strings directly:

  • Store "drew sears" in d/r/e/w/ sears

or by taking a hash of the string and following a similar process:

  • MD5('drew sears') = 'f010fe6e20d12ed895c10b93b2f81c6e'
  • Create an empty file named f0/10/fe/6e/20d12ed895c10b93b2f81c6e.

Think of it as an OS-optimized, hash-table based indexed NoSQL database.

Side benefits:

  • You could change your mind at a later time and store data in the file.
  • You can replicate your database to another system using rsync.
like image 44
Kirk Strauser Avatar answered Sep 21 '22 03:09

Kirk Strauser