I have sets of hashes (first 64 bits of MD5, so they're distributed very randomly) and I want to be able to see if a new hash is in a set, and to add it to a set.
The sets aren't too big (the largest will have millions of elements), but there are hundreds of them, so I cannot hold them all in memory.
Some ideas I had so far:
Am I missing something really obvious? Any hints on how to implement a good disk-based hash table?
Simply put, using a hash table is faster than searching through an array. In the Find the First Non-Repeating Character algorithm challenge, we use hash tables as an optimal solution compared to nested for loops, which is a reduction in complexity from O(n*n) to O(n).
I did it with a very simple hashtable stored on disk and accessed via mmap() - 32 entries per bucket, doubling file size on overflow of any bucket - it's just unbelievably faster than sqlite, even though I implemented it in Perl, which really isn't meant for stuff like that.
Storing an open hash table on disk in an efficient way is difficult, because members of a given linked list might be stored on different disk blocks. This would result in multiple disk accesses when searching for a particular key value, which defeats the purpose of using hashing.
On the most memory-efficient data structure for associations: the hash table with the best memory efficiency is simply the one with the highest load factor (it can even exceed 100% memory efficiency by using key compression with compact hashing). A hash table like that still provides O(1) lookups, just very slow ones.
Here's the solution I eventually used:
It's just unbelievably faster than sqlite, even though it's low-level Perl code, and Perl really isn't meant for high-performance databases. It will not work with anything that's less uniformly distributed than MD5; it assumes everything is extremely uniform to keep the implementation simple.
I tried it with seek()/sysread()/syswrite() at first, and it was very slow; the mmap() version is a lot faster.
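The answer's Perl code isn't shown here, so the following is only a minimal Python sketch of the same idea: a file of fixed-size buckets (32 slots of 8 bytes each) accessed through mmap(), with the file doubled whenever a bucket overflows. The class and function names (`DiskHashSet`, `md5_64`), picking the bucket with a modulo, reserving the value 0 as the "empty slot" marker, and growing by rebuilding the whole file are all assumptions made for illustration, not details taken from the original implementation.

```python
import hashlib
import mmap
import os
import struct

SLOTS = 32                      # entries per bucket, as described in the answer
SLOT_SIZE = 8                   # one 64-bit hash per slot
BUCKET_SIZE = SLOTS * SLOT_SIZE
EMPTY = 0                       # assumption: 0 is reserved to mean "empty slot"


def md5_64(data: bytes) -> int:
    """First 64 bits of the MD5 digest, as an unsigned integer."""
    return int.from_bytes(hashlib.md5(data).digest()[:8], "big")


class DiskHashSet:
    """Hypothetical bucketed hash set stored in an mmap()ed file."""

    def __init__(self, path, buckets=1024):
        self.path = path
        if not os.path.exists(path) or os.path.getsize(path) == 0:
            with open(path, "wb") as f:
                f.truncate(buckets * BUCKET_SIZE)   # zero-filled file
        self._open()

    def _open(self):
        self.f = open(self.path, "r+b")
        self.mm = mmap.mmap(self.f.fileno(), 0)     # map the whole file
        self.buckets = len(self.mm) // BUCKET_SIZE

    def close(self):
        self.mm.flush()
        self.mm.close()
        self.f.close()

    def _base(self, h):
        # Only reasonable because the keys are uniformly distributed.
        return (h % self.buckets) * BUCKET_SIZE

    def __contains__(self, h):
        base = self._base(h)
        for i in range(SLOTS):
            (v,) = struct.unpack_from("<Q", self.mm, base + i * SLOT_SIZE)
            if v == h:
                return True
            if v == EMPTY:      # slots fill left to right and are never deleted
                return False
        return False

    def add(self, h):
        if h == EMPTY:
            raise ValueError("0 is reserved as the empty-slot marker")
        while True:
            base = self._base(h)
            for i in range(SLOTS):
                off = base + i * SLOT_SIZE
                (v,) = struct.unpack_from("<Q", self.mm, off)
                if v == h:
                    return                  # already present
                if v == EMPTY:
                    struct.pack_into("<Q", self.mm, off, h)
                    return
            self._grow()                    # bucket full: double and retry

    def _grow(self):
        # Simplest possible growth: copy all keys out, recreate the file
        # at twice the size, and reinsert everything.
        keys = [v for (v,) in struct.iter_unpack("<Q", self.mm[:]) if v != EMPTY]
        new_buckets = self.buckets * 2
        self.close()
        with open(self.path, "wb") as f:
            f.truncate(new_buckets * BUCKET_SIZE)
        self._open()
        for k in keys:
            self.add(k)
```

Usage would look like `s = DiskHashSet("set.bin"); s.add(md5_64(b"foo")); md5_64(b"foo") in s`. A lookup touches exactly one 256-byte bucket, which is why the approach depends on keys being as uniform as MD5 output: a skewed distribution would overflow one bucket and trigger repeated doubling.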
I had some trouble picturing your exact problem/need, but it still got me thinking about Git and how it stores SHA1-references on disk:
Take the hexadecimal string representation of a given hash, say, "abfab0da6f4ebc23cb15e04ff500ed54". Chop off the first two characters of the hash ("ab", in our case) and use them as a directory name. Then use the rest ("fab0da6f4ebc23cb15e04ff500ed54") as the file name, create the file, and put stuff in it.
This way, you get pretty decent on-disk performance (depending on your FS, naturally) with automatic indexing. Additionally, you get direct access to any known hash just by wedging a directory delimiter after the first two characters ("./ab/fab0da[..]").
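As a rough illustration of that layout, here is a small Python sketch. The helper names (`shard_path`, `store`, `exists`) and the root directory argument are hypothetical; the only thing taken from the description above is splitting the hex string after its first two characters.

```python
from pathlib import Path


def shard_path(root, hex_hash):
    """Map a hex hash to <root>/<first two chars>/<remaining chars>."""
    return Path(root) / hex_hash[:2] / hex_hash[2:]


def store(root, hex_hash, payload=b""):
    """Create the file for this hash, making the two-char directory if needed."""
    p = shard_path(root, hex_hash)
    p.parent.mkdir(parents=True, exist_ok=True)
    p.write_bytes(payload)


def exists(root, hex_hash):
    """Membership test: a single file-system lookup."""
    return shard_path(root, hex_hash).is_file()
```

For a pure membership set the payload can stay empty, since the existence of the file is the whole answer. Whether one file per hash holds up with millions of entries per set depends heavily on the file system.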
I'm sorry if I missed the ball entirely, but with any luck, this might give you an idea.