I have sets of hashes (first 64 bits of MD5, so they're distributed very randomly) and I want to be able to see if a new hash is in a set, and to add it to a set. Sets aren't too big, the largest will be millions of elements, but there are hundreds of sets, so I cannot hold them all in memory. Some ideas I had so far: <ul> <li>I tried just keeping it all in sqlite table, but it becomes really really slow once it cannot fit everything in memory.</li> <li>Bloom filters sound like they would have very high error rate. I don't mind tiny error rate (64 bit hash gives 1 collision on 4G element set already), but error rates like 1% are a lot too high.</li> <li>Keep sorted list of hashes with gaps in a file, and resize when I don't have enough gaps. Hashes are uniformly distributed, so even very simple scheme like that should work.</li> </ul> Am I missing something really obvious? Any hints how to implement good disk-based hashtable?

Here's the solution I eventually used: <ul> <li>One file per set</li> <li>File contains 2^k buckets, each 256 bytes or 32 entries of 8 bytes</li> <li>Empty entries are just zeroed out (000... is a valid hash, but I don't care about 2^-64 chance of collision, if everything can collide with everything else already, by the nature of hashing).</li> <li>Every hash resides in bucket guessed via its first k bits</li> <li>If any bucket overflows, double file size and split every bucket</li> <li>Everything is accessed via mmap(), not read()/write()</li> </ul> It's just unbelievably faster than sqlite, even though it's low-level Perl code, and Perl really isn't meant for high performance databases. It will not work with anything that's less uniformly distributed than MD5, its assuming everything will be extremely uniform to keep the implementation simple. I tried it with seek()/sysread()/syswrite() at first, and it was very slow, mmap() version is really a lot faster.

I had some trouble picturing your exact problem/need, but it still got me thinking about Git and how it stores SHA1-references on disk: Take the hexadecimal string representation of a given hash, say, "<code>abfab0da6f4ebc23cb15e04ff500ed54</code>". Chop the two first characters in the hash ("<code>ab</code>", in our case) and make it into a directory. Then, use the rest ("<code>fab0da6f4ebc23cb15e04ff500ed54</code>"), create the file, and put stuff in it. This way, you get pretty decent performance on-disk (depending on your FS, naturally) with an automatic indexing. Additionally, you get direct access to any known hash, just by wedging a directory delimiter after the two first chars ("<code>./ab/fab0da</code>[..]") I'm sorry if I missed the ball entirely, but with any luck, this might give you an idea.

Fast disk-based hashtables?

2 Answers

Here's the solution I eventually used:

One file per set
File contains 2^k buckets, each 256 bytes or 32 entries of 8 bytes
Empty entries are just zeroed out (000... is a valid hash, but I don't care about 2^-64 chance of collision, if everything can collide with everything else already, by the nature of hashing).
Every hash resides in bucket guessed via its first k bits
If any bucket overflows, double file size and split every bucket
Everything is accessed via mmap(), not read()/write()

It's just unbelievably faster than sqlite, even though it's low-level Perl code, and Perl really isn't meant for high performance databases. It will not work with anything that's less uniformly distributed than MD5, its assuming everything will be extremely uniform to keep the implementation simple.

I tried it with seek()/sysread()/syswrite() at first, and it was very slow, mmap() version is really a lot faster.

153

answered Oct 01 '22 20:10

taw

I had some trouble picturing your exact problem/need, but it still got me thinking about Git and how it stores SHA1-references on disk:

Take the hexadecimal string representation of a given hash, say, "abfab0da6f4ebc23cb15e04ff500ed54". Chop the two first characters in the hash ("ab", in our case) and make it into a directory. Then, use the rest ("fab0da6f4ebc23cb15e04ff500ed54"), create the file, and put stuff in it.

This way, you get pretty decent performance on-disk (depending on your FS, naturally) with an automatic indexing. Additionally, you get direct access to any known hash, just by wedging a directory delimiter after the two first chars ("./ab/fab0da[..]")

I'm sorry if I missed the ball entirely, but with any luck, this might give you an idea.

answered Oct 01 '22 20:10

Henrik Paul

Related questions
                            
                                Using Hashtables/Dictionaries with string keys & Case Insensitive Searching
                            
                                How HashTable and HashMap key-value are stored in the memory?
                            
                                Hash table faster in C# than C++?
                            
                                Efficient data structure for a leaderboard, i.e., a list of records (name, points) - Efficient Search(name), Search(rank) and Update(points)
                            
                                c++ - unordered_map complexity
                            
                                Hash tables v self-balancing search trees
                            
                                How can I convert List<object> to Hashtable in C#?
                            
                                How to store a hash table in a file?
                            
                                Haskell mutable map/tree
                            
                                How to check if an associative array is empty in powershell
                            
                                how does the hashCode() method of java works?
                            
                                Rehashing process in hashmap or hashtable
                            
                                Why deletion of elements of hash table using doubly-linked list is O(1)?
                            
                                How does one implement hash tables in a functional language?
                            
                                Jump Table Switch Case question
                            
                                How are hash tables implemented internally in popular languages?
                            
                                Which is faster to find an item in a hashtable or in a sorted list?
                            
                                hash function providing unique uint from an integer coordinate pair
                            
                                Multi-valued hashtable in Java
                            
                                How to write a hash function in C?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Fast disk-based hashtables?

Tags:

hashtable

taw

People also ask

2 Answers

taw

Henrik Paul

Recent Activity

Donate For Us