I am trying to get my head around the details of a HAMT (hash array mapped trie). I've implemented one myself in Java just to understand it. I am familiar with tries, and I think I get the main concept of the HAMT.
Basically, there are two types of nodes:

Key/Value Node:

    K key
    V value

Index Node:

    int bitmap (32 bits)
    Node[] table (max length of 32)

A 32-bit hash of the key is consumed 5 bits at a time (bit ranges 0-4, 5-9, 10-14, 15-19, 20-24, 25-29, 30-31; note that the last step, the 7th, is only 2 bits). At each index node the current 5-bit integer selects one bit of the bitmap:

    If that bit is 0, there is no entry for that fragment, e.g. integer==5 with bitmap==00001: bit 5 is clear.
    If that bit is 1, the child's slot in the table is the count of set bits below it, e.g. integer==6 with bitmap==0101010101 gives index==3 (bits 0, 2 and 4 are set below bit 6).
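To make the indexing concrete, here is a minimal Java sketch of the two node types and the bitmap-to-slot computation; the class and method names (Hamt, fragment, tableIndex) are mine, not from the paper:

    // Minimal sketch of the two HAMT node types (names are illustrative).
    interface Node<K, V> {}

    class KeyValueNode<K, V> implements Node<K, V> {
        final K key;
        final V value;
        KeyValueNode(K key, V value) { this.key = key; this.value = value; }
    }

    class IndexNode<K, V> implements Node<K, V> {
        int bitmap;          // bit i is set iff a child exists for 5-bit fragment i
        Node<K, V>[] table;  // compressed: length == Integer.bitCount(bitmap)
    }

    class Hamt {
        // Extract the 5-bit hash fragment used at the given trie level.
        static int fragment(int hash, int level) {
            return (hash >>> (5 * level)) & 0b11111;
        }

        // Map a 5-bit fragment to a slot in the compressed table by
        // counting the set bits below the fragment's bit.
        static int tableIndex(int bitmap, int fragment) {
            return Integer.bitCount(bitmap & ((1 << fragment) - 1));
        }
    }

With bitmap==0101010101 and fragment==6, bit 6 is set and three lower bits (0, 2 and 4) are set, so tableIndex returns 3, matching the example above.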
The part I don't quite understand is collision detection and mitigation. In the linked paper he alludes to it:
The existing key is then inserted in the new sub-hash table and the new key added. Each time 5 more bits of the hash are used the probability of a collision reduces by a factor of 1/32. Occasionally an entire 32 bit hash may be consumed and a new one must be computed to differentiate the two keys.
If I were to compute a "new" hash and store the object at that new hash, how would you ever be able to look the object up in the structure? When doing a lookup, wouldn't it generate the "initial" hash and not the "re-computed" hash?

I must be missing something...
BTW: The HAMT performs fairly well; it sits between a hash map and a tree map in my tests.
Data Structure        Add time   Remove time  Sorted add time  Sorted remove time  Lookup time  Size
Java's Hash Map       38.67 ms   18 ms        30 ms            15 ms               16.33 ms     8.19 MB
HAMT                  68.67 ms   30 ms        57.33 ms         35.33 ms            33 ms        10.18 MB
Java's Tree Map       86.33 ms   78 ms        75 ms            39.33 ms            76 ms        8.79 MB
Java's Skip List Map  111.33 ms  106 ms       72 ms            41 ms               72.33 ms     9.19 MB
There are two sections of the paper I think you might have missed. The first is the bit immediately preceding the bit you quoted:
Or the key will collide with an existing one. In which case the existing key must be replaced with a sub-hash table and the next 5 bit hash of the existing key computed. If there is still a collision then this process is repeated until no collision occurs.
So if you have object A in the table and you add object B which clashes, the cell at which their keys clashed will be a pointer to another subtable (where they don't clash).
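In code the splitting step might look like the sketch below, reusing the illustrative types from your question (split is a hypothetical helper, not from the paper): when an insert lands on an occupied key/value cell, that cell is replaced by a new index node and both entries are pushed one level deeper, where their next 5-bit fragments usually differ:

    // Sketch: resolve a clash by replacing the cell with a subtable.
    @SuppressWarnings("unchecked")
    static <K, V> Node<K, V> split(KeyValueNode<K, V> existing, int existingHash,
                                   KeyValueNode<K, V> added, int addedHash,
                                   int level) {
        // Caution: once 5 * level >= 32 the hashes are exhausted and new
        // ones must be computed (Section 3.7, quoted below) before recursing.
        int fragA = Hamt.fragment(existingHash, level);
        int fragB = Hamt.fragment(addedHash, level);
        IndexNode<K, V> sub = new IndexNode<>();
        if (fragA != fragB) {
            // The next 5 bits differ: both entries fit in the new subtable,
            // stored in fragment order.
            sub.bitmap = (1 << fragA) | (1 << fragB);
            sub.table = (Node<K, V>[]) (fragA < fragB
                    ? new Node[] { existing, added }
                    : new Node[] { added, existing });
        } else {
            // Still colliding on these 5 bits: repeat one level deeper.
            sub.bitmap = 1 << fragA;
            sub.table = (Node<K, V>[]) new Node[] {
                    split(existing, existingHash, added, addedHash, level + 1) };
        }
        return sub;
    }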
Next, Section 3.7 of the paper you linked describes the method for generating a new hash when you run off the end of your first 32 bits:
The hash function was tailored to give a 32 bit hash. The algorithm requires that the hash can be extended to an arbitrary number of bits. This was accomplished by rehashing the key combined with an integer representing the trie level, zero being the root. Hence if two keys do give the same initial hash then the rehash has a probability of 1 in 2^32 of a further collision.
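The paper only requires that the extension be deterministic; the level-mixing scheme below is my own choice, shown as one plausible way to do it in Java:

    // Sketch: extend the 32-bit hash per Section 3.7 by rehashing the key
    // combined with the trie level, zero being the root. The finalizer
    // constants are borrowed from common integer mixers, not from the paper.
    static int hashAtLevel(Object key, int level) {
        int h = key.hashCode() ^ (31 * level);
        h ^= h >>> 16;   // avalanche step so different levels decorrelate
        h *= 0x85ebca6b;
        h ^= h >>> 13;
        return h;
    }

Because the extended hash is a pure function of the key and the trie level, a lookup that runs off the end of the first 32 bits can recompute exactly the same "new" hash the insert used, which is the answer to the question above.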
If this doesn't seem to explain anything, say so and I'll extend this answer with more detail.
The HAMT is a great and highly performant structure, especially when one needs immutable objects, i.e. where each modification creates a new copy of the data structure!
As for your question on hash collisions, I have found a C# implementation (which is buggy at the moment) that shows how it works: on each hash collision the key is rehashed and the lookup is retried recursively until a maximum iteration limit is reached.
Currently I am also exploring the HAMT in a functional programming context and reading existing code. There are several reference implementations of HAMTs, in Haskell as Data.HashMap and in Clojure as PersistentHashMap.
There are some other, simpler implementations on the web that do not deal with collisions, but they are useful for understanding the concept. Here they are in Haskell and OCaml.
I have found a nice summary article that describes the HAMT with pictures and links to the original research papers by Phil Bagwell.
Related points:
While implementing a HAMT in F#, I noticed that the popCount function implementation described here really matters and gives a 10-15% speedup compared to the naive implementation described in the subsequent answers at that link (see the sketch after these points). Not huge, but a free lunch.
The related IntMap structures (in Haskell, with a port to F#) are very good when the key can be an integer; they implement the related PATRICIA/radix trie.
I believe all these implementations are very good for learning efficient immutable data structures and functional languages in all their beauty - they really shine together!
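For reference, the bit-twiddling popCount in question looks roughly like this in Java; note that Integer.bitCount implements the same SWAR algorithm and is usually intrinsified to a single POPCNT instruction, so the hand-rolled version pays off mainly in environments without such an intrinsic:

    // Classic SWAR population count: sums bits in 2-bit lanes, then
    // 4-bit lanes, then adds the four byte totals via the multiply.
    static int popCount(int x) {
        x = x - ((x >>> 1) & 0x55555555);
        x = (x & 0x33333333) + ((x >>> 2) & 0x33333333);
        x = (x + (x >>> 4)) & 0x0f0f0f0f;
        return (x * 0x01010101) >>> 24;
    }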
If I were to compute a "new" hash and store the object at that new hash, how would you ever be able to look the object up in the structure? When doing a lookup, wouldn't it generate the "initial" hash and not the "re-computed" hash?
When doing a lookup, the initial hash is used. When the bits in the initial hash are exhausted, one of the following conditions is true:

    the key has been found in a key/value node - return the value; or
    the key is still colliding - compute the "new" hash exactly as the insert did and continue the search with it.

The key here is hash-bit exhaustion: insert and lookup both detect it at the same trie level, and since the re-computed hash is a deterministic function of the key and that level (Section 3.7), the lookup re-derives the very same "new" hash the insert stored under.
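A minimal Java sketch of such a lookup, assuming the node types from the question and a deterministic rehash function hashAtLevel(key, level) (both are illustrative, not canonical):

    // Lookup consumes the initial hash 5 bits at a time and, on
    // exhaustion, re-derives the same extended hash the insert used.
    @SuppressWarnings("unchecked")
    static <K, V> V lookup(Node<K, V> root, K key) {
        Node<K, V> node = root;
        int level = 0;                   // absolute trie level
        int hash = hashAtLevel(key, 0);  // the "initial" hash
        int shift = 0;                   // bit offset within the current hash
        while (node instanceof IndexNode) {
            if (shift >= 32) {                   // bits exhausted: rehash,
                hash = hashAtLevel(key, level);  // exactly as the insert did
                shift = 0;
            }
            IndexNode<K, V> idx = (IndexNode<K, V>) node;
            int frag = (hash >>> shift) & 0b11111;
            if ((idx.bitmap & (1 << frag)) == 0) {
                return null;  // bit clear: key is absent
            }
            // Slot = number of set bits below this fragment's bit.
            node = idx.table[Integer.bitCount(idx.bitmap & ((1 << frag) - 1))];
            shift += 5;
            level++;
        }
        KeyValueNode<K, V> kv = (KeyValueNode<K, V>) node;
        return key.equals(kv.key) ? kv.value : null;
    }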