Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Hash Array Mapped Trie (HAMT)

I am trying to get my head around the details of a HAMT. I'd have implemented one myself in Java just to understand. I am familiar with Tries and I think I get the main concept of the HAMT.

Basically,

Two types of nodes:

Key/Value

Key Value Node:
  K key
  V value

Index

Index Node:
  int bitmap (32 bits)
  Node[] table (max length of 32)
  1. Generate a 32-bit hash for an object.
  2. Step through the hash 5-bits at a time. (0-4, 5-9, 10-14, 15-19, 20-24, 25-29, 30-31) note: the last step (7th) is only 2 bits.
  3. At each step, find the position of that 5-bit integer in the bitmap. e.g. integer==5 bitmap==00001
    1. If the bit is a 1 then that part of the hash exist.
    2. If the bit is a 0 then key doesn't exist.
  4. If the key exists, find it's index into the table by counting the number of 1s in the bitmap between 0 and the position. e.g. integer==6 bitmap==0101010101 index==3
    1. If the table points to a key/value node then compare the keys.
    2. If the table points to a index node then go to 2 moving ahead one step.

The part I don't quite understand is collision detection and mitigation. In the linked paper he alludes to it:

The existing key is then inserted in the new sub-hash table and the new key added. Each time 5 more bits of the hash are used the probability of a collision reduces by a factor of 1/32. Occasionally an entire 32 bit hash may be consumed and a new one must be computed to differentiate the two keys.

If I were to compute a "new" hash and store the object at that new hash; how would you ever be able to look-up the object in the structure? When doing a look-up, wouldn't it generate the "initial" hash and not the "re-computed hash".

I must be missing something.....

BTW: The HAMT performs fairly well, it's sits between a hash map and tree map in my tests.

Data Structure                    Add time   Remove time     Sorted add time Sorted remove time   Lookup time     Size     
Java's Hash Map                   38.67 ms   18 ms           30 ms           15 ms                16.33 ms        8.19 MB        
HAMT                              68.67 ms   30 ms           57.33 ms        35.33 ms             33 ms           10.18 MB       
Java's Tree Map                   86.33 ms   78 ms           75 ms           39.33 ms             76 ms           8.79 MB        
Java's Skip List Map              111.33 ms  106 ms          72 ms           41 ms                72.33 ms        9.19 MB     
like image 941
Justin Avatar asked Oct 31 '13 19:10

Justin


3 Answers

There's two sections of the paper I think you might of missed. The first is the bit immediately preceding the bit you quoted:

Or the key will collide with an existing one. In which case the existing key must be replaced with a sub-hash table and the next 5 bit hash of the existing key computed. If there is still a collision then this process is repeated until no collision occurs.

So if you have object A in the table and you add object B which clashes, the cell at which their keys clashed will be a pointer to another subtable (where they don't clash).

Next, Section 3.7 of the paper you linked describes the method for generating a new hash when you run off the end of your first 32 bits:

The hash function was tailored to give a 32 bit hash. The algorithm requires that the hash can be extended to an arbitrary number of bits. This was accomplished by rehashing the key combined with an integer representing the trie level, zero being the root. Hence if two keys do give the same initial hash then the rehash has a probability of 1 in 2^32 of a further collision.

If this doesn't seem to explain anything, say and I'll extend this answer with more detail.

like image 124
Andy Jones Avatar answered Oct 05 '22 19:10

Andy Jones


HAMT is great and highly performant structure especially when one needs immutable objects, i.e. each time after any modification a new copy of a data structure is created!

As for your question on hash collisions, I have found a C# implementation (which is buggy now) that shows how it works: on each hash collision the key is rehashed and lookup is retried recursively until maximum iterations limit is reached.

Currently I am also exploring HAMP in functional programming context and learning existing code. There are several reference implementations of HAMT in Haskell as Data.HshMap and in Clojure as PersistenceHashMap.

There are some other simpler implementations on the web that do not deal with collisions, but they are useful to understand the concept. Here they are in Haskell and OCaml

I have found a nice summary article article that describes HAMT with pictures and links to original research papers by Phil Bagwell.

Related points:

While implementing HAMT in F# I have noticed that popCount function implementation described here really matters and gives 10-15% compared to naive implementation described in the next answers in the link. Not great, but a free lunch.

Related IntMap structures (Haskell and its port to F#) are very good when the key could be an integer and they implement related PATRICIA/Radix trie.

I believe all these implementation are very good to learn efficient immutable data structure and functional languages in all their beauty on these examples - they really shine together!

like image 20
V.B. Avatar answered Oct 05 '22 20:10

V.B.


If I were to compute a "new" hash and store the object at that new hash; how would you ever be able to look-up the object in the structure? When doing a look-up, wouldn't it generate the "initial" hash and not the "re-computed hash".

When doing a look-up the initial hash is used. When the bits in the initial hash is exhausted, either one of the following condition is true:

  1. we end up with a key/value node - return it
  2. we end up with an index node - this is the hint that we have to go deeper by recomputing a new hash.

The key here is hash bits exhaustion.

like image 36
holygeek Avatar answered Oct 05 '22 18:10

holygeek