According to various sources, such as Wikipedia and various .edu websites found by Google, the most common ways for a hash table to resolve collisions are linear or quadratic probing and chaining. Randomized probing is briefly mentioned but not given much attention. I've implemented a hash table that uses randomized probing to resolve collisions. Assuming there is a collision, resolution works as follows:
This has the very nice property that, regardless of how many hash collisions there are in modulus space, lookup and insertion times are expected to be O(1) as long as there are few collisions in full 32-bit hash space. Because the probe sequence is pseudo-random, no clustering behavior results from modulus space collisions, unlike with linear probing. Because the entire system is open-addressed and doesn't use linked lists anywhere, you don't need to perform a memory allocation on each insertion, unlike chaining.
Furthermore, because the size of the hash is usually the size of the address space (32 bits on 32-bit machines), it is simply impossible to fit enough items in address space to cause large numbers of hash collisions in full 32-bit hash space under a good hashing scheme.
Why, then, is randomized probing such an unpopular collision resolution strategy?
One of the reasons for using linear lookup (such as double hasing) is cache locality. By making the second (rehash) function to be an addition of a small integer, most chances are that you'll hit the same cache line. It is very significant for large hashes.
Chain hashing is probably used due to its simplicity.
Python's dictionary implementation does this. A very nice comment in dictobject.c says:
...
The first half of collision resolution is to visit table indices via this
recurrence:
j = ((5*j) + 1) mod 2**i
For any initial j in range(2**i), repeating that 2**i times generates each
int in range(2**i) exactly once (see any text on random-number generation for
proof).
...
Sure looks like a linear congruential RNG to me!
Note that the full state of such an RNG is only i bits--has to be, to avoid revisiting entries--so you can't meaningfully use "[t]he full (32-bit) hash of an object" to seed the RNG. Python initially seeds j with i bits from the hash. If there is another collision, it grabs another 5 bits from the hash and throws those into the mix. (Read the rest of that comment, particularly where it talks about PERTURB_SHIFT
.) It continues that way, adding more bits with each collision, until it has used up the whole hash code. This way Python uses a decent amount of whatever randomness the hash code offers, and the code is simple and fast.
This is some of the finest code I've ever read. It's featured in chapter 18 of Beautiful Code. So I'd say you're on to something!
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With