I have been looking at .NET's implementation of the dictionary, as I wanted to understand what makes the dictionary ContainsKey and lookup fast: http://referencesource.microsoft.com/#mscorlib/system/collections/generic/dictionary.cs,15debc34d286fdb3
The ContainsKey function basically leads to the FindEntry listed below:
buckets is an array of integers and entries is an array of Entry objects, which are structs containing HashCode, TKey and TValue.
So I understand that this lookup is fast since it's a simple array lookup.
private int FindEntry(TKey key) {
if( key == null) {
ThrowHelper.ThrowArgumentNullException(ExceptionArgument.key);
}
if (buckets != null) {
int hashCode = comparer.GetHashCode(key) & 0x7FFFFFFF;
for (int i = buckets[hashCode % buckets.Length]; i >= 0; i = entries[i].next) {
if (entries[i].hashCode == hashCode && comparer.Equals(entries[i].key, key)) return i;
}
}
return -1;
}
However I'm trying to understand these 2 lines:
int hashCode = comparer.GetHashCode(key) & 0x7FFFFFFF;
for (int i = buckets[hashCode % buckets.Length]; i >= 0; i = entries[i].next)
1) If I unerstand correctly 0x7FFFFFFF is there to ensure that we don't get a negative value. So what does the first line return? Is it a simple integer or a prime?
2) In the second line why do we initialize i to buckets[hashCode % buckets.Length]?
The first line returns the hash code with the high bit turned off to make the number positive. It's not necessarily a prime. It's totally valid to discard data from any hash. The hash 0
(constant zero) is always a valid hash. That's why this operation is safe.
In the second line we need to map from hash code to a bucket index. Any deterministic mapping will do. So again we are discarding information from the hash by reducing the number of possible values. The modulo operator makes for quite a uniform mapping. Other mappings are possible such as simply masking off bits (again).
In the .NET Dictionary
class each bucket logically is the start of a linked list. int[] buckets
contains the index for entries
for the start of the linked list stored inside of entries
.
It's complicated for performance reasons. Logically, buckets
could be a new LinkedList<Entry>[capacity]
. That would do the same with but with a lot more allocations.
There are articles on the web about Dictionary
internals. I find the algorithm quite nice and clever. It does not need a load factor. The table can be loaded fully.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With