Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

FindEntry function in Dictionary.cs

I have been looking at .NET's implementation of the dictionary, as I wanted to understand what makes the dictionary ContainsKey and lookup fast: http://referencesource.microsoft.com/#mscorlib/system/collections/generic/dictionary.cs,15debc34d286fdb3

The ContainsKey function basically leads to the FindEntry listed below:

buckets is an array of integers and entries is an array of Entry objects, which are structs containing HashCode, TKey and TValue.

So I understand that this lookup is fast since it's a simple array lookup.

private int FindEntry(TKey key) {
        if( key == null) {
            ThrowHelper.ThrowArgumentNullException(ExceptionArgument.key);
        }
   if (buckets != null) {
            int hashCode = comparer.GetHashCode(key) & 0x7FFFFFFF;
            for (int i = buckets[hashCode % buckets.Length]; i >= 0; i = entries[i].next) {
                if (entries[i].hashCode == hashCode && comparer.Equals(entries[i].key, key)) return i;
            }
        }
        return -1;
    }

However I'm trying to understand these 2 lines:

int hashCode = comparer.GetHashCode(key) & 0x7FFFFFFF;
        for (int i = buckets[hashCode % buckets.Length]; i >= 0; i = entries[i].next)

1) If I unerstand correctly 0x7FFFFFFF is there to ensure that we don't get a negative value. So what does the first line return? Is it a simple integer or a prime?

2) In the second line why do we initialize i to buckets[hashCode % buckets.Length]?

like image 609
Iason Avatar asked Oct 30 '22 00:10

Iason


1 Answers

The first line returns the hash code with the high bit turned off to make the number positive. It's not necessarily a prime. It's totally valid to discard data from any hash. The hash 0 (constant zero) is always a valid hash. That's why this operation is safe.

In the second line we need to map from hash code to a bucket index. Any deterministic mapping will do. So again we are discarding information from the hash by reducing the number of possible values. The modulo operator makes for quite a uniform mapping. Other mappings are possible such as simply masking off bits (again).

In the .NET Dictionary class each bucket logically is the start of a linked list. int[] buckets contains the index for entries for the start of the linked list stored inside of entries.

It's complicated for performance reasons. Logically, buckets could be a new LinkedList<Entry>[capacity]. That would do the same with but with a lot more allocations.

There are articles on the web about Dictionary internals. I find the algorithm quite nice and clever. It does not need a load factor. The table can be loaded fully.

like image 152
usr Avatar answered Nov 15 '22 05:11

usr