Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How likely is it to get a HashCode collision with this hashcode function?

How likely is it to get a HashCode collision with the function below in following scenarios.

  1. With random int values for key[0],key[1], key[2], key[3]
  2. With random key values with the following constraints
    • key[0] <1,000,000
    • key[1] <10,000
    • key[2] <1,000
    • key[3] <1,000

Assume we have 10 Million objects.

int[] key=new int[4];    
public override int GetHashCode()
{
    // Use large prime multiples to create a unique hash key
    // Create the hash offsets using a "even powers of 2 minus 1" method, which gives 
    // primes most of the time.  
    int hashKey = 0;
    hashKey += 2047 * key[0];
    hashKey += 8191 * key[1];
    hashKey += 32767 * key[2];
    hashKey += 131071 * key[3];
    return hashKey;
}
like image 793
derdo Avatar asked Mar 16 '11 21:03

derdo


2 Answers

This is kind of a strange question. Let's start with the obvious errors in the code:

// Use large prime multiples to create a unique hash key     
// Create the hash offsets using a "even powers of 2 minus 1" method, which gives      
// primes most of the time.   

First off, those are all odd powers of two minus one; none of them are even powers of two minus one.

Second, of the four multipliers you've chosen as "large prime multiples", half of them are not prime. 2047 and 32767 are composite.

Third, if we "correct" -- and I use the word advisedly -- the statement to be "odd powers of 2 minus one which gives primes most of the time" that statement is absurdly wrong. A prime of that form is known as a Mersenne prime, and there are only 47 known Mersenne primes. I assure you that the density of Mersenne primes is considerably lower than one half. Put it this way: of the odd-power Mersenne numbers between 2^1-1 and 2^43112609−1, 46 of them are known to be prime numbers, which is about one in a half a million, not half.

Fourth, what do you imagine prime numbers have to do with anything? What mythological power do prime numbers have? Surely what matters is that the distribution of hash codes tends to not produce multiples of the hash table size. Since the hash table size is chosen to be a prime number, it seems like this is potentially exacerbating the problem.

Fifth, hash keys are not unique; your question is about when they collide, so clearly they cannot be unique.

Sixth, suppose your hash function had a perfectly random distribution across the space of 32 bit integers. By the birthday "paradox" you'd expect there to be a far greater than 99% chance of at least one collision when drawing ten million numbers at random from a 32 bit space. In fact, the expected number of collisions would be on the order of ten or twenty thousand. (We could work out the exact number of expected collisions, but who cares what it is exactly; it is in that order of magnitude.)

Is that too many collisions? It is going to be very difficult to do better than a random distribution. If you require fewer collisions than that, then you shouldn't be using a 32 bit hashing algorithm in the first place.

Seventh, who cares how many collisions a hash function has across its full range? Surely the practical question ought to really be "how does this hash perform with realistic data in a large table?" You, unlike us, can answer that question by trying it. If it meets your performance budget, great, worry about something else. If it doesn't, figure out why not before you start blaming the hash function.

I am very confused by this question and what you hope to gain from its answer. Can you explain?

like image 115
Eric Lippert Avatar answered Sep 27 '22 21:09

Eric Lippert


I wrote a quick script to test this.

import random

def hash(key):
    hashKey = 0
    hashKey += 2047 * key[0]
    hashKey += 8191 * key[1]
    hashKey += 32767 * key[2]
    hashKey += 131071 * key[3]
    return hashKey

seen = set()
collisions = 0
for i in range(0,10000000):
    x = hash([random.randint(0,1000000), random.randint(0,10000), random.randint(0,1000), random.randint(0,1000)])
    if x in seen:
        collisions += 1
    else:
        seen.add(x)

print collisions

When I ran it, it told me I got 23735 collisions. I also tried it on one million elements, and I got 247 collisions. Both numbers are averages over 4 runs.

like image 45
Viktor Dahl Avatar answered Sep 27 '22 21:09

Viktor Dahl