
What integer hash functions are good that accept an integer hash key?

Tags: c, algorithm, hash


asked Sep 26 '22 15:09 by Lear



2 Answers

I found that the following algorithm provides a very good statistical distribution. Each input bit affects each output bit with about 50% probability. There are no collisions (each input results in a different output). The algorithm is fast unless the CPU lacks a built-in integer multiplication unit. C code, assuming int is 32 bits (for Java, replace >> with >>> and remove unsigned):

unsigned int hash(unsigned int x) {
    x = ((x >> 16) ^ x) * 0x45d9f3b;
    x = ((x >> 16) ^ x) * 0x45d9f3b;
    x = (x >> 16) ^ x;
    return x;
}

The magic number was calculated using a special multi-threaded test program that ran for many hours. It measures the avalanche effect (the number of output bits that change if a single input bit is changed; should be nearly 16 on average), the independence of output bit changes (output bits should not depend on each other), and the probability of a change in each output bit if any input bit is changed. The calculated values are better than those of the 32-bit finalizer used by MurmurHash, and nearly as good (not quite) as when using AES. A slight advantage is that the same constant is used twice (that made it slightly faster the last time I tested; I'm not sure whether it's still the case).
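
To make the avalanche figure concrete, here is a minimal sketch of how the average could be measured. It is not the test program mentioned above. The names hash32, popcount32 and average_avalanche, and the way sample inputs are generated, are assumptions made for illustration.

```c
#include <stdint.h>

/* The 32-bit hash from this answer, renamed hash32 (the name is an assumption). */
uint32_t hash32(uint32_t x) {
    x = ((x >> 16) ^ x) * UINT32_C(0x45d9f3b);
    x = ((x >> 16) ^ x) * UINT32_C(0x45d9f3b);
    x = (x >> 16) ^ x;
    return x;
}

/* Portable bit count: clears the lowest set bit until none are left. */
int popcount32(uint32_t v) {
    int n = 0;
    while (v != 0) {
        v &= v - 1;
        n++;
    }
    return n;
}

/* Average number of output bits that flip when a single input bit is
 * flipped, taken over `samples` inputs and all 32 input bit positions.
 * Returns 0.0 when samples is 0. */
double average_avalanche(uint32_t samples) {
    uint64_t flipped = 0;
    if (samples == 0) {
        return 0.0;
    }
    for (uint32_t i = 0; i < samples; i++) {
        /* Spread the sample inputs over the 32-bit range (arbitrary choice). */
        uint32_t x = i * UINT32_C(2654435761);
        uint32_t h = hash32(x);
        for (int bit = 0; bit < 32; bit++) {
            uint32_t h2 = hash32(x ^ (UINT32_C(1) << bit));
            flipped += (uint64_t)popcount32(h ^ h2);
        }
    }
    return (double)flipped / ((double)samples * 32.0);
}
```

Calling average_avalanche(10000) should give a value close to 16 if the mixing is as good as described. This only checks the average; the independence and per-bit probability tests would need per-bit counters.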

You can reverse the process (get the input value from the hash) if you replace 0x45d9f3b with 0x119de1f3 (its multiplicative inverse modulo 2^32):

unsigned int unhash(unsigned int x) {
    x = ((x >> 16) ^ x) * 0x119de1f3;
    x = ((x >> 16) ^ x) * 0x119de1f3;
    x = (x >> 16) ^ x;
    return x;
}
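
As a sketch of where the inverse constant comes from: any odd number has a multiplicative inverse modulo 2^32, and it can be found with a few Newton iterations. The helper name mul_inverse32 and the renamed hash32/unhash32 are assumptions for illustration; the constants are the ones from this answer.

```c
#include <stdint.h>

/* Multiplicative inverse of an odd number modulo 2^32 (Newton iteration).
 * The start value inv = a is correct to 3 bits because a*a == 1 (mod 8)
 * for every odd a; each step doubles the number of correct bits
 * (3 -> 6 -> 12 -> 24 -> 48), so 4 steps are enough for 32 bits.
 * The result is meaningless for even a, which has no inverse. */
uint32_t mul_inverse32(uint32_t a) {
    uint32_t inv = a;
    for (int i = 0; i < 4; i++) {
        inv *= UINT32_C(2) - a * inv;
    }
    return inv;
}

/* Same code as in the answer, renamed so both directions can coexist. */
uint32_t hash32(uint32_t x) {
    x = ((x >> 16) ^ x) * UINT32_C(0x45d9f3b);
    x = ((x >> 16) ^ x) * UINT32_C(0x45d9f3b);
    x = (x >> 16) ^ x;
    return x;
}

uint32_t unhash32(uint32_t x) {
    x = ((x >> 16) ^ x) * UINT32_C(0x119de1f3);
    x = ((x >> 16) ^ x) * UINT32_C(0x119de1f3);
    x = (x >> 16) ^ x;
    return x;
}
```

mul_inverse32(0x45d9f3b) returns 0x119de1f3, and unhash32(hash32(x)) gives back x. The xor-shift step needs no separate inverse here because a shift by 16 on a 32-bit value undoes itself when applied twice.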

For 64-bit numbers, I suggest using the following, even though it might not be the fastest. It is based on splitmix64, which seems to be based on the blog article Better Bit Mixing (mix 13).

uint64_t hash(uint64_t x) {
    x = (x ^ (x >> 30)) * UINT64_C(0xbf58476d1ce4e5b9);
    x = (x ^ (x >> 27)) * UINT64_C(0x94d049bb133111eb);
    x = x ^ (x >> 31);
    return x;
}

For Java, use long instead of uint64_t, append L to the constants (dropping the UINT64_C wrapper), and replace >> with >>>. In this case, reversing is more complicated:

uint64_t unhash(uint64_t x) {
    x = (x ^ (x >> 31) ^ (x >> 62)) * UINT64_C(0x319642b2d24d8ec3);
    x = (x ^ (x >> 27) ^ (x >> 54)) * UINT64_C(0x96de1b173f119089);
    x = x ^ (x >> 30) ^ (x >> 60);
    return x;
}
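
The extra shift terms appear because a shift smaller than half the word width is no longer its own inverse. A minimal sketch of the general rule, assuming the hypothetical helper name unxorshift64 and the renamed hash64/unhash64:

```c
#include <stdint.h>

/* Invert y = x ^ (x >> s) for 1 <= s <= 63.
 * The inverse is y ^ (y >> s) ^ (y >> 2s) ^ ... for all multiples of s
 * below 64; terms shifted by 64 or more would be zero anyway.
 * s must not be 0, otherwise the loop never ends. */
uint64_t unxorshift64(uint64_t y, unsigned s) {
    uint64_t x = y;
    for (unsigned k = s; k < 64; k += s) {
        x ^= y >> k;
    }
    return x;
}

/* The 64-bit hash from this answer, renamed hash64. */
uint64_t hash64(uint64_t x) {
    x = (x ^ (x >> 30)) * UINT64_C(0xbf58476d1ce4e5b9);
    x = (x ^ (x >> 27)) * UINT64_C(0x94d049bb133111eb);
    x = x ^ (x >> 31);
    return x;
}

/* Same steps as unhash above, written with the generic helper:
 * undo the operations in reverse order, using the inverse constants. */
uint64_t unhash64(uint64_t x) {
    x = unxorshift64(x, 31) * UINT64_C(0x319642b2d24d8ec3);
    x = unxorshift64(x, 27) * UINT64_C(0x96de1b173f119089);
    x = unxorshift64(x, 30);
    return x;
}
```

For s = 31 the helper computes y ^ (y >> 31) ^ (y >> 62), which is exactly the first line of the unhash function above.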

Update: You may also want to look at the Hash Function Prospector project, where other (possibly better) constants are listed.

answered Oct 11 '22 06:10 by Thomas Mueller


Knuth's multiplicative method:

hash(i)=i*2654435761 mod 2^32

In general, you should pick a multiplier that is on the order of your hash size (2^32 in the example) and has no common factors with it. This way the hash function covers all of your hash space uniformly.

Edit: The biggest disadvantage of this hash function is that it preserves divisibility, so if your integers are all divisible by 2 or by 4 (which is not uncommon), their hashes will be too. This is a problem in hash tables that take the bucket index from the low bits of the hash - you can end up with only 1/2 or 1/4 of the buckets being used.
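
A minimal sketch of the method and of the divisibility problem, assuming a table of 16 buckets. The names knuth_hash, bucket_low, bucket_high and buckets_used are made up for illustration. Taking the bucket index from the top bits instead of the low bits, as Knuth's multiplicative hashing is usually formulated, avoids the issue.

```c
#include <stdint.h>

/* hash(i) = i * 2654435761 mod 2^32; unsigned overflow does the mod. */
uint32_t knuth_hash(uint32_t i) {
    return i * UINT32_C(2654435761);
}

/* Bucket index from the low `bits` bits (1 <= bits <= 31).
 * Inherits divisibility by powers of two from the key. */
uint32_t bucket_low(uint32_t key, unsigned bits) {
    return knuth_hash(key) & ((UINT32_C(1) << bits) - 1);
}

/* Bucket index from the top `bits` bits (1 <= bits <= 31). */
uint32_t bucket_high(uint32_t key, unsigned bits) {
    return knuth_hash(key) >> (32 - bits);
}

/* Number of distinct buckets (out of 16) hit by the keys
 * 0, step, 2*step, ..., (count-1)*step. */
unsigned buckets_used(uint32_t step, uint32_t count, int use_high) {
    uint32_t seen = 0; /* one bit per bucket */
    unsigned used = 0;
    for (uint32_t k = 0; k < count; k++) {
        uint32_t key = k * step;
        uint32_t b = use_high ? bucket_high(key, 4) : bucket_low(key, 4);
        seen |= UINT32_C(1) << b;
    }
    for (; seen != 0; seen &= seen - 1) {
        used++;
    }
    return used;
}
```

With keys that are all multiples of 4, buckets_used(4, 1024, 0) reports only 4 of the 16 buckets in use, while buckets_used(4, 1024, 1) reports all 16.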

answered Oct 11 '22 06:10 by Rafał Dowgird