Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Hash function for 64 bit to 10 bits

I want a hash function that takes a long number (64 bits) and produces result of 10 bits. What is the best hash function for such purpose. Inputs are basically addresses of variables (Addresses are of 64 bits or 8 bytes on Linux), so my hash function should be optimized for that purpose.

like image 651
MetallicPriest Avatar asked Jul 06 '12 09:07

MetallicPriest


3 Answers

I would say somethig like this:

uint32_t hash(uint64_t x)
{
    x >>= 3;
    return (x ^ (x>>10) ^ (x>>20)) & 0x3FF;
}

The lest significant 3 bits are not very useful, as most variables are 4-byte or 8-byte aligned, so we remove them. Then we take the next 30 bits and mix them together (XOR) in blocks of 10 bits each.

Naturally, you could also take the (x>>30)^(x>>40)^(x>>50) but I'm not sure if they'll make any difference in practice.

like image 181
rodrigo Avatar answered Oct 20 '22 02:10

rodrigo


I wrote a toy program to see some real addresses on the stack, data area, and heap. Basically I declared 4 globals, 4 locals and did 2 mallocs. I dropped the last two bits when printing the addresses. Here is an output from one of the runs:

 20125e8
 20125e6
 20125e7
 20125e4
3fef2131
3fef2130
3fef212f
3fef212c
 25e4802
 25e4806

What this tells me:

  1. The LSB in this output (3rd bit of the address) is frequently 'on' and 'off'. So I wouldn't drop it when calculating the hash. Dropping 2 LSBs seems enough.
  2. We also see that there is more entropy in the lower 8-10 bits. We must use that when calculating the hash.
  3. We know that on a 64 bit machine, virtual addresses are never more than 48 bits wide.

What I would do next:

/* Drop two LSBs.  */
a >>= 2;

/* Get rid of the MSBs. Keep 46 bits. */
a &= 0x3fffffffffff;

/* Get the 14 MSBs and fold them in to get a 32 bit integer.
The MSBs are mostly 0s anyway, so we don't lose much entropy.  */
msbs = (a >> 32) << 18;
a ^= msbs;

Now we pass this through a decent 'half avalanche' hash function, instead of rolling our own. 'Half avalanche' means each bit of the input gets a chance to affect bits at the same position and higher:

uint32_t half_avalanche( uint32_t a)
{
    a = (a+0x479ab41d) + (a<<8);
    a = (a^0xe4aa10ce) ^ (a>>5);
    a = (a+0x9942f0a6) - (a<<14);
    a = (a^0x5aedd67d) ^ (a>>3);
    a = (a+0x17bea992) + (a<<7);
    return a;
}

For an 10-bit hash, use the 10 MSBs of the uint32_t returned. The hash function continues to work fine if you pick N MSBs for an N bit hash, effectively doubling the bucket count with each additional bit.

I was a little bored, so I wrote a toy benchmark for this. Nothing fancy, it allocates a bunch of memory on the heap and tries out the hash I described above. The source can be had from here. An example result:

1024 buckets, 256 values generated, 29 collissions
1024 buckets, 512 values generated, 103 collissions
1024 buckets, 1024 values generated, 370 collissions

Next: I tried out the other two hashes answered here. They both have similar performance. Looks like: Just pick the fastest one ;)

like image 44
ArjunShankar Avatar answered Oct 20 '22 04:10

ArjunShankar


Best for most distributions is mod by a prime, 1021 is the largest 10-bit prime. There's no need to strip low bits.

static inline int hashaddress(void *v)
{
        return (uintptr_t)v % 1021;
}

If you think performance might be a concern, have a few alternates on hand and race them in your actual program. Microbenchmarks are waste; a difference of a few cycles is almost certain to be swamped by cache effects, and size matters.

like image 34
jthill Avatar answered Oct 20 '22 03:10

jthill