Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What are good methods for hashing bits in an Int32 or UInt32?

I have an implementation of a pseudo random number generator, specifically of George Marsaglia's XOR-Shift RNG. My implementation is here:

FastRandom.cs

It turns out that the first random sample is very closely correlated with the seed, which is fairly obvious if you take a look at the Reinitialise(int seed) method. This is bad. My proposed solution is to mix up the bits of the seed as follows:

_x = (uint)(  (seed * 2147483647) 
           ^ ((seed << 16 | seed >> 48) * 28111) 
           ^ ((seed << 32 | seed >> 32) * 69001)
           ^ ((seed << 48 | seed >> 16) * 45083));

So I have significantly weakened any correlation by multiplying the seed's bits with four primes and XORing back to form _x. I also rotate the seed's bits before multiplication to ensure that bits of varying magnitudes get mixed up across the full range of values for a 32 bit value.

The four-way rotation just seemed liked a nice balance between doing nothing and every possible rotation (32). The primes are 'finger in the air' - enough magnitude and bit structure to jumble up the bits and 'spread' them over the full 32 bits regardless of the starting seed.

Should I use bigger primes? Is there a standard approach to this problem, perhaps with a more formal basis? I am trying to do this with minimal CPU overhead.

Thanks

=== UPDATE ===

I decided to use some primes with set bits better distributed across all 32 bits. The result is that I can omit the shifts as the multiplications achieve the same effect (hashing bits across the full range of 32 bits), so I then just add the four products to give the final seed...

_x = (uint)(  (seed * 1431655781) 
            + (seed * 1183186591) 
            + (seed * 622729787)
            + (seed * 338294347));

I could possibly get away with fewer primes/multiplciations. Two seemed too few (I could still see patterns in the first samples), three looked OK, so for a safety margin I made it four.

=== UPDATE 2 ===

FYI the above reduces to the functionally equivalent:

_x = seed * 3575866506U;

I didn't spot this initially and when I did I was wondering if overflowing at different stages in the calculation would cause a different result. I believe the answer is no - the two calculations always give the same answer.

like image 931
redcalx Avatar asked Oct 03 '12 21:10

redcalx


1 Answers

According to some researchers, CrapWow, Crap8 and Murmur3 are the best non-cryptographic hash algorithms available today that are both fast, simple and statistically good.

More information is available at Non-Cryptographic Hash Function Zoo.

Edit: As of May, 2021 the floodberry.com links to the Non-Cryptographic Hash Function Zoo are not valid. The content can still be found on archive.org.

like image 118
ogggre Avatar answered Nov 04 '22 07:11

ogggre