Why such a high collision rate with my ELF hash implementation

Tags: c#, .net, hash

I am currently working on choosing a couple of general-purpose hashing functions for use in Object.GetHashCode() overrides. Initially, on the recommendation of this site I started with ELF. My C# implementation is below:

public int Generate(byte[] key) {
    // Mask for the top 4 bits, as used by the classic ELF/PJW hash.
    const uint c = 0xf0000000;

    uint h = 0,
         g = 0;

    unchecked {
        for (int i = 0, len = key.Length; i < len; i++) {
            h = (h << 4) + key[i];  // shift in the next byte

            if ((g = h & c) != 0)   // if the top 4 bits became set...
                h ^= g >> 24;       // ...fold them back into bits 4-7

            h &= ~g;                // and clear the top 4 bits
        }
    }

    return (int)h;
}

My test case consists of 524,288 unique values: short (1-64) and long (256-2048) strings built from a limited ASCII character set, plus short and long arbitrary binary data, 131,072 of each, to test each algorithm under a variety of circumstances.

I also understand the limitations of this test scenario. A hashing algorithm may perform exceptionally well when hashing, say, URLs, but be awful at hashing JPGs or something. Random strings/binary seems to me to be the best starting point for choosing a general-purpose function, though. I am happy to hear reasons why this is not the case.

I performed 3 separate test runs (generating a new set of random strings/bytes each time) and averaged the results.
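Roughly, the collision counting works like the sketch below (a simplified illustration, not my actual harness; the CollisionTest and ElfHash names are placeholders, and it draws raw random bytes rather than the ASCII-restricted strings):

using System;
using System.Collections.Generic;
using System.Security.Cryptography;

static class CollisionTest
{
    // Generate sampleCount unique random keys with lengths in [minLen, maxLen],
    // hash each one, and count how often a hash value repeats across distinct keys.
    public static int CountCollisions(Func<byte[], int> hash, int sampleCount, int minLen, int maxLen)
    {
        var rng = RandomNumberGenerator.Create();
        var lengths = new Random();
        var seenKeys = new HashSet<string>();    // guarantees the inputs themselves are unique
        var seenHashes = new HashSet<int>();
        int collisions = 0;

        while (seenKeys.Count < sampleCount)
        {
            var key = new byte[lengths.Next(minLen, maxLen + 1)];
            rng.GetBytes(key);

            if (!seenKeys.Add(Convert.ToBase64String(key)))
                continue;                        // duplicate input; draw another

            if (!seenHashes.Add(hash(key)))
                collisions++;                    // same hash produced by a different key
        }

        return collisions;
    }
}

// e.g. CollisionTest.CountCollisions(new ElfHash().Generate, 131072, 1, 64)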

The ELF algorithm produced a horrific number of collisions in comparison to the other algorithms I'm testing:

  • Short strings: 817 collisions (~0.5% fail rate).
  • Short binary: 550 collisions (~0.4% fail rate).
  • Long strings/binary: 34 collisions (~0.025% fail rate).

To place this in context, the other 3 algorithms I tested produced between 3 and 10 collisions on average for the same tests. It is also amongst the slowest of the 4, so at this point it appears to be entirely useless.

Full results:

           Strings     Binary
Algorithm  short:long  short:long
ELF        817:40      550:28
FNV        1.6:2       0.6:2.6
OAT        9:9.6       14:5
Jenkins*   2:1.3       12:3.6

* A close approximation of the lookup3 hash function.

So for the same random samples that ELF is struggling with (I have generated 3 separate sets), all of the other tested algorithms produce far fewer collisions.

I have searched for variants of the ELF algorithm, but the few examples I have found seem consistent with what I have implemented. The only variation I have seen was on this SO question: Using ELF to produce a tweaked hashmap. This variation includes h &= g >> 24 within the if-block, and clips the result to 31 bits. I tested that variation and it produced the same awful results.

Have I done something subtly but horribly wrong? I can't understand why it's performing so badly given that it is allegedly widely used in Unix.

asked Feb 06 '12 by Quick Joe Smith


2 Answers

It is not a cryptographic hash; it is a hashtable hash.

This is a perfectly reasonable performance for a hash function intended for use in a hash table. Typically you will be storing between hundreds and hundreds of thousands of objects and will want to quickly store and retrieve the objects.

You do this by dividing the table into buckets, each containing a linked list (or maybe an array). You then calculate the hash and, taking the remainder when dividing by the number of buckets, locate the bucket. Then you walk the linked list, comparing each object to find the one you want.

If the bucket is empty the object is not found. You can then either create one or take the other appropriate action depending on your application.

The hashtable should be sized to have approximately the same number of buckets as the expected number of items to store (or a few more), so most searches will find a bucket with zero, one or two entries.
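To make that concrete, here is a stripped-down sketch of that textbook separate-chaining scheme (illustrative only; this is not how any particular framework implements it):

using System.Collections.Generic;

public sealed class SimpleHashTable<TKey, TValue>
{
    private readonly LinkedList<KeyValuePair<TKey, TValue>>[] buckets;

    public SimpleHashTable(int bucketCount)
    {
        buckets = new LinkedList<KeyValuePair<TKey, TValue>>[bucketCount];
    }

    public void Add(TKey key, TValue value)
    {
        int index = BucketIndex(key);
        (buckets[index] ??= new LinkedList<KeyValuePair<TKey, TValue>>())
            .AddLast(new KeyValuePair<TKey, TValue>(key, value));
    }

    public bool TryGetValue(TKey key, out TValue value)
    {
        var chain = buckets[BucketIndex(key)];
        if (chain != null)
        {
            // Walk the (hopefully very short) chain, comparing each stored key.
            foreach (var pair in chain)
            {
                if (EqualityComparer<TKey>.Default.Equals(pair.Key, key))
                {
                    value = pair.Value;
                    return true;
                }
            }
        }

        value = default;    // empty bucket or no match: not found
        return false;
    }

    // The hash only needs to spread keys evenly across buckets, not be collision-free.
    private int BucketIndex(TKey key) =>
        (EqualityComparer<TKey>.Default.GetHashCode(key) & 0x7fffffff) % buckets.Length;
}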

For performance you want to balance the expense of calculating the hash against the expense of traversing a very short linked list if you get a collision. It is with this balance in mind that ELF and similarly purposed hash functions are designed.

In short:

  • In a hashtable, the occasional collision is a price worth paying for a faster hash.
  • In a cryptographic hash, a slow hash is a price worth paying for avoiding collisions.

If collisions are a problem in your application, use SHA1 or SHA256 or something designed with that in mind.
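In .NET that is only a few lines using System.Security.Cryptography (a sketch; note it produces a 32-byte digest, not an int, and is far slower than a hashtable hash):

using System;
using System.Security.Cryptography;
using System.Text;

byte[] data = Encoding.UTF8.GetBytes("some key material");
using var sha = SHA256.Create();
byte[] digest = sha.ComputeHash(data);              // 256-bit digest; collisions are not a practical concern
Console.WriteLine(BitConverter.ToString(digest));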

Note: For your use as an implementation of object.GetHashCode(), the hash code is only intended to speed comparisons ("fail fast") and for use in hash tables. You don't need it to be completely collision resistant, since you are going to fall back to a full equality comparison if it collides. You need balanced performance. I suggest just hashing the most important fields (using their own GetHashCode()) and XORing the values.
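For example, something along these lines (the Person type and its fields are invented purely for illustration):

using System;

public sealed class Person
{
    public string Name { get; }
    public DateTime BirthDate { get; }

    public Person(string name, DateTime birthDate)
    {
        Name = name;
        BirthDate = birthDate;
    }

    public override bool Equals(object obj) =>
        obj is Person other && Name == other.Name && BirthDate == other.BirthDate;

    // Combine the hash codes of the fields that participate in equality.
    // Plain XOR is cheap but symmetric, so consider extra mixing (e.g. multiplying
    // by a prime between fields) if swapped field values would otherwise collide.
    public override int GetHashCode() =>
        (Name?.GetHashCode() ?? 0) ^ BirthDate.GetHashCode();
}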

Edit: See also these hashes here:

  • http://www.burtleburtle.net/bob/hash/doobs.html
answered by Ben


The expected number of collisions in 524,000 random samples on a 32-bit hash is 34.

You're getting 34 collisions with long strings, so for long strings, this algorithm is performing more or less as expected.
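As a rough check, the standard birthday approximation reproduces a figure in the same ballpark (a back-of-the-envelope sketch, not part of any test harness):

using System;

// Birthday approximation: for n uniformly distributed 32-bit hashes, the expected
// number of colliding pairs is roughly n * (n - 1) / (2 * 2^32).
double n = 524288;                                    // the question's total sample count
double expected = n * (n - 1) / (2.0 * 4294967296.0);
Console.WriteLine(expected);                          // prints a value in the low thirties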

Hash collisions are far, far more likely on short strings since there is so much less entropy in the data, so it is in no way surprising to me that you're getting orders of magnitude worse performance on small strings.

It is surprising that you are getting only ten collisions with other hash algorithms. I would have expected a lot more.

On the subject of raw speed performance: you might do better to stop being so clever. The jitter may recognize and optimize the extremely common pattern:

for(int i = 0; i < array.Length; ++i)
    do something with array[i]

so as to avoid the recomputation of Length and avoid the range check on the array access. By trying to be clever and avoid the recomputation of Length, you might be fooling the jitter into no longer optimizing away the range check.
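Applied to the loop in the question, that just means writing it in the plain form (a sketch; the behaviour is identical):

// The jitter recognizes comparing i directly against key.Length and can elide
// the per-iteration range check on key[i].
for (int i = 0; i < key.Length; i++)
{
    h = (h << 4) + key[i];

    if ((g = h & c) != 0)
        h ^= g >> 24;

    h &= ~g;
}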

If you wish to avoid the range check entirely, you can always go to unsafe code: fix the array in place, obtain a pointer to it, and then increment the pointer, as though you were writing the program in C. You take responsibility for ensuring memory safety at that point, but odds are good it will be faster.
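A sketch of that unsafe variant (requires compiling with unsafe code enabled; getting the bounds right is now your responsibility):

public unsafe int Generate(byte[] key)
{
    const uint c = 0xf0000000;
    uint h = 0, g = 0;

    unchecked
    {
        fixed (byte* start = key)
        {
            byte* p = start;
            byte* end = start + key.Length;

            while (p < end)
            {
                h = (h << 4) + *p++;     // same ELF step, reading through the pointer

                if ((g = h & c) != 0)
                    h ^= g >> 24;

                h &= ~g;
            }
        }

        return (int)h;
    }
}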

Of course, that "armchair" performance analysis is worth exactly what you paid for it; to get a real analysis, try it and see what happens.

answered by Eric Lippert