I am currently working on choosing a couple of general-purpose hashing functions for use in Object.GetHashCode()
overrides. Initially, on the recommendation of this site I started with ELF. My C# implementation is below:
public int Generate(byte[] key) {
    const uint c = 0xf0000000;
    uint h = 0,
         g = 0;

    unchecked {
        for (int i = 0, len = key.Length; i < len; i++) {
            // Shift in the next byte, then fold the top nibble back down
            // and clear it, per the classic PJW/ELF scheme.
            h = (h << 4) + key[i];
            if ((g = h & c) != 0)
                h ^= g >> 24;
            h &= ~g;
        }
    }

    return (int)h;
}
My test case consists of 524,288 unique values, divided into short (1-64) and long (256-2048) strings drawn from a limited ASCII character set, plus arbitrary binary data in the same two length ranges (131,072 values in each of the four categories), so that each algorithm is tested under a variety of circumstances.
I also understand the limitations of this test scenario. A hashing algorithm may perform exceptionally well when hashing, say, URLs, but be awful at hashing JPGs or something. Random strings/binary data seems to me to be the best starting point for choosing a general-purpose function, though. I am happy to hear reasons why this is not the case.
I performed 3 separate test runs (generating a new set of random strings/bytes each time) and averaged the results.
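For reference, counting collisions per run amounts to something like the following sketch (illustrative only, not my exact harness):

using System;
using System.Collections.Generic;

// Illustrative only: count how many hash values collide with one already seen.
static int CountCollisions(IEnumerable<byte[]> keys, Func<byte[], int> hash)
{
    var seen = new HashSet<int>();
    int collisions = 0;
    foreach (byte[] key in keys)
    {
        if (!seen.Add(hash(key)))   // Add returns false when the value is already present
            collisions++;
    }
    return collisions;
}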
The ELF algorithm produced a horrific number of collisions in comparison to the other algorithms I'm testing.
To place this in context, the other three algorithms I tested produced between 3 and 10 collisions on average for the same tests. ELF is also amongst the slowest of the four, so at this point it appears to be entirely useless.
Full results:
            Strings           Binary
Algorithm   short : long      short : long
ELF         817   : 40        550   : 28
FNV         1.6   : 2         0.6   : 2.6
OAT         9     : 9.6       14    : 5
Jenkins*    2     : 1.3       12    : 3.6

* A close approximation of the lookup3 hash function.
So for the same random samples that ELF struggles with (I have generated 3 separate sets), every other algorithm I tested produces far fewer collisions.
I have searched for variants of the ELF algorithm, but the few examples I have found are consistent with what I have implemented. The only variation I have seen was on this SO question: Using ELF to produce a tweaked hashmap. That variation includes h &= g >> 24 within the if-block and clips the result to 31 bits. I tested it and it produced the same awful results.
Have I done something subtly but horribly wrong? I can't understand why it's performing so badly given that it is allegedly widely used in Unix.
It is not a cryptographic hash; it is a hashtable hash.
This is perfectly reasonable performance for a hash function intended for use in a hash table. Typically you will be storing between hundreds and hundreds of thousands of objects, and you want to store and retrieve them quickly.
You do this by dividing the objects into buckets, each containing a linked list (or maybe an array). You calculate the hash, take the remainder when dividing by the number of buckets to locate the bucket, and then walk that bucket's linked list comparing each object until you find the one you want.
If the bucket is empty the object is not found. You can then either create one or take the other appropriate action depending on your application.
The hashtable should be sized to have approximately the same number of buckets as the expected number of items to store (or a few more), so most searches will find a bucket with zero, one or two entries.
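To make the lookup concrete, here is a minimal sketch of that scheme (the class and member names are made up for the example):

using System.Collections.Generic;

// Illustrative only: locate an item in a chained hash table as described above.
public sealed class SimpleHashTable<T>
{
    private readonly LinkedList<T>[] _buckets;

    public SimpleHashTable(int bucketCount)
    {
        _buckets = new LinkedList<T>[bucketCount];
    }

    public bool Contains(T item)
    {
        // The hash, reduced modulo the bucket count, picks the bucket.
        int hash = EqualityComparer<T>.Default.GetHashCode(item) & 0x7fffffff;
        LinkedList<T> bucket = _buckets[hash % _buckets.Length];
        if (bucket == null)
            return false;                  // empty bucket: the item is not present

        // Walk the chain, comparing each entry against the one we want.
        foreach (T candidate in bucket)
            if (EqualityComparer<T>.Default.Equals(candidate, item))
                return true;

        return false;
    }
}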
For performance you want to balance the expense of calculating the hash against the expense of traversing a very short linked list when you get a collision. It is with this in mind that ELF and similarly purposed hash functions are designed.
In short:
If collisions are a problem in your application, use SHA1 or SHA256 or something designed with that in mind.
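For example, something along these lines (a rough sketch, not a drop-in GetHashCode replacement; note that truncating the digest to 32 bits still leaves you with the birthday-bound collision rate of any 32-bit value, what SHA-256 buys you is that collisions cannot be engineered deliberately):

using System;
using System.Security.Cryptography;

// Illustrative only: derive a 32-bit value from a SHA-256 digest.
public static int Sha256Hash32(byte[] key)
{
    using (var sha = SHA256.Create())
    {
        byte[] digest = sha.ComputeHash(key);   // 32-byte digest
        return BitConverter.ToInt32(digest, 0); // keep the first 4 bytes
    }
}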
Note: For your use as an implementation of object.GetHashCode(), the hash code is only intended to speed up comparisons ("fail fast") and for use in hash tables. You don't need it to be completely collision resistant, since you are going to fall back to a full equality comparison if it collides. You need balanced performance. I suggest just hashing the most important fields (using their own GetHashCode()) and XORing the values.
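For a hypothetical class, that suggestion might look something like this (the field names are invented for the example; XOR is cheap but order-insensitive, so swapping two field values yields the same hash):

// Illustrative only: combine the hash codes of the fields that matter for equality.
// Assumes a class with fields name, id and createdOn.
public override int GetHashCode()
{
    // XOR the per-field hash codes; good enough for a "fail fast" pre-check
    // before a full Equals comparison.
    return name.GetHashCode() ^ id.GetHashCode() ^ createdOn.GetHashCode();
}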
Edit: See also these hashes here:
The expected number of collisions in 524,000 random samples on a 32-bit hash is 34.
You're getting 34 collisions with long strings, so for long strings, this algorithm is performing more or less as expected.
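As a back-of-the-envelope check of that figure (my own rough approximation using the birthday bound, not part of the original answer):

using System;

// With n uniformly random 32-bit hashes, the expected number of colliding
// pairs is approximately n * (n - 1) / (2 * 2^32).
double n = 524288.0;                        // number of samples hashed
double space = 4294967296.0;                // 2^32 possible hash values
double expected = n * (n - 1.0) / (2.0 * space);
Console.WriteLine(expected);                // prints roughly 32, the same ballpark as the 34 quoted above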
Hash collisions are far, far more likely on short strings since there is so much less entropy in the data, so it is in no way surprising to me that you're getting orders of magnitude worse performance on small strings.
It is surprising that you are getting only ten collisions with the other hash algorithms; I would have expected a lot more.
On the subject of raw speed performance: you might do better to stop being so clever. The jitter may recognize and optimize the extremely common pattern:
for (int i = 0; i < array.Length; ++i)
{
    // do something with array[i]
}
so as to avoid the recomputation of Length and avoid the range check on the array access. By trying to be clever and avoid the recomputation of Length, you might be fooling the jitter into no longer optimizing away the range check.
If you wish to avoid the range check entirely, you can always go to unsafe code: fix the array in place, obtain a pointer to it, and then increment the pointer, as if you were writing the program in C. You take responsibility for memory safety at that point, but odds are good it will be faster.
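A minimal sketch of that approach, assuming the project is compiled with unsafe code enabled (the method name is mine, for illustration only):

// Pointer-based variant of the ELF loop; the pointer walk avoids the
// per-element range check at the cost of manual responsibility for safety.
public static unsafe int GenerateUnsafe(byte[] key)
{
    uint h = 0, g = 0;
    unchecked
    {
        fixed (byte* start = key)           // pin the array so the GC cannot move it
        {
            byte* p = start;
            byte* end = start + key.Length;
            while (p < end)                 // no per-element range check here
            {
                h = (h << 4) + *p++;
                if ((g = h & 0xf0000000) != 0)
                    h ^= g >> 24;
                h &= ~g;
            }
        }
    }
    return (int)h;
}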
Of course, that "armchair" performance analysis is worth exactly what you paid for it; to get a real analysis, try it and see what happens.