
Good GetHashCode() override for List of Foo objects respecting the order

EnumerableObject : IEnumerable<Foo>

wraps a List<Foo>

If a.SequenceEqual(b) holds for two EnumerableObject instances a and b, then they are equal.

Therefore, GetHashCode must be implemented to match. The problem is that XORing each element's hash code returns the same hash for any list containing the same elements, regardless of order. That works, strictly speaking, but it produces many collisions, which slows down retrieval.

What is a good, fast GetHashCode method for lists of objects that is order dependent?
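To make the collision problem concrete, here is a quick sketch (hypothetical helper names) showing that a plain XOR combiner gives the same hash for any reordering of the same elements, while a mult+add combiner does not:

```csharp
using System;
using System.Linq;

class XorDemo
{
    // XOR-combining is order-insensitive: any permutation of the same
    // elements XORs to the same value.
    static int XorHash(int[] items) => items.Aggregate(0, (h, x) => h ^ x);

    // Mult+add combining is order-sensitive.
    static int OrderedHash(int[] items)
    {
        unchecked
        {
            int hash = 19;
            foreach (var x in items) hash = hash * 31 + x;
            return hash;
        }
    }

    static void Main()
    {
        var a = new[] { 1, 2, 3 };
        var b = new[] { 3, 2, 1 };
        Console.WriteLine(XorHash(a) == XorHash(b));         // True — a collision
        Console.WriteLine(OrderedHash(a) == OrderedHash(b)); // False — order matters
    }
}
```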

Asked Nov 11 '11 by Ben B.



2 Answers

I'd do it the same way I normally combine hash codes - with an addition and a multiplication:

public override int GetHashCode()
{
    unchecked
    {
        int hash = 19;
        foreach (var foo in foos)
        {
            hash = hash * 31 + foo.GetHashCode();
        }
        return hash;
    }
}

(Note that you shouldn't add anything to the list after this has been used for the key in a hash table of any description, as the hash will change. This also assumes that there are no null entries - if there could be, you need to take account of that.)
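To illustrate that warning, here is a minimal sketch using a hypothetical FooList wrapper: once the list is mutated, the hash code moves, so a hash-table bucket computed from the old value can no longer locate the key.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical minimal wrapper, just to illustrate the mutation hazard.
class FooList
{
    public List<int> Items = new List<int>();

    public override int GetHashCode()
    {
        unchecked
        {
            int hash = 19;
            foreach (var item in Items) hash = hash * 31 + item;
            return hash;
        }
    }
}

class MutationDemo
{
    static void Main()
    {
        var key = new FooList { Items = { 1, 2 } };
        int before = key.GetHashCode();

        key.Items.Add(3); // mutating after the object has been used as a key...

        int after = key.GetHashCode();
        // ...changes the hash, so a Dictionary would look in the wrong bucket.
        Console.WriteLine(before != after); // True
    }
}
```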

Answered Sep 22 '22 by Jon Skeet


Firstly, double-check that you need a hashcode at all. Are you going to be putting these lists into a hash-mapped structure (e.g. dictionary, hashset, etc)? If not, forget about it.

Now, assuming that you mean that EnumerableObject already overrides Equals(object) (and hopefully therefore also implements IEquatable<EnumerableObject>) for some reason, then this is indeed necessary. You want to balance speed versus bit distribution.

A good starting point is a mult+add or a shift+xor like:

public override int GetHashCode()
{
    int res = 0x2D2816FE;
    foreach (var item in this)
    {
        res = res * 31 + (item == null ? 0 : item.GetHashCode());
    }
    return res;
}

(This assumes that you are using item.Equals() for your sequence equality comparison; if you're using an IEqualityComparer's Equals, you'll need to call into its GetHashCode instead.)

From there we can optimise.

If null items are disallowed, remove the null-check (be careful, this will make the code throw if it ever does find a null).

If very large lists are common, we need to reduce the number of items examined, while trying not to cause lots of collisions. Compare the following implementations:

public override int GetHashCode()
{
    int res = 0x2D2816FE;
    int max = Math.Min(Count, 16);
    for (int i = 0; i != max; ++i)
    {
        var item = this[i];
        res = res * 31 + (item == null ? 0 : item.GetHashCode());
    }
    return res;
}

public override int GetHashCode()
{
    int res = 0x2D2816FE;
    int min = Math.Max(-1, Count - 16);
    for (int i = Count - 1; i != min; --i)
    {
        var item = this[i];
        res = res * 31 + (item == null ? 0 : item.GetHashCode());
    }
    return res;
}

public override int GetHashCode()
{
    int res = 0x2D2816FE;
    int step = Count / 16 + 1;
    for (int i = 0; i < Count; i += step)
    {
        var item = this[i];
        res = res * 31 + (item == null ? 0 : item.GetHashCode());
    }
    return res;
}

Each of these restricts the total number of items examined, which speeds execution but risks poorer-quality hashes. Which (if any) is best depends on whether collections with the same start or the same end are more likely.

Changing the number 16 above adjusts the balance: a smaller value is faster, while a larger value gives better hash quality and a lower risk of hash collisions.
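As a sketch of that trade-off (using a hypothetical SampledHash helper that mirrors the step-sampling variant above): two lists that differ only at an index the sampler skips will collide.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class SamplingDemo
{
    // Hypothetical helper mirroring the step-sampling variant above.
    static int SampledHash(IReadOnlyList<int> items)
    {
        unchecked
        {
            int res = 0x2D2816FE;
            int step = items.Count / 16 + 1;
            for (int i = 0; i < items.Count; i += step)
                res = res * 31 + items[i];
            return res;
        }
    }

    static void Main()
    {
        var a = Enumerable.Range(0, 32).ToArray(); // Count == 32, so step == 3: samples 0, 3, 6, ...
        var b = (int[])a.Clone();
        b[1] = 999;                                // index 1 is never sampled

        Console.WriteLine(SampledHash(a) == SampledHash(b)); // True — a collision
    }
}
```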

Edit: And now you can use my implementation of SpookyHash v. 2:

public override int GetHashCode()
{
    var hasher = new SpookyHash(); // use the seeded overloads if you need to prevent HashDoS
    foreach (var item in this)
        hasher.Update(item.GetHashCode()); // or relevant fields of item, etc.
    return hasher.Final().GetHashCode();
}

This will give a much better distribution than mult+add or shift+xor, while also being particularly fast (especially in 64-bit processes, as the algorithm is optimised for them, though it works well on 32-bit too).

Answered Sep 19 '22 by Jon Hanna