Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

HashSets don't keep the elements unique if you mutate their identity

Tags:

c#

.net

clr

hashset

When working with HashSets in C#, I recently came across an annoying problem: HashSets don't guarantee unicity of the elements; they are not Sets. What they do guarantee is that when Add(T item) is called the item is not added if for any item in the set item.equals(that) is true. This holds no longer if you manipulate items already in the set. A small program that demonstrates (copypasta from my Linqpad):

void Main()
{
    HashSet<Tester> testset = new HashSet<Tester>();
    testset.Add(new Tester(1));
    testset.Add(new Tester(2));
    foreach(Tester tester in testset){
      tester.Dump();
    }
    foreach(Tester tester in testset){
      tester.myint = 3;
    }
    foreach(Tester tester in testset){
      tester.Dump();
    }
    HashSet<Tester> secondhashset = new HashSet<Tester>(testset);
    foreach(Tester tester in secondhashset){
      tester.Dump();
    }
}

class Tester{
  public int myint;

  public Tester(int i){
    this.myint = i;
  }

  public override bool Equals(object o){
    if (o== null) return false;
    Tester that = o as Tester;
    if (that == null) return false;
    return (this.myint == that.myint);
  }

  public override int GetHashCode(){
    return this.myint;
  }

  public override string ToString(){
    return this.myint.ToString();
  }
}

It will happily manipulate the items in the collection to be equal, only filtering them out when a new HashSet is built. What is advicible when I want to work with sets where I need to know the entries are unique? Roll my own, where Add(T item) adds a copy off the item, and the enumerator enumerates over copies of the contained items? This presents the challenge that every contained element should be deep-copyable, at least in its items that influence it's equality.

Another solution would be to roll your own, and only accepts elements that implement INotifyPropertyChanged, and taking action on the event to re-check for equality, but this seems severely limiting, not to mention a whole lot of work and performance loss under the hood.

Yet another possible solution I thought of is making sure that all fields are readonly or const in the constructor. All solutions seem to have very large drawbacks. Do I have any other options?

like image 590
Martijn Avatar asked Jul 10 '12 10:07

Martijn


People also ask

Why the HashSet is unique in Java?

HashSet is an implementation of Set Collection. Therefore, HashSet is a collection of unique data. In other words, if you try to put an object in a HashSet and that object is already present, the HashSet will ignore it. HashSet allows you add one object at a time or bulk in a form of a collection.

Can HashSet have objects?

Objects that you insert in HashSet are not guaranteed to be inserted in the same order. Objects are inserted based on their hash code. NULL elements are allowed in HashSet. HashSet also implements Serializable and Cloneable interfaces.


1 Answers

You're really talking about object identity. If you're going to hash items they need to have some kind of identity so they can be compared.

  • If that changes, it is not a valid identity method. You currently have public int myint. It really should be readonly, and only set in the constructor.
  • If two objects are conceptually different (i.e. you want to treat them as different in your specific design) then their hash code should be different.
  • If you have two objects with the same content (i.e. two value objects that have the same field values) then they should have the same hash codes and should be equal.
  • If your data model says that you can have two objects with the same content but they can't be equal, you should use a surrogate id, not hash the contents.
  • Perhaps your objects should be immutable value types so the object can't change
  • If they are mutable types, you should assign a surrogate ID (i.e. one that is introduced externally, like an increasing counter id or using the object's hashcode) that never changes for the given object

This is a problem with your Tester objects, not the set. You need to think hard about how you define identity. It's not an easy problem.

like image 151
Joe Avatar answered Oct 23 '22 04:10

Joe