Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Union vs Unionwith in HashSet

Tags:

What the difference between HashSet.Union vs HashSet.Unionwith when i combine 2 hashsets.

I am trying to combine like this:

HashSet<EngineType> enginesSupportAll = _filePolicyEvaluation.EnginesSupportAll;         enginesSupportAll = enginesSupportAll != null ? new HashSet<EngineType>(engines.Union(enginesSupportAll)) : enginesSupportAll; 

what is the best method for this example and why?

like image 281
Max.Futerman Avatar asked Aug 28 '17 11:08

Max.Futerman


2 Answers

Given HashSet<T> A and HashSet<T> B there are four ways to AB:

  1. new HashSet<t>(A.Union(B))
    (see HashSet<T&>(IEnumerable<T>) and Enumerable.Union<T>(IEnumerable<T>, IEnumerable<T>))
  2. A.UnionWith(B)
  3. HashSet<T> C = new HashSet<T>(A); C.UnionWith(B);
  4. new HashSet<t>(A.Concat(B))
    (see Enumerable.Concat<T>(IEnumerable<T>, IEnumerable<T>))

Each has its advantages and disadvantages:

  • 1 and 4 are expressions that results in a HashSet whereas 2 and 3 are statements or statement blocks.
    Expression 1 and 4 can be used in more places than 2 and 3. For example, using 2 or 3 in a linq query syntax expression is cumbersome:
    from x in setofSetsA as IEnumerable<HashSet<T>> from y in setOfSetsB as IEnumerable<HashSet<T>> select x.UnionWith(y) will not work since UnionWith returns void.
  • 1, 3, and 4 preserve A and B as they are and return a fresh set whereas 2 modifies A.
    There are situations where modifying one of the original sets is bad and there are situations where at least one of the original sets can be modified without negative consequences.
  • Computational cost:

    A.UnionWith(B)
    (≈ O((log(|A∪B|) - log(|A|)) * |A∪B|) + O(|B|))

    HashSet<T> C = new HashSet<T>(A); C.UnionWith(B);
    (≈ O((log(|A∪B|) - log(|A|)) * |A∪B|) + O(|A| + |B|))

    HashSet<T>(A.Concat(B))
    (≈ O(log(|A∪B|) * |A∪B|) + O(|A| + |B|))

    HashSet<T>(A.Union(B))
    (≈ 2 * O(log(|A∪B|) * |A∪B|) + O(|A| + |B| + |A∪B|))

    The next section delves into the reference source to see the basis of these performance estimates.

Performance

HashSet<T>

In union option 1, 3, and 4 the constructor HashSet<T>(IEnumerable<T>, IEqualityComparer<T>) is used to create a HashSet<T> from an IEnumerable<T>. If the passed IEnumerable<T> has a Count property —i.e. if it is an ICollection<T>—, this property is used to set the size of the new HashSet:

int suggestedCapacity = 0; ICollection<T> coll = collection as ICollection<T>; if (coll != null) {     suggestedCapacity = coll.Count; } Initialize(suggestedCapacity); 

-- HashSet.cs line 136–141

The [Count()][10] method is never called. So if the IEnumerable's count can be retrieved without effort it is used to reserve capacity; otherwise the HashSet grows and reallocates when new elements are added.
In option 1 A.Union(B) and option 4 A.Concat(B) are not ICollection<T> therefore the created HashSet will grow and reallocate a few (≈ log(|A∪B|)) times. Option 3 can use the Count of A.

The constructor calls UnionWith to fill the new empty HashSet:

this.UnionWith(collection); 

-- HashSet.cs line 143

UnionWith(IEnumerable<T>) iterates over the elements in the IEnumerable<T> passed as argument and calls AddIfNotPresent(T) for each.

AddIfNotPresent(T) inserts elements and ensures that duplicates are never inserted into the set.
HashSet<T> is implemented as an array of slots, m_slots, and an array of buckets, m_buckets. A bucket contains just an int index into the m_slots array. Per bucket the Slots in m_slots form a linked list with an index to the next Slot in m_slots.

AddIfNotPresent(T) jumps to the correct bucket and then traverses its linked list to check if the element is already present:

for (int i = m_buckets[hashCode % m_buckets.Length] - 1; i >= 0; i = m_slots[i].next) {     if (m_slots[i].hashCode == hashCode && m_comparer.Equals(m_slots[i].value, value)) {         return false;     } } 

-- HashSet.cs line 968–975

Next a free index is found and a slot is reserved. First the list of free slots, m_freelist, is checked. When there are no slots in the freelist, the next empty slot in the m_slots array is used. More capacity is reserved (via IncreaseCapacity()) if there are no slots in the freelist and no empty slots:

int index; if (m_freeList >= 0) {     index = m_freeList;     m_freeList = m_slots[index].next; } else {     if (m_lastIndex == m_slots.Length) {         IncreaseCapacity();         // this will change during resize         bucket = hashCode % m_buckets.Length;     }     index = m_lastIndex;     m_lastIndex++; } 

-- HashSet.cs line 977–990

AddIfNotPresent(T) has three operations that demand some computation: the call to object.GetHashCode(), calling object.Equals(object) when there is a collision, and IncreaseCapacity(). The actual adding of an element only incurs the cost of setting some pointers and some ints.

When the HashSet<T> needs to IncreaseCapacity() the capacity is at least doubled. Therefore we can conclude that on average a HashSet<T> is filled for 75%. If the hashes are evenly distributed, the expectancy of a hash collision is also 75%.

SetCapacity(int, bool), called by IncreaseCapacity(), is the most expensive: it allocates new arrays, it copies the old slot array to the new array, and it recalculates the bucket lists:

Slot[] newSlots = new Slot[newSize]; if (m_slots != null) {     Array.Copy(m_slots, 0, newSlots, 0, m_lastIndex); }  ...  int[] newBuckets = new int[newSize]; for (int i = 0; i < m_lastIndex; i++) {     int bucket = newSlots[i].hashCode % newSize;     newSlots[i].next = newBuckets[bucket] - 1;     newBuckets[bucket] = i + 1; } m_slots = newSlots; m_buckets = newBuckets; 

-- HashSet.cs line 929–949

Option 1 and 4 (new HashSet<T>(A.Union(B))) will result in slightly more calls to IncreaseCapacity(). The cost —without the cost of A.Union(B) or A.Concat(B)— is approximately O(log(|A∪B|) * |A∪B|).
Whereas when using option 2 (A.UnionWith(B)) or option 3 (HashSet<T> C = new HashSet<T>(A); C.UnionWith(B)) we get a 'discount' of log(|A|) on the cost: O((log(|A∪B|) - log(|A|)) * |A∪B|). It pays (slightly) to use the largest set as the target into witch the other is merged.

Enumerable<T>.Union(IEnumerable<T>)

Enumerable<T>.Union(IEnumerable<T>) is implemented via UnionIterator<T>(IEnumerable<T>, IEnumerable<T>, IEqualityComparer<T>).
The UnionIterator uses Set<T> —an internal class in Enumerable.cs— that is very similar to HashSet<T>. UnionIterator lazily Add(T)s items from A and B to this Set<T> and yields the elements if they could be added. The work is done in Find(T, bool) which is similar to HashSet<T>.AddIfNotPresent(T). Check if the element is already present:

int hashCode = InternalGetHashCode(value); for (int i = buckets[hashCode % buckets.Length] - 1; i >= 0; i = slots[i].next) {     if (slots[i].hashCode == hashCode && comparer.Equals(slots[i].value, value)) return true; } 

-- Enumerable.cs line 2423–2426

Find a free index and reserve a slot:

int index; if (freeList >= 0) {     index = freeList;     freeList = slots[index].next; } else {     if (count == slots.Length) Resize();     index = count;     count++; } int bucket = hashCode % buckets.Length; slots[index].hashCode = hashCode; slots[index].value = value; slots[index].next = buckets[bucket] - 1; buckets[bucket] = index + 1; 

-- Enumerable.cs line 2428–2442

Resize() is similar to IncreaseCapacity(). The biggest difference between the two is that Resize() does not use a prime number for the number of buckets, so with bad GetHashCode() there is a slightly higher chance for collisions. Code of Resize():

int newSize = checked(count * 2 + 1); int[] newBuckets = new int[newSize]; Slot[] newSlots = new Slot[newSize]; Array.Copy(slots, 0, newSlots, 0, count); for (int i = 0; i < count; i++) {     int bucket = newSlots[i].hashCode % newSize;     newSlots[i].next = newBuckets[bucket] - 1;     newBuckets[bucket] = i + 1; } buckets = newBuckets; slots = newSlots; 

-- Enumerable.cs line 2448–2458

The performance cost of A.Union(B) is not significantly different from that of HashSet<T> C = new HashSet<T>(); C.UnionWith(A); C.UnionWith(B);. In option 1 (new HashSet<T>(A.Union(B))) the same HashSet is created twice resulting in a very costly 2 * O(log(|A∪B|) * (|A∪B|)). Option 4 results from knowing how HashSet<T>(IEnumerable<T>) and Enumerable.Union(IEnumerable<T>, IEnumerable<T>) are implemented. It avoids the redundant A.Union(B) leading to a cost of O(log(|A∪B|) * |A∪B|).

like image 70
Kasper van den Berg Avatar answered Sep 19 '22 07:09

Kasper van den Berg


Well, it's not HashSet.Union but Enumerable.Union, so you are using a LINQ extension method that works with any kind of IEnumerable<> whereas HashSet.UnionWith is a real HashSet method that modifes the current instance.

  • Union returns an IEnumerable<TSource>
  • UnionWith is void, it modifies the current HashSet instance
  • maybe UnionWith is slightly more efficient because it can be optimized

If you don't want to support any kind of sequence in your method so HashSet is fix and you can modify it, use that, otherwise use the LINQ extension. If you create the HashSet instance just for this purpose it doesn't really matter and i would prefer LINQ to be more flexible and to be able to chain my query.

like image 42
Tim Schmelter Avatar answered Sep 22 '22 07:09

Tim Schmelter