For some reason, it seems the Add
operation on a HashSet
is slower than the Contains
operation when the element already exists in the HashSet
.
Here is proof:
Stopwatch watch = new Stopwatch();
int size = 10000;
int iterations = 10000;
var s = new HashSet<int>();
for (int i = 0; i < size; i++) {
s.Add(i);
}
Console.WriteLine(watch.Time(() =>
{
for (int i = 0; i < size; i++) {
s.Add(i);
}
}, iterations));
s = new HashSet<int>();
for (int i = 0; i < size; i++) {
s.Add(i);
}
// outputs: 47,074,764
Console.WriteLine(watch.Time(() =>
{
for (int i = 0; i < size; i++) {
if (!s.Contains(i))
s.Add(i);
}
}, iterations));
// outputs: 41,125,219
Why is Contains
faster than Add
for already-existing elements?
Note: I'm using this Stopwatch
extension from another SO question.
public static long Time(this Stopwatch sw, Action action, int iterations) {
sw.Reset();
sw.Start();
for (int i = 0; i < iterations; i++) {
action();
}
sw.Stop();
return sw.ElapsedTicks;
}
UPDATE: Internal testing has revealed that the big performance diff only happens on the x64 version of the .NET framework. With the 32 bit version of the framework Contains seems run at identical speed to add (in fact it appears that the version with the contains runs a percent slower in some test runs) On X64 versions of the framework, the version with the contains seems to run about 15% faster.
On average, the contains() of HashSet runs in O(1) time. Getting the object's bucket location is a constant time operation. Taking into account possible collisions, the lookup time may rise to log(n) because the internal bucket structure is a TreeMap.
HashSet becomes faster for 10% only if we List is without specified capacity and checks each value before adding through whole list. If items count reduced to 4 then List again wins even in worst scenario (with 10% difference).
The worst-case performance of contains will be O(log n) for Java 8, and O(n) for Java 7, but average case closer to O(1). This is because the hashset is backed by a hashmap, and thus has the same efficiency as hashmap lookup (ie, HashMap.
Both Vector and HashSet Collection implementation classes performed poorly compared to the ArrayList Collection implementation class. Vector scored 68 TPS on average, while HashSet scored 9200 TPS on average. On the other hand the ArrayList outperformed Vector and HashSet by far, resulting in 421000 TPS on average.
AddIfNotPresent does an additional divide that Contains doesn't perform. Take a look at the IL for Contains:
IL_000a: call instance int32 class System.Collections.Generic.HashSet`1<!T>::InternalGetHashCode(!0)
IL_000f: stloc.0
IL_0010: ldarg.0
IL_0011: ldfld int32[] class System.Collections.Generic.HashSet`1<!T>::m_buckets
IL_0016: ldloc.0
IL_0017: ldarg.0
IL_0018: ldfld int32[] class System.Collections.Generic.HashSet`1<!T>::m_buckets
IL_001d: ldlen
IL_001e: conv.i4
IL_001f: rem
IL_0020: ldelem.i4
IL_0021: ldc.i4.1
IL_0022: sub
IL_0023: stloc.1
This is computing the bucket location for the hash code. The result is saved at local memory location 1.
AddIfNotPresent does something similar, but it also saves the computed value at location 2, so that it can insert the item into the hash table at that position if the item doesn't exist. It does that save because one of the locations is modified later in the loop that goes looking for the item. Anyway, here's the relevant code for AddIfNotPresent:
IL_0011: call instance int32 class System.Collections.Generic.HashSet`1<!T>::InternalGetHashCode(!0)
IL_0016: stloc.0
IL_0017: ldloc.0
IL_0018: ldarg.0
IL_0019: ldfld int32[] class System.Collections.Generic.HashSet`1<!T>::m_buckets
IL_001e: ldlen
IL_001f: conv.i4
IL_0020: rem
IL_0021: stloc.1
IL_0022: ldarg.0
IL_0023: ldfld int32[] class System.Collections.Generic.HashSet`1<!T>::m_buckets
IL_0028: ldloc.0
IL_0029: ldarg.0
IL_002a: ldfld int32[] class System.Collections.Generic.HashSet`1<!T>::m_buckets
IL_002f: ldlen
IL_0030: conv.i4
IL_0031: rem
IL_0032: ldelem.i4
IL_0033: ldc.i4.1
IL_0034: sub
IL_0035: stloc.2
Anyway, I think the extra divide is what's causing Add to take more time than Contains. At first glance, it looks like that extra divide could be factored out, but I can't say for sure without spending a little more time deciphering the IL.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With