I was just doing a bit of optimization and was baffled by this.
My original code looked something like this:
HashSet<IExampleAble> alreadyProcessed; // a few million items

void someSetRoutineSlower(HashSet<IExampleAble> exampleSet)
{
    foreach (var item in exampleSet)
    {
        if (!alreadyProcessed.Contains(item))
        {
            // do Stuff
        }
    }
}
This took about 1.2 million ticks to process.
Then I tried the same with ExceptWith:
void someSetRoutineFaster(HashSet<IExampleAble> exampleSet)
{
    exampleSet.ExceptWith(alreadyProcessed); // doesn't this have to check each item of its collection against the other one, thus actually looping twice?
    foreach (var item in exampleSet)
    {
        // do Stuff
    }
}
and it was running at about 0.4-0.7 million ticks.
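(For reference, the tick counts come from a simple Stopwatch harness along these lines; sourceItems is a placeholder for my real data, and each routine gets its own copy of the set since ExceptWith mutates the set it is given:)

// Simplified timing sketch (Stopwatch.ElapsedTicks); sourceItems stands in
// for the real collection of IExampleAble items.
var copy1 = new HashSet<IExampleAble>(sourceItems);
var copy2 = new HashSet<IExampleAble>(sourceItems);

var sw = System.Diagnostics.Stopwatch.StartNew();
someSetRoutineSlower(copy1);
Console.WriteLine($"Contains version:   {sw.ElapsedTicks} ticks");

sw.Restart();
someSetRoutineFaster(copy2);
Console.WriteLine($"ExceptWith version: {sw.ElapsedTicks} ticks");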
What kind of optimization is going on in ExceptWith? Doesn't it also have to check every item, as I do in the first code snippet?
According to the Reference Source, the HashSet<T>.ExceptWith method in .NET Framework 4.7.2 looks like this:
public void ExceptWith(IEnumerable<T> other) {
    if (other == null) {
        throw new ArgumentNullException("other");
    }
    Contract.EndContractBlock();

    // this is already the empty set; return
    if (m_count == 0) {
        return;
    }

    // special case if other is this; a set minus itself is the empty set
    if (other == this) {
        Clear();
        return;
    }

    // remove every element in other from this
    foreach (T element in other) {
        Remove(element);
    }
}
The only explicit optimizations in the method are for the special cases where the set is already empty or is "excepted" with itself.
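A quick way to confirm those two shortcuts (a trivial sketch, not part of the source):

using System;
using System.Collections.Generic;

class SpecialCases
{
    static void Main()
    {
        var s = new HashSet<int> { 1, 2, 3 };
        s.ExceptWith(s);                  // other == this: cleared outright
        Console.WriteLine(s.Count);       // 0

        var empty = new HashSet<int>();
        empty.ExceptWith(new[] { 1, 2 }); // m_count == 0: returns immediately
        Console.WriteLine(empty.Count);   // 0
    }
}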
The speed-up you are experiencing probably comes from the difference between calling Contains(T) for every element and removing elements as you go, when the number of Contains(T) calls is comparable to the set size. On the face of it, the two should perform the same: the old implementation calls Contains(T) explicitly, and the new one performs the same kind of search inside Remove(T). The difference is that as elements are removed, the set's internal structure becomes more sparse. That means statistically fewer items (slots, in the source code's notation) per bucket, so finding an element gets faster; in the ideal case, an element that is present is the first item in its bucket.
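To see that effect in isolation, here is a minimal sketch (the key type, sizes, and seed are all hypothetical, not from the question): a key with a deliberately collision-prone hash, probed once with Contains against a full set and once with Remove against a draining one.

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

// Deliberately poor hash: only 1024 distinct hash codes, so buckets
// hold long chains and probe cost is dominated by chain length.
struct CollidingKey : IEquatable<CollidingKey>
{
    public int Value;
    public bool Equals(CollidingKey other) => Value == other.Value;
    public override bool Equals(object obj) => obj is CollidingKey k && Equals(k);
    public override int GetHashCode() => Value % 1024;
}

class SparsenessDemo
{
    static void Main()
    {
        const int N = 200_000;
        var rng = new Random(42);
        var keys = Enumerable.Range(0, N)
                             .Select(i => new CollidingKey { Value = i })
                             .OrderBy(_ => rng.Next()) // random probe order
                             .ToArray();

        // Contains pass: the set stays at full size, so every probe
        // walks a chain of full statistical length.
        var full = new HashSet<CollidingKey>(keys);
        var sw = Stopwatch.StartNew();
        int hits = 0;
        foreach (var k in keys) if (full.Contains(k)) hits++;
        Console.WriteLine($"Contains pass: {sw.ElapsedTicks} ticks ({hits} hits)");

        // Remove pass (what ExceptWith does internally): every removal
        // shortens a chain, so later probes get statistically cheaper.
        var draining = new HashSet<CollidingKey>(keys);
        sw.Restart();
        foreach (var k in keys) draining.Remove(k);
        Console.WriteLine($"Remove pass:   {sw.ElapsedTicks} ticks");
    }
}

With a random probe order, the Remove pass walks shorter and shorter chains, so it ends up doing roughly half the probes of the Contains pass; that is in the same ballpark as the difference reported in the question.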
It all depends on the quality of the hash function for your objects. Ideally each object would be alone in its bucket, but most real hash functions will distribute a million elements with collisions (multiple elements in the same bucket).
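For illustration (a hypothetical item type, not from the question), spreading elements across buckets just means GetHashCode mixes every field that Equals compares:

// Hypothetical item type; the point is that GetHashCode mixes every
// field that Equals compares, so instances spread across buckets.
class ExampleItem
{
    public int Id { get; set; }
    public string Name { get; set; }

    public override bool Equals(object obj) =>
        obj is ExampleItem other && Id == other.Id && Name == other.Name;

    public override int GetHashCode()
    {
        unchecked // classic 31-based combiner; integer overflow is fine here
        {
            int h = 17;
            h = h * 31 + Id;
            h = h * 31 + (Name?.GetHashCode() ?? 0);
            return h;
        }
    }
}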