
Why is HashSet.ExceptWith twice as fast as iterating and checking !Contains on the other collection?

I was just doing a bit of optimization and was baffled by this.

My original code looked something like this:

    HashSet<IExampleAble> alreadyProcessed; // a few million items

    void someSetRoutineSlower(HashSet<IExampleAble> exampleSet)
    {
        foreach (var item in exampleSet)
        {
            if (!alreadyProcessed.Contains(item))
            {
                // do stuff
            }
        }
    }

This took about 1.2 million ticks to process.

Then I tried the same with ExceptWith:

    void someSetRoutineFaster(HashSet<IExampleAble> exampleSet)
    {
        // Doesn't this have to check each item of its collection
        // against the other one, i.e. effectively loop twice?
        exampleSet.ExceptWith(alreadyProcessed);

        foreach (var item in exampleSet)
        {
            // do stuff
        }
    }

and it was running at about 0.4-0.7 million ticks.
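
For reference, timings like these can be taken with a Stopwatch harness along the following lines (a minimal sketch; the int keys, set sizes, and 50% overlap stand in for my real IExampleAble items):

    using System;
    using System.Collections.Generic;
    using System.Diagnostics;
    using System.Linq;

    static class ExceptWithBenchmark
    {
        static void Main()
        {
            // 1M already-processed items, 1M incoming items, half of them overlapping
            var alreadyProcessed = new HashSet<int>(Enumerable.Range(0, 1_000_000));
            var exampleSet = new HashSet<int>(Enumerable.Range(500_000, 1_000_000));

            var sw = Stopwatch.StartNew();
            int fresh = 0;
            foreach (var item in exampleSet)
                if (!alreadyProcessed.Contains(item))
                    fresh++; // stand-in for "do stuff"
            sw.Stop();
            Console.WriteLine($"Contains loop: {sw.ElapsedTicks} ticks, {fresh} new items");

            sw.Restart();
            exampleSet.ExceptWith(alreadyProcessed);
            sw.Stop();
            Console.WriteLine($"ExceptWith:    {sw.ElapsedTicks} ticks, {exampleSet.Count} new items");
        }
    }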

What kind of optimization is going on in ExceptWith? Doesn't it also have to check every item, just as my first code snippet does?


1 Answer

According to the Reference Source, the HashSet<T>.ExceptWith method in .NET Framework 4.7.2 looks like this:

    public void ExceptWith(IEnumerable<T> other) {
        if (other == null) {
            throw new ArgumentNullException("other");
        }
        Contract.EndContractBlock();

        // this is already the empty set; return
        if (m_count == 0) {
            return;
        }

        // special case if other is this; a set minus itself is the empty set
        if (other == this) {
            Clear();
            return;
        }

        // remove every element in other from this
        foreach (T element in other) {
            Remove(element);
        }
    }

The only explicit optimizations in the method are for the special cases where the set is empty or is "excepted" with itself.
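
For illustration, a quick sketch of those two shortcuts:

    var a = new HashSet<int>();       // already empty
    a.ExceptWith(new[] { 1, 2, 3 });  // m_count == 0: returns immediately

    var b = new HashSet<int> { 1, 2, 3 };
    b.ExceptWith(b);                  // other == this: just calls Clear()
    Console.WriteLine(b.Count);       // prints 0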

The speed-up you are experiencing probably comes from the difference between calling Contains(T) for every element and removing elements outright, when the number of Contains(T) calls is comparable to the set size. On the face of it the two should perform the same: the old implementation calls Contains(T) explicitly, while the new one performs the same kind of search inside Remove(T). The difference is that as elements are removed, the set's internal structure becomes more sparse. This leaves statistically fewer items ("slots" in the source code's notation) per bucket, so finding an element becomes faster; if it is present, it is likely the first item in its bucket.
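
You can observe this effect with keys that deliberately collide in small groups. The CoarseKey type below is hypothetical and purely for illustration; absolute tick counts will vary by machine, but the same probes should get noticeably cheaper once the collision chains have been thinned out by Remove(T):

    using System;
    using System.Collections.Generic;
    using System.Diagnostics;
    using System.Linq;

    // Hypothetical key type: every four consecutive ids share a hash code,
    // so they land in the same bucket and form a four-entry chain.
    sealed class CoarseKey
    {
        public readonly int Id;
        public CoarseKey(int id) { Id = id; }
        public override int GetHashCode() => Id / 4;
        public override bool Equals(object o) => o is CoarseKey k && k.Id == Id;
    }

    static class SparseLookupDemo
    {
        static void Main()
        {
            var keys = Enumerable.Range(0, 1_000_000)
                                 .Select(i => new CoarseKey(i)).ToArray();
            var set = new HashSet<CoarseKey>(keys);
            var probes = keys.Where(k => k.Id % 4 == 0).ToArray();

            Console.WriteLine("Dense:  " + TimeProbes(set, probes) + " ticks");

            // Thin the set out: keep only one key per collision group.
            foreach (var k in keys)
                if (k.Id % 4 != 0) set.Remove(k);

            Console.WriteLine("Sparse: " + TimeProbes(set, probes) + " ticks");
        }

        static long TimeProbes(HashSet<CoarseKey> set, CoarseKey[] probes)
        {
            var sw = Stopwatch.StartNew();
            foreach (var p in probes) set.Contains(p);
            sw.Stop();
            return sw.ElapsedTicks;
        }
    }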

It all depends on the quality of the hash function for your objects. Ideally each object should be alone in its bucket, but most real hash functions will distribute a million elements with some collisions (multiple elements in the same bucket).
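
As an extreme (hypothetical) illustration, a degenerate hash code puts every element into the same bucket, turning Contains(T) and Remove(T) into linear scans of the whole set:

    // Worst case: all instances collide, so set operations degrade to O(n).
    sealed class WorstKey
    {
        public readonly int Id;
        public WorstKey(int id) { Id = id; }
        public override int GetHashCode() => 0; // every instance in one bucket
        public override bool Equals(object o) => o is WorstKey k && k.Id == Id;
    }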
