Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Asymptotic behavior IEnumerable.Intersect vs HashedSet.IntersectWith

I've read a lot of posts and blogs about HashSet and LINQ Set Operations and I get the impression that the linq intersection method internally uses hashed set as the first set, and IEnumerable as the second. So the difference between the two is either O(n + m) for linq intersection, vs. O(n) for hashed set intersection between two hashed sets. Can I get a confirmation of this? The big O for LINQ intersect is not documented in MSDN.

like image 751
Wajid Ananda Poernomo Avatar asked Oct 21 '22 18:10

Wajid Ananda Poernomo


1 Answers

Well, it's implementation-specific, so in theory it could change - but basically the difference is just that using HashSet.IntersectWith you're starting with a hash set, so you only need to iterate over one collection.

The "obvious" implementations would give O(M + N) and O(N) complexity for Intersect and IntersectWith respectively - assuming a decent hash code, of course. I would be immensely surprised to see any other implementation, and I certainly haven't seen any evidence that any version of .NET has shipped with anything other than that.

Arguably if both arguments to Intersect were already HashSet<T> you could optimize this to just iterate over the smaller set and check whether each element is in the larger one. However, this has another problem that the sets may not use the same comparer as each other or as the Intersect call.

See my Edulinq implementation and post for some more details, including a note around an inaccuracy in MSDN. MSDN claims (at the time of this writing) that:

When the object returned by this method is enumerated, Intersect enumerates first, collecting all distinct elements of that sequence. It then enumerates second, marking those elements that occur in both sequences. Finally, the marked elements are yielded in the order in which they were collected.

That's not actually true, either in terms of the ordering or the timing:

  • It's second which is enumerated initially (completely, when MoveNext() is first called on the returned sequence)
  • The results are yielded as it iterates over first - they're streamed, rather than the "mark everything then yield the results" claimed by MSDN
like image 86
Jon Skeet Avatar answered Oct 27 '22 10:10

Jon Skeet