Why doesn't IOrderedEnumerable re-implement .Contains() to gain performance

Question

If you go here: The IOrderedEnumerableDocs and click on the .Contains() method then it takes you to here: the generalised Enumerable.Contains() docs

I take this to mean that it's just using the underlying IEnumerable implementation?

This seems strange given the potential for a more performant search given that you know you have a sorted list that you can compare against your element (e.g. to do a binary search to confirm whether the element is present, rather than enumerating the whole set?

Am I missing anything?

Jon Hanna · Accepted Answer

It's worth noting from the beginning that the fact that a given method is only documented as operating on IEnumerable<T> does not mean that it isn't optimised for given implementations or derived interfaces. In fact a great many of the methods in Enumerable take different paths for different derived interfaces and/or concrete implementations. The classic example here is that Count() takes a different path if the IEnumerable<T> it is called on implements ICollection<T> or ICollection. There are several further examples of this in the full framework, and even more in .NET Core, including some which take optimised paths for the implementation of IOrderedEnumerable<T> you get from calling OrderBy().

Some of which are my doing, because my hobby these days is contributing to .NET Core, particularly to Linq, and particularly performance improvements (though obviously if I'm hacking on something I need to increase tests on the bits I'm touching, and when doing so turns up minor bugs they get prioritised over performance improvements).

When it comes to IOrderedEnumerable, I've done things like change .OrderBy(someLambda).Skip(j).Take(k) (common paging idiom) from O(n log n) time to compute and O(j + k) time to enumerate to O(n + k log k) time to compute and O(k) time to enumerate, and .OrderBy(someLambda).First() for O(n) space and O(n log n) time to O(1) space and O(n) time, and so on.

I might look at improving other methods, and of course if I don't it's quite possible someone else would.

If I do, I would not do it as you suggest.

Firstly, to have a separate overload for IOrderedEnumerable<T> would require adding a method to the public API but only cover some cases (maybe what we're given as an IEnumerable<T> is in fact an IOrderedEnumerable<T>). Much better to just have an overload for IEnumerable<T> and detect the IOrderedEnumerable<T> case.

Secondly to use binary search the we would have to know the means by which the IOrderedEnumerable was sorted. This is possible with the OrderedEnumerable<TElement, TKey> created by calls of OrderBy but not more generally.

Thirdly, it would not be the biggest possible gain.

The current costs of source.OrderBy(someLambda).Contains(someItem) are as follows:

Buffer source: O(n) space, O(n) time.
Sort the buffer: O(n log n) time (average, O(n²) worse).
Find the an item that matches someItem, or confirm none exists.: O(n) time.

If Contains() was optimsed to use binary search it would become:

Buffer source: O(n) space, O(n) time.
Sort the buffer: O(n log n) time (average, O(n²) worse).
Find the an item that matches someItem, or confirm none exists.: O(log n) time (average, O(n) worse because a precise match may sort at the same level as all elements and have to be compared with all of them).

However, that's a complete waste. If we want to optimise Contains() (and a great many other aggregate methods for that matter) the optimum stragey would be:

Call source.Contains(someItem) and return the result. This will at worse be O(n) time and O(1) space, though it may be O(log n) or O(1) time if source is for example a HashSet<T> (a case that Contains() is already optimised for). In both theory and practice it will end up being faster than the buffering step above.

Implementing that change would be considerably less work, and a much bigger gain.

I've considered this, and might indeed submit such a PR, but I'm not yet sure if on balance it's worth it (and hence what my opinion would be if someone else submits such a PR) since it's almost always easier for the the caller to turn ….OrderBy(foo).Contains(bar) into .Contains(bar) themselves, and the check needed by optimising for such a case would be cheap, but not entirely free.

Sriram Sakthivel · Answer

To be able to use Binary Search, you need some kind of sorted data structure. Perhaps a sorted array or SortedList. But you have just got a IOrderedEnumerable implementation only; the query isn't materialized yet.

Your IOrderedEnumerable could simple be created using Iterator blocks or some lazy queries(which is how it is usually generated). There is no real data structure there. There is no way you can get all the elements from IOrderedEnumerable or IEnumerable without enumerating them which is O(n).

So there is no way you can implement a Binary Search or something similar.

Why doesn't IOrderedEnumerable re-implement .Contains() to gain performance

Tags:

c#

contains

ienumerable

iorderedenumerable

Brondahl

2 Answers

Jon Hanna

Sriram Sakthivel

Recent Activity

Donate For Us

Why doesn't IOrderedEnumerable re-implement .Contains() to gain performance

Tags:

c#

contains

ienumerable

iorderedenumerable

Brondahl

2 Answers

Jon Hanna

Sriram Sakthivel

Related questions

Recent Activity

Donate For Us