Using Reflector I have noticed that the System.Linq.Enumerable.Count method contains a check that optimizes it for the case where the IEnumerable<T> passed in is actually an ICollection<T>. If the cast succeeds, the Count method does not need to iterate over every element; it can simply read the Count property of the ICollection<T>.

Based on this I was starting to think that IEnumerable<T> can be used as a read-only view of a collection, without the performance loss that I originally expected based on the API of IEnumerable<T>.

I was interested in whether the optimization of Count still holds when the IEnumerable<T> is the result of a Select over an ICollection<T>, but based on the reflected code this case is not optimized and requires iterating through all the elements.

Do you draw the same conclusions from Reflector? What could be the reason behind the lack of this optimization? It seems like a lot of time is wasted in this common operation. Does the spec require that each element be evaluated even if the Count could be determined without doing that?
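For reference, the fast path in question looks roughly like this (a simplified sketch of what Reflector shows, not the exact framework source):

public static int Count<TSource>(this IEnumerable<TSource> source) {
    if (source == null) throw new ArgumentNullException("source");

    // Fast path: if the sequence really is an ICollection<TSource>,
    // its Count property is O(1) and no enumeration is needed.
    ICollection<TSource> collection = source as ICollection<TSource>;
    if (collection != null)
        return collection.Count;

    // Slow path: enumerate and count every element.
    int count = 0;
    using (IEnumerator<TSource> e = source.GetEnumerator()) {
        while (e.MoveNext())
            count++;
    }
    return count;
}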
IEnumerable doesn't have a Count method.
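More precisely, Count() in these examples is the System.Linq.Enumerable.Count extension method, so it is only available with a using System.Linq directive. A minimal illustration:

using System;
using System.Collections.Generic;
using System.Linq; // needed: Count() is an extension method, not a member of IEnumerable<T>

class CountExample {
    static void Main() {
        IEnumerable<int> numbers = new List<int> { 1, 2, 3 };
        // int n = numbers.Count;           // does not compile: IEnumerable<int> has no Count property
        Console.WriteLine(numbers.Count()); // prints 3 via Enumerable.Count
    }
}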
Any() is always faster than Count() > 0.
First, Any() without parameters is quicker than Count() as it does not iterate through the whole collection. Second, Any() improves the readability of your code, indicating that the developer just wants to check if there are any items in the collection and isn't concerned with the exact number of items.
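For example (assuming items is a lazily produced IEnumerable<T> rather than an ICollection<T> with a cheap Count property):

// O(n) in the worst case: Count() walks the whole sequence before the comparison.
if (items.Count() > 0) {
    // handle non-empty sequence
}

// Stops after the first element, and states the intent directly.
if (items.Any()) {
    // handle non-empty sequence
}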
It doesn't really matter that the result of Select is lazily evaluated. The Count is always equivalent to the count of the original collection, so it could certainly have been retrieved directly by returning a specific object from Select that could be used to short-circuit evaluation of the Count method.

The reason it's not possible to optimize away evaluation of the Count() method on the return value of a Select call on something with a determined count (like a List<T>) is that it could change the meaning of the program.

The selector function passed to the Select method is allowed to have side effects, and those side effects are required to happen deterministically, in a predetermined order.
Assume:
new[]{1,2,3}.Select(i => { Console.WriteLine(i); return 0; }).Count();
The documentation requires this code to print
1
2
3
Even though the count is really known from the start and could be optimized, the optimization would change the behavior of the program. That's why you can't avoid enumerating the collection. This is exactly one of the reasons why compiler optimizations are much easier in pure functional languages.
UPDATE: Apparently, it's not clear that it's perfectly possible to implement Select and Count so that Selects on an ICollection<T> will still be lazily evaluated but the Count() will be evaluated in O(1) without enumerating the collection. I'm going to do that without changing the interface of any methods. A similar thing is already done for ICollection<T>:
private interface IDirectlyCountable {
    int Count { get; }
}

private class SelectICollectionIterator<TSource,TResult> : IEnumerable<TResult>, IDirectlyCountable {
    ICollection<TSource> sequence;
    Func<TSource,TResult> selector;

    public SelectICollectionIterator(ICollection<TSource> source, Func<TSource,TResult> selector) {
        this.sequence = source;
        this.selector = selector;
    }

    public int Count { get { return sequence.Count; } }

    // ... GetEnumerator ...
}
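For completeness, the elided GetEnumerator could be filled in along these lines (a sketch; the projection itself stays lazy, only the count is answered directly):

public IEnumerator<TResult> GetEnumerator() {
    // Defer running the selector until the sequence is actually enumerated.
    foreach (TSource item in sequence)
        yield return selector(item);
}

System.Collections.IEnumerator System.Collections.IEnumerable.GetEnumerator() {
    return GetEnumerator();
}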
public static IEnumerable<TResult> Select<TSource,TResult>(this IEnumerable<TSource> source, Func<TSource,TResult> selector) {
    // ... error handling omitted for brevity ...
    if (source is ICollection<TSource>)
        return new SelectICollectionIterator<TSource,TResult>((ICollection<TSource>)source, selector);
    // ... rest of the method ...
}
public static int Count<T>(this IEnumerable<T> source) {
    // ...
    ICollection<T> collection = source as ICollection<T>;
    if (collection != null) return collection.Count;
    IDirectlyCountable countableSequence = source as IDirectlyCountable;
    if (countableSequence != null) return countableSequence.Count;
    // ... enumerate and count the sequence ...
}
This will still evaluate the Count lazily: if you change the underlying collection, the reported count changes with it, and the sequence is not cached. The only difference will be that the side effects of the selector delegate are not performed.
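As a small usage sketch of that point (the same thing is observable with the real implementation, since Count() re-evaluates the query each time; under the hypothetical SelectICollectionIterator the only difference is that the selector's side effects would not run):

var numbers = new List<int> { 1, 2, 3 };
IEnumerable<int> doubled = numbers.Select(i => i * 2);

Console.WriteLine(doubled.Count()); // 3

numbers.Add(4);                     // the query is not cached...
Console.WriteLine(doubled.Count()); // ...so this now reports 4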
Edit 02-Feb-2010:
As I see it, there are at least two ways to interpret this question.

Why does the Select<T, TResult> extension method, when called on an instance of a class that implements ICollection<T>, not return an object that provides a Count property; and why does the Count<T> extension method not check for this property so as to provide O(1) performance when the two methods are chained?

This version of the question makes no false assumptions about how the Linq extensions work, and is a valid question, since a call to ICollection<T>.Select.Count will, after all, always return the same value as ICollection<T>.Count. This is how Mehrdad interpreted the question, and he has provided a thorough response to it.
But I read the question as asking...

If the Count<T> extension method provides O(1) performance for an object of a class implementing ICollection<T>, why does it provide O(n) performance for the return value of the Select<T, TResult> extension method?

In this version of the question, there is a mistaken assumption: that the Linq extension methods work together by assembling little collections one after another (in memory) and exposing them through the IEnumerable<T> interface.
If this were how the Linq extensions worked, the Select method might look something like this:
public static IEnumerable<TResult> Select<T, TResult>(this IEnumerable<T> source, Func<T, TResult> selector) {
    List<TResult> results = new List<TResult>();
    foreach (T input in source)
        results.Add(selector(input));
    return results;
}
Moreover, if this were the implementation of Select, I think you'd find that most code using this method would behave just the same. But it would be wasteful, and it would in fact cause exceptions in certain cases, like the one I described in my original answer.

In reality, I believe the implementation of the Select method is much closer to something like this:
public static IEnumerable<TResult> Select<T, TResult>(this IEnumerable<T> source, Func<T, TResult> selector) {
    foreach (T input in source)
        yield return selector(input);
    yield break;
}
This is to provide lazy evaluation, and it explains why a Count property is not accessible in O(1) time to the Count method.
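A short illustration of that deferred execution; the selector below only runs once something actually enumerates the sequence:

var source = new[] { 1, 2, 3 };
IEnumerable<int> query = source.Select(i => {
    Console.WriteLine("selecting " + i);
    return i * 10;
});

Console.WriteLine("query built, nothing selected yet");
int count = query.Count(); // prints "selecting 1", "selecting 2", "selecting 3", then count == 3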
So in other words, whereas Mehrdad answered the question of why Select wasn't designed differently so that Select.Count would behave differently, I have offered my best answer to the question of why Select.Count behaves the way it does.
ORIGINAL ANSWER:
According to Mehrdad's answer:
It doesn't really matter that the result of Select is lazily evaluated.
I don't buy this. Let me explain why.
For starters, consider the following two very similar methods:
public static IEnumerable<double> GetRandomsAsEnumerable(int N) {
    Random r = new Random();
    for (int i = 0; i < N; ++i)
        yield return r.NextDouble();
    yield break;
}
public static double[] GetRandomsAsArray(int N) {
    Random r = new Random();
    double[] values = new double[N];
    for (int i = 0; i < N; ++i)
        values[i] = r.NextDouble();
    return values;
}
OK, what do these methods do? Each one returns as many random doubles as the user desires (up to int.MaxValue). Does it matter whether either method is lazily evaluated or not? To answer this question, let's take a look at the following code:
public static double Invert(double value) {
    return 1.0 / value;
}

public static void Test() {
    int a = GetRandomsAsEnumerable(int.MaxValue).Select(Invert).Count();
    int b = GetRandomsAsArray(int.MaxValue).Select(Invert).Count();
}
Can you guess what will happen with these two method calls? Let me spare you the trouble of copying this code and testing it out yourself:
The first variable, a, will (after a potentially significant amount of time) be initialized to int.MaxValue (currently 2147483647). The second one, b, will very likely be interrupted by an OutOfMemoryException.
Because Select and the other Linq extension methods are lazily evaluated, they allow you to do things you simply could not do otherwise. The above is a fairly trivial example, but my main point is to dispute the assertion that lazy evaluation is not important. Mehrdad's statement that the count "is really known from the start and could be optimized" actually begs the question. The issue might seem straightforward for the Select method, but Select is not really special; it returns an IEnumerable<T> just like the rest of the Linq extension methods, and for these methods to "know" the Count of their return values would require full collections to be cached, which would prohibit lazy evaluation.
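To illustrate why Select is not special in this respect, contrast it with Where (a hypothetical snippet reusing the Invert helper above): the count of a filtered sequence genuinely cannot be known without evaluating the predicate against every element.

double[] values = GetRandomsAsArray(10);

// Select: the result count always equals values.Length, so in principle
// it could be reported without enumerating.
int selected = values.Select(Invert).Count();      // always 10

// Where: the result count depends on the data and the predicate, so there
// is no way to report it without enumerating.
int filtered = values.Where(v => v > 0.5).Count(); // anywhere from 0 to 10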
For this reason, I have to agree with one of the original responders (whose answer now seems to have disappeared) that lazy evaluation really is the answer here. The idea that method side effects need to be accounted for is really secondary, as this is already ensured as a byproduct of lazy evaluation anyway.
Postscript: I've made very assertive statements and emphasized my points mainly because I wanted to be clear on what my argument is, not out of any disrespect for any other responses, including Mehrdad's, which I feel is insightful but misses the mark.