
In which cases are IEnumerable<T>.Count optimized?

Using Reflector, I have noticed that the System.Linq.Enumerable.Count method contains a condition that optimizes for the case where the IEnumerable<T> passed in is actually an ICollection<T>. If the cast succeeds, the Count method does not need to iterate over every element; it can simply read the Count property of ICollection<T>.
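Roughly, the relevant part of the decompiled method looks like this (a simplified sketch of what Reflector shows, not the exact framework source):

public static int Count<TSource>(this IEnumerable<TSource> source) {
    if (source == null) throw new ArgumentNullException("source");
    ICollection<TSource> genericCollection = source as ICollection<TSource>;
    if (genericCollection != null)
        return genericCollection.Count;    // fast path: no enumeration needed
    int count = 0;
    using (IEnumerator<TSource> e = source.GetEnumerator())
        while (e.MoveNext())
            count++;                       // slow path: walk every element
    return count;
}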

Based on this, I was starting to think that IEnumerable<T> can be used as a read-only view of a collection, without the performance loss I had originally expected based on the API of IEnumerable<T>.

I was interested in whether the Count optimization still holds when the IEnumerable<T> is the result of a Select over an ICollection<T>, but based on the reflected code this case is not optimized and requires an iteration through all elements.
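For example (an illustrative snippet of the case I mean):

List<int> list = Enumerable.Range(0, 1000).ToList();

int a = list.Count();                    // O(1): List<int> implements ICollection<int>
int b = list.Select(x => x * 2).Count(); // O(n): the Select result is a plain iterator,
                                         // so Count() walks all 1000 elements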

Do you draw the same conclusions from Reflector? What could be the reason behind the lack of this optimization? It seems like a lot of time is wasted in this common operation. Does the spec require that each element is evaluated even if the Count can be determined without doing so?

asked Feb 01 '10 by shojtsy



2 Answers

It doesn't really matter that the result of Select is lazily evaluated. The Count is always equal to the count of the original collection, so it could certainly have been retrieved directly by returning a specific object from Select that could be used to short-circuit evaluation of the Count method.

The reason it's not possible to optimize out evaluation of the Count() method on the return value of a Select call from something with determined count (like a List<T>) is that it could change the meaning of the program.

The selector function passed to the Select method is allowed to have side effects, and those side effects are required to happen deterministically, in a predetermined order.

Assume:

new[]{1,2,3}.Select(i => { Console.WriteLine(i); return 0; }).Count();

The documentation requires this code to print

1
2
3

Even though the count is really known from the start and could be optimized, optimization would change the behavior of the program. That's why you can't avoid enumeration of the collection anyway. That's exactly one of the reasons why compiler optimizations are much easier in pure functional languages.


UPDATE: Apparently, it's not clear that it's perfectly possible to implement Select and Count so that Selects on an ICollection<T> are still lazily evaluated but Count() is evaluated in O(1) without enumerating the collection. I'm going to do that without changing the interface of any methods. A similar thing is already done for ICollection<T>:

private interface IDirectlyCountable {
    int Count {get;}
}
private class SelectICollectionIterator<TSource,TResult> : IEnumerable<TResult>, IDirectlyCountable {
     ICollection<TSource> sequence;
     Func<TSource,TResult> selector;
     public SelectICollectionIterator(ICollection<TSource> source, Func<TSource,TResult> selector) {
         this.sequence = source;
         this.selector = selector;
     }
     public int Count { get { return sequence.Count; } }
     // ... GetEnumerator ... 
}
public static IEnumerable<TResult> Select<TSource,TResult>(this IEnumerable<TSource> source, Func<TSource,TResult> selector) {
    // ... error handling omitted for brevity ...
    if (source is ICollection<TSource>)
       return new SelectICollectionIterator<TSource,TResult>((ICollection<TSource>)source, selector);
    // ... rest of the method ...
}
public static int Count<T>(this IEnumerable<T> source) {
    // ...
    ICollection<T> collection = source as ICollection<T>;
    if (collection != null) return collection.Count;
    IDirectlyCountable countableSequence = source as IDirectlyCountable;
    if (countableSequence != null) return countableSequence.Count;
    // ... enumerate and count the sequence ...
}

This still evaluates the Count lazily: if you change the underlying collection, the count changes accordingly, and the sequence is not cached. The only difference is that the side effects in the selector delegate will not happen.
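For example, assuming the sketched Select and Count above live in a hypothetical static class called MyEnumerable, they would behave like this:

var list = new List<int> { 1, 2, 3 };
var projected = MyEnumerable.Select(list, i => { Console.WriteLine(i); return i * 2; });

int n = MyEnumerable.Count(projected);   // 3, via IDirectlyCountable; the selector never runs,
                                         // so nothing is printed

list.Add(4);
n = MyEnumerable.Count(projected);       // 4 -- the count still tracks the live collection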

answered by mmx


Edit 02-Feb-2010:

As I see it, there are at least two ways to interpret this question.

Why does the Select<T, TResult> extension method, when called on an instance of a class that implements ICollection<T>, not return an object that provides a Count property; and why does the Count<T> extension method not check for this property so as to provide O(1) performance when the two methods are chained?

This version of the question makes no false assumptions about how Linq extensions work, and is a valid question since a call to ICollection<T>.Select.Count will, after all, always return the same value as ICollection<T>.Count. This is how Mehrdad interpreted the question, to which he has provided a thorough response.

But I read the question as asking...

If the Count<T> extension method provides O(1) performance for an object of a class implementing ICollection<T>, why does it provide O(n) performance for the return value of the Select<T, TResult> extension method?

In this version of the question, there is a mistaken assumption: that the Linq extension methods work together by assembling little collections one after another (in memory) and exposing them through the IEnumerable<T> interface.

If this were how the Linq extensions worked, the Select method might look something like this:

public static IEnumerable<TResult> Select<T, TResult>(this IEnumerable<T> source, Func<T, TResult> selector) {
    List<TResult> results = new List<TResult>();

    foreach (T input in source)
        results.Add(selector(input));

    return results;
}

Moreover, if this were the implementation of Select, I think you'd find most code that utilizes this method would behave just the same. But it would be wasteful, and would in fact cause exceptions in certain cases like the one I described in my original answer.

In reality, I believe the implementation of the Select method is much closer to something like this:

public static IEnumerable<TResult> Select<T, TResult>(this IEnumerable<T> source, Func<T, TResult> selector) {
    foreach (T input in source)
        yield return selector(input);

    yield break;
}

This is to provide lazy evaluation, and explains why a Count property is not accessible in O(1) time to the Count method.
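You can observe this deferred execution directly: nothing in the selector runs until something enumerates the result, and Count() is what forces that enumeration (an illustrative snippet):

var source = new List<int> { 1, 2, 3 };
var query = source.Select(i => { Console.WriteLine("evaluating " + i); return i; });

// Nothing has been printed yet: Select only built an iterator object.
int count = query.Count();   // prints "evaluating 1" through "evaluating 3", then returns 3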

So in other words, whereas Mehrdad answered the question of why Select wasn't designed differently so that Select.Count would behave differently, I have offered my best answer to the question of why Select.Count behaves the way it does.


ORIGINAL ANSWER:

Method side effects are not the answer.

According to Mehrdad's answer:

It doesn't really matter that the result of Select is lazily evaluated.

I don't buy this. Let me explain why.

For starters, consider the following two very similar methods:

public static IEnumerable<double> GetRandomsAsEnumerable(int N) {
    Random r = new Random();

    for (int i = 0; i < N; ++i)
        yield return r.NextDouble();

    yield break;
}

public static double[] GetRandomsAsArray(int N) {
    Random r = new Random();

    double[] values = new double[N];
    for (int i = 0; i < N; ++i)
        values[i] = r.NextDouble();

    return values;
}

OK, what do these methods do? Each one returns as many random doubles as the user desires (up to int.MaxValue). Does it matter whether either method is lazily evaluated or not? To answer this question, let's take a look at the following code:

public static double Invert(double value) {
    return 1.0 / value;
}

public static void Test() {
    int a = GetRandomsAsEnumerable(int.MaxValue).Select(Invert).Count();
    int b = GetRandomsAsArray(int.MaxValue).Select(Invert).Count();
}

Can you guess what will happen with these two method calls? Let me spare you the trouble of copying this code and testing it out yourself:

The first variable, a, will (after a potentially significant amount of time) be initialized to int.MaxValue (currently 2147483647). The second one, b, will very likely be interrupted by an OutOfMemoryException.

Because Select and the other Linq extension methods are lazily evaluated, they allow you to do things you simply could not do otherwise. The above is a fairly trivial example. But my main point is to dispute the assertion that lazy evaluation is not important. Mehrdad's statement that a Count property "is really known from the start and could be optimized" actually begs the question. The issue might seem straightforward for the Select method, but Select is not really special; it returns an IEnumerable<T> just like the rest of the Linq extension methods, and for these methods to "know" the Count of their return values would require full collections to be cached and therefore prohibit lazy evaluation.
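Where makes this point even clearer than Select: the number of results depends on the predicate and on the data, so there is no way for the query to carry a precomputed count (an illustrative snippet):

int[] values = { 1, 2, 3, 4, 5 };

// The only way to find out how many elements pass the predicate
// is to evaluate it against every element.
int evens = values.Where(v => v % 2 == 0).Count();   // 2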

Lazy evaluation is the answer.

For this reason, I have to agree with one of the original responders (whose answer now seems to have disappeared) that lazy evaluation really is the answer here. The idea that method side effects need to be accounted for is really secondary, as this is already ensured as a byproduct of lazy evaluation anyway.

Postscript: I've made very assertive statements and emphasized my points mainly because I wanted to be clear on what my argument is, not out of any disrespect for any other responses, including Mehrdad's, which I feel is insightful but misses the mark.

answered by Dan Tao