
In which cases are IEnumerable<T>.Count optimized?

Using Reflector, I have noticed that the System.Linq.Enumerable.Count method contains a condition that optimizes for the case where the IEnumerable<T> passed in is actually an ICollection<T>. If the cast succeeds, the Count method does not need to iterate over every element; it can simply read the Count property of ICollection<T>.
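Roughly, the relevant part of the decompiled method looks like this (a simplified sketch of what Reflector shows, not the exact framework source):

public static int Count<TSource>(this IEnumerable<TSource> source) {
    if (source == null) throw new ArgumentNullException("source");
    ICollection<TSource> genericCollection = source as ICollection<TSource>;
    if (genericCollection != null)
        return genericCollection.Count;    // fast path: no enumeration needed
    int count = 0;
    using (IEnumerator<TSource> e = source.GetEnumerator())
        while (e.MoveNext())
            count++;                       // slow path: walk every element
    return count;
}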

Based on this, I was starting to think that IEnumerable<T> can be used as a read-only view of a collection, without the performance loss I had originally expected based on the API of IEnumerable<T>.

I was interested in whether the Count optimization still holds when the IEnumerable<T> is the result of a Select over an ICollection<T>, but based on the reflected code this case is not optimized and requires an iteration through all elements.
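For example (an illustrative snippet of the case I mean):

List<int> list = Enumerable.Range(0, 1000).ToList();

int a = list.Count();                    // O(1): List<int> implements ICollection<int>
int b = list.Select(x => x * 2).Count(); // O(n): the Select result is a plain iterator,
                                         // so Count() walks all 1000 elements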

Do you draw the same conclusions from Reflector? What could be the reason behind the lack of this optimization? It seems like a lot of time is wasted in this common operation. Does the spec require that each element is evaluated even if the Count can be determined without doing so?

asked Feb 01 '10 by shojtsy



2 Answers

It doesn't really matter that the result of Select is lazily evaluated. The Count is always equal to the count of the original collection, so it could certainly have been retrieved directly by returning a specific object from Select that could be used to short-circuit evaluation of the Count method.

The reason it's not possible to optimize out evaluation of the Count() method on the return value of a Select call from something with determined count (like a List<T>) is that it could change the meaning of the program.

The selector function passed to the Select method is allowed to have side effects, and those side effects are required to happen deterministically, in a predetermined order.

Assume:

new[]{1,2,3}.Select(i => { Console.WriteLine(i); return 0; }).Count();

The documentation requires this code to print

1
2
3

Even though the count is really known from the start and could be optimized, optimization would change the behavior of the program. That's why you can't avoid enumeration of the collection anyway. That's exactly one of the reasons why compiler optimizations are much easier in pure functional languages.


UPDATE: Apparently, it's not clear that it's perfectly possible to implement Select and Count so that Selects on an ICollection<T> are still lazily evaluated but Count() is evaluated in O(1) without enumerating the collection. I'm going to do that without changing the interface of any methods. A similar thing is already done for ICollection<T>:

private interface IDirectlyCountable {
    int Count {get;}
}
private class SelectICollectionIterator<TSource,TResult> : IEnumerable<TResult>, IDirectlyCountable {
     ICollection<TSource> sequence;
     Func<TSource,TResult> selector;
     public SelectICollectionIterator(ICollection<TSource> source, Func<TSource,TResult> selector) {
         this.sequence = source;
         this.selector = selector;
     }
     public int Count { get { return sequence.Count; } }
     // ... GetEnumerator ... 
}
public static IEnumerable<TResult> Select<TSource,TResult>(this IEnumerable<TSource> source, Func<TSource,TResult> selector) {
    // ... error handling omitted for brevity ...
    if (source is ICollection<TSource>)
       return new SelectICollectionIterator<TSource,TResult>((ICollection<TSource>)source, selector);
    // ... rest of the method ...
}
public static int Count<T>(this IEnumerable<T> source) {
    // ...
    ICollection<T> collection = source as ICollection<T>;
    if (collection != null) return collection.Count;
    IDirectlyCountable countableSequence = source as IDirectlyCountable;
    if (countableSequence != null) return countableSequence.Count;
    // ... enumerate and count the sequence ...
}

This still evaluates the Count lazily: if you change the underlying collection, the count changes accordingly, and the sequence is not cached. The only difference is that the side effects in the selector delegate will not happen.
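For example, assuming the sketched Select and Count above live in a hypothetical static class called MyEnumerable, they would behave like this:

var list = new List<int> { 1, 2, 3 };
var projected = MyEnumerable.Select(list, i => { Console.WriteLine(i); return i * 2; });

int n = MyEnumerable.Count(projected);   // 3, via IDirectlyCountable; the selector never runs,
                                         // so nothing is printed

list.Add(4);
n = MyEnumerable.Count(projected);       // 4 -- the count still tracks the live collection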

answered by mmx


Edit 02-Feb-2010:

As I see it, there are at least two ways to interpret this question.

Why does the Select<T, TResult> extension method, when called on an instance of a class that implements ICollection<T>, not return an object that provides a Count property; and why does the Count<T> extension method not check for this property so as to provide O(1) performance when the two methods are chained?

This version of the question makes no false assumptions about how Linq extensions work, and is a valid question since a call to ICollection<T>.Select.Count will, after all, always return the same value as ICollection<T>.Count. This is how Mehrdad interpreted the question, to which he has provided a thorough response.

But I read the question as asking...

If the Count<T> extension method provides O(1) performance for an object of a class implementing ICollection<T>, why does it provide O(n) performance for the return value of the Select<T, TResult> extension method?

In this version of the question, there is a mistaken assumption: that the Linq extension methods work together by assembling little collections one after another (in memory) and exposing them through the IEnumerable<T> interface.

If this were how the Linq extensions worked, the Select method might look something like this:

public static IEnumerable<TResult> Select<T, TResult>(this IEnumerable<T> source, Func<T, TResult> selector) {
    List<TResult> results = new List<TResult>();

    foreach (T input in source)
        results.Add(selector(input));

    return results;
}

Moreover, if this were the implementation of Select, I think you'd find most code that utilizes this method would behave just the same. But it would be wasteful, and would in fact cause exceptions in certain cases like the one I described in my original answer.

In reality, I believe the implementation of the Select method is much closer to something like this:

public static IEnumerable<TResult> Select<T, TResult>(this IEnumerable<T> source, Func<T, TResult> selector) {
    foreach (T input in source)
        yield return selector(input);

    yield break;
}

This is to provide lazy evaluation, and explains why a Count property is not accessible in O(1) time to the Count method.
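You can observe this deferred execution directly: nothing in the selector runs until something enumerates the result, and Count() is what forces that enumeration (an illustrative snippet):

var source = new List<int> { 1, 2, 3 };
var query = source.Select(i => { Console.WriteLine("evaluating " + i); return i; });

// Nothing has been printed yet: Select only built an iterator object.
int count = query.Count();   // prints "evaluating 1" through "evaluating 3", then returns 3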

So in other words, whereas Mehrdad answered the question of why Select wasn't designed differently so that Select.Count would behave differently, I have offered my best answer to the question of why Select.Count behaves the way it does.


ORIGINAL ANSWER:

Method side effects are not the answer.

According to Mehrdad's answer:

It doesn't really matter that the result of Select is lazily evaluated.

I don't buy this. Let me explain why.

For starters, consider the following two very similar methods:

public static IEnumerable<double> GetRandomsAsEnumerable(int N) {
    Random r = new Random();

    for (int i = 0; i < N; ++i)
        yield return r.NextDouble();

    yield break;
}

public static double[] GetRandomsAsArray(int N) {
    Random r = new Random();

    double[] values = new double[N];
    for (int i = 0; i < N; ++i)
        values[i] = r.NextDouble();

    return values;
}

OK, what do these methods do? Each one returns as many random doubles as the user desires (up to int.MaxValue). Does it matter whether either method is lazily evaluated or not? To answer this question, let's take a look at the following code:

public static double Invert(double value) {
    return 1.0 / value;
}

public static void Test() {
    int a = GetRandomsAsEnumerable(int.MaxValue).Select(Invert).Count();
    int b = GetRandomsAsArray(int.MaxValue).Select(Invert).Count();
}

Can you guess what will happen with these two method calls? Let me spare you the trouble of copying this code and testing it out yourself:

The first variable, a, will (after a potentially significant amount of time) be initialized to int.MaxValue (currently 2147483647). The second one, b, will very likely be interrupted by an OutOfMemoryException.

Because Select and the other Linq extension methods are lazily evaluated, they allow you to do things you simply could not do otherwise. The above is a fairly trivial example. But my main point is to dispute the assertion that lazy evaluation is not important. Mehrdad's statement that a Count property "is really known from the start and could be optimized" actually begs the question. The issue might seem straightforward for the Select method, but Select is not really special; it returns an IEnumerable<T> just like the rest of the Linq extension methods, and for these methods to "know" the Count of their return values would require full collections to be cached and therefore prohibit lazy evaluation.
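Where makes this point even clearer than Select: the number of results depends on the predicate and on the data, so there is no way for the query to carry a precomputed count (an illustrative snippet):

int[] values = { 1, 2, 3, 4, 5 };

// The only way to find out how many elements pass the predicate
// is to evaluate it against every element.
int evens = values.Where(v => v % 2 == 0).Count();   // 2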

Lazy evaluation is the answer.

For this reason, I have to agree with one of the original responders (whose answer now seems to have disappeared) that lazy evaluation really is the answer here. The idea that method side effects need to be accounted for is really secondary, as this is already ensured as a byproduct of lazy evaluation anyway.

Postscript: I've made very assertive statements and emphasized my points mainly because I wanted to be clear on what my argument is, not out of any disrespect for any other responses, including Mehrdad's, which I feel is insightful but misses the mark.

answered by Dan Tao