Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why are Where and Select outperforming just Select?

Tags:

c#

linq

Select iterates once over the entire set and, for each item, performs a conditional branch (checking for validity) and a + operation.

Where+Select creates an iterator that skips invalid elements (doesn't yield them), performing a + only on the valid items.

So, the cost for a Select is:

t(s) = n * ( cost(check valid) + cost(+) )

And for Where+Select:

t(ws) = n * ( cost(check valid) + p(valid) * (cost(yield) + cost(+)) )

Where:

  • p(valid) is the probability that an item in the list is valid.
  • cost(check valid) is the cost of the branch that checks for validity
  • cost(yield) is the cost of constructing the new state of the where iterator, which is more complex than the simple iterator that the Select version uses.

As you can see, for a given n, the Select version is a constant, whereas the Where+Select version is a linear equation with p(valid) as a variable. The actual values of the costs determine the intersection point of the two lines, and since cost(yield) can be different from cost(+), they don't necessarily intersect at p(valid)=0.5.


Here's an in-depth explanation of what's causing the timing-differences.


The Sum() function for IEnumerable<int> looks like this:

public static int Sum(this IEnumerable<int> source)
{
    int sum = 0;
    foreach(int item in source)
    {
        sum += item;
    }
    return sum;
}

In C#, foreach is just syntactic sugar for .Net's version of an iterator, IEnumerator<T> (not to be confused with IEnumerable<T>). So the above code is actually translated to this:

public static int Sum(this IEnumerable<int> source)
{
    int sum = 0;

    IEnumerator<int> iterator = source.GetEnumerator();
    while(iterator.MoveNext())
    {
        int item = iterator.Current;
        sum += item;
    }
    return sum;
}

Remember, the two lines of code you're comparing are the following

int result1 = myCollection.Where(mc => mc.IsValid).Sum(mc => mc.Value);
int result2 = myCollection.Sum(mc => mc.IsValid ? mc.Value : 0);

Now here's the kicker:

LINQ uses deferred execution. Thus, while it may appear that result1 iterates over the collection twice, it actually only iterates over it once. The Where() condition is actually applied during the Sum(), inside of the call to MoveNext() (This is possible thanks to the magic of yield return).

This means that, for result1, the code inside of the while loop,

{
    int item = iterator.Current;
    sum += item;
}

is only executed once for each item with mc.IsValid == true. By comparison, result2 will execute that code for every item in the collection. That is why result1 is generally faster.

(Though, note that calling the Where() condition within MoveNext() still has some small overhead, so if most/all of the items have mc.IsValid == true, result2 will actually be faster!)


Hopefully now it's clear why result2 is usually slower. Now I'd like to explain why I stated in the comments that these LINQ performance comparisons don't matter.

Creating a LINQ expression is cheap. Calling delegate functions is cheap. Allocating and looping over an iterator is cheap. But it's even cheaper to not do these things. Thus, if you find that a LINQ statement is the bottleneck in your program, in my experience rewriting it without LINQ will always make it faster than any of the various LINQ methods.

So, your LINQ workflow should look like this:

  1. Use LINQ everywhere.
  2. Profile.
  3. If the profiler says LINQ is the cause of a bottleneck, rewrite that piece of code without LINQ.

Fortunately, LINQ bottlenecks are rare. Heck, bottlenecks are rare. I've written hundreds of LINQ statements in the last few years, and have ended up replacing <1%. And most of those were due to LINQ2EF's poor SQL optimization, rather than being LINQ's fault.

So, like always, write clear and sensible code first, and wait until after you've profiled to worry about micro-optimizations.


Funny thing. Do you know how is Sum(this IEnumerable<TSource> source, Func<TSource, int> selector) defined? It uses Select method!

public static int Sum<TSource>(this IEnumerable<TSource> source, Func<TSource, int> selector)
{
    return source.Select(selector).Sum();
}

So actually, it all should work nearly the same. I did quick research on my own, and here are the results:

Where -- mod: 1 result: 0, time: 371 ms
WhereSelect -- mod: 1  result: 0, time: 356 ms
Select -- mod: 1  result 0, time: 366 ms
Sum -- mod: 1  result: 0, time: 363 ms
-------------
Where -- mod: 2 result: 4999999, time: 469 ms
WhereSelect -- mod: 2  result: 4999999, time: 429 ms
Select -- mod: 2  result 4999999, time: 362 ms
Sum -- mod: 2  result: 4999999, time: 358 ms
-------------
Where -- mod: 3 result: 9999999, time: 441 ms
WhereSelect -- mod: 3  result: 9999999, time: 452 ms
Select -- mod: 3  result 9999999, time: 371 ms
Sum -- mod: 3  result: 9999999, time: 380 ms
-------------
Where -- mod: 4 result: 7500000, time: 571 ms
WhereSelect -- mod: 4  result: 7500000, time: 501 ms
Select -- mod: 4  result 7500000, time: 406 ms
Sum -- mod: 4  result: 7500000, time: 397 ms
-------------
Where -- mod: 5 result: 7999999, time: 490 ms
WhereSelect -- mod: 5  result: 7999999, time: 477 ms
Select -- mod: 5  result 7999999, time: 397 ms
Sum -- mod: 5  result: 7999999, time: 394 ms
-------------
Where -- mod: 6 result: 9999999, time: 488 ms
WhereSelect -- mod: 6  result: 9999999, time: 480 ms
Select -- mod: 6  result 9999999, time: 391 ms
Sum -- mod: 6  result: 9999999, time: 387 ms
-------------
Where -- mod: 7 result: 8571428, time: 489 ms
WhereSelect -- mod: 7  result: 8571428, time: 486 ms
Select -- mod: 7  result 8571428, time: 384 ms
Sum -- mod: 7  result: 8571428, time: 381 ms
-------------
Where -- mod: 8 result: 8749999, time: 494 ms
WhereSelect -- mod: 8  result: 8749999, time: 488 ms
Select -- mod: 8  result 8749999, time: 386 ms
Sum -- mod: 8  result: 8749999, time: 373 ms
-------------
Where -- mod: 9 result: 9999999, time: 497 ms
WhereSelect -- mod: 9  result: 9999999, time: 494 ms
Select -- mod: 9  result 9999999, time: 386 ms
Sum -- mod: 9  result: 9999999, time: 371 ms

For following implementations:

result = source.Where(x => x.IsValid).Sum(x => x.Value);
result = source.Select(x => x.IsValid ? x.Value : 0).Sum();
result = source.Sum(x => x.IsValid ? x.Value : 0);
result = source.Where(x => x.IsValid).Select(x => x.Value).Sum();

mod means: every 1 from mod items is invalid: for mod == 1 every item is invalid, for mod == 2 odd items are invalid, etc. Collection contains 10000000 items.

enter image description here

And results for collection with 100000000 items:

enter image description here

As you can see, Select and Sum results are quite consistent across all mod values. However where and where+select are not.


My guess is that the version with Where filters out the 0s and they are not a subject for Sum (i.e. you are not executing the addition). This is of course a guess since I cannot explain how executing additional lambda expression and calling multiple methods outperforms a simple addition of a 0.

A friend of mine suggested that the fact that the 0 in the sum may cause severe performance penalty because of overflow checks. It would be interesting to see how this would perform in unchecked context.