During application profiling I found that function checking for pattern match is very slow. It is written using LINQ. Simple replacement of this LINQ expression with loop makes really huge difference. What is it? Is LINQ really such a bad thing and works so slow or I misunderstand something?
private static bool PatternMatch1(byte[] buffer, int position, string pattern)
{
int i = 0;
foreach (char c in pattern)
{
if (buffer[position + i++] != c)
{
return false;
}
}
return true;
}
version 2 with LINQ (suggested by Resharper)
private static bool PatternMatch2(byte[] buffer, int position, string pattern)
{
int i = 0;
return pattern.All(c => buffer[position + i++] == c);
}
version 3 with LINQ
private static bool PatternMatch3(byte[] buffer, int position, string pattern)
{
return !pattern.Where((t, i) => buffer[position + i] != t).Any();
}
version 4 using lambda
private static bool PatternMatch4(byte[] buffer, int position, string pattern, Func<char, byte, bool> predicate)
{
int i = 0;
foreach (char c in pattern)
{
if (predicate(c, buffer[position + i++]))
{
return false;
}
}
return true;
}
and here is the usage with large buffer
const int SIZE = 1024 * 1024 * 50;
byte[] buffer = new byte[SIZE];
for (int i = 0; i < SIZE - 3; ++i)
{
if (PatternMatch1(buffer, i, "xxx"))
{
Console.WriteLine(i);
}
}
Calling PatternMatch2
or PatternMatch3
is phenomenally slow. It takes about 8 seconds for PatternMatch3
and about 4 seconds for PatternMatch2
, while calling PatternMatch1
takes about 0.6. As far as I understand it is the same code and I see no difference. Any ideas?
It is slightly slower This is acknowledged in the Microsoft documentation: LINQ syntax is typically less efficient than a foreach loop. It's good to be aware of any performance tradeoff that might occur when you use LINQ to improve the readability of your code.
Conclusion. It would seem the performance of LINQ is similar to more basic constructs in C#, except for that notable case where Count was significantly slower. If performance is important it's crucial to do benchmarks on your application rather than relying on anecdotes (including this one).
We can see right away that LINQ is a lot slower than raw SQL, but compiled LINQ is a bit faster. Note that results are in microseconds; real-world queries may take tens or even hundreds of milliseconds, so LINQ overhead will be hardly noticeable.
Mark Byers and Marco Mp are right about the extra calls overhead. However, there is another reason here: many object allocations, due to a closure.
The compiler creates a new object on each iteration, storing the current values of buffer
, position
and i
that the predicate can use. Here is a dotTrace snapshot of PatternMatch2
showing this. <>c_DisplayClass1
is the compiler-generated class.
(Note that the numbers are huge because I'm using tracing profiling, which adds overhead. However, it's the same overhead per call, so the overall percentages are correct).
Well let's take the Where operator.
It's implementation is almost(*) like:
public IEnumerable<T> Where(this IEnumerable<T> input, Func<T, bool> fn)
{
foreach(T i in input)
if (fn(i))
yield return i;
}
This means, for each loop over the IEnumerable an iterator object is created - note that you have SIZE - n of these allocations because you do those many LINQ queries.
Then for each character in the pattern you have:
The second is a call to a delegate which costs roughly double the calling cost of a typical virtual call (where in the first version you have no extra calls apart the array deindexing.
If you really want brute performance, you probably want to get as "older-style" code as possible. In this case I would even replace that foreach in method 1 with a plain for (at the very least if it does not serve to optimize, it makes it more readable since you are keeping track of the index anyway).
It's also more readable in this case, and it shows that Resharper suggestions are sometimes debatable.
(*) I said almost because it uses a proxy method to check that the input enumerable is not null and throw the exception earlier than the time the collection is enumerated -- it's a small detail which does not invalidate what I wrote before, highlighting here for completeness.
The difference is that the LINQ version has extra function calls. Internally in LINQ your lambda function is called in a loop.
While the extra call might be optimized away by the JIT compiler, it's not guaranteed and it can add a small overhead. In most cases the extra overhead won't matter, but because your function is very simple and is called an extremely large number of times, even a small overhead can quickly add up. In my experience LINQ code generally is a little bit slower than equivalent for
loops. That's the performance price you often pay for a more high level syntax.
In this situation, I would recommend sticking with the explicit loop.
While you're optimizing this section of code, you might also want to consider a more efficient searching algorithm. Your algorithm is worst case O(n*m) but better algorithms exist.
LINQ's main goal when applied to collections is its simplicity. If you want performance, you should avoid LINQ entirely. Also to enumerate an array you just increment an index variable, while LINQ needs to set up an entire enumerator object.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With