To start with, this is not the same as "Why is Func<> created from Expression<Func<>> slower than Func<> declared directly?" and is, surprisingly, just the opposite of it. Additionally, all the links and questions I found while researching this issue date from the 2010-2012 period, so I have decided to open a new question to see whether there is some discussion to be had around the current state of delegate behavior in the .NET ecosystem.
That said, I am using .NET Core 2.0 and .NET 4.7.1 and am seeing some curious performance metrics in regards to delegates that are created from a compiled expression versus delegates that are described and declared as a CLR object.
For some context on how I stumbled upon this issue, I was doing a test involving a selection of data in arrays of 1,000 and 10,000 objects, and noticed that if I used a compiled expression it was getting faster results across the board. I managed to boil this down to a very simple project that reproduces this issue which you can find here:
https://github.com/Mike-EEE/StackOverflow.Performance.Delegates
For the testing, I have two sets of benchmarks that are used that feature a compiled delegate paired with a declared delegate, resulting in four total core benchmarks.
The first delegate set is comprised of an empty delegate that returns a null string. The second set is a delegate that has a simple expression within it. I wanted to demonstrate that this issue occurs with the simplest of delegates as well as ones with a defined body within it.
These tests are then run on the CLR runtime and the .NET Core runtime via the excellent Benchmark.NET performance product, resulting in eight total benchmarks. Additionally, I also make use of the just-as-excellent Benchmark.NET disassembly diagnoser to emit the disassembly encountered during the JIT of the benchmark measurements. I share the results of this below.
Here is the code that runs the benchmarks. You can see that it is very straight-forward:
[CoreJob, ClrJob, DisassemblyDiagnoser(true, printSource: true)]
public class Delegates
{
    readonly DelegatePair<string, string> _empty;
    readonly DelegatePair<string, int> _expression;
    readonly string _message;

    public Delegates() : this(new DelegatePair<string, string>(_ => default, _ => default),
                              new DelegatePair<string, int>(x => x.Length, x => x.Length)) {}

    public Delegates(DelegatePair<string, string> empty, DelegatePair<string, int> expression,
                     string message = "Hello World!")
    {
        _empty = empty;
        _expression = expression;
        _message = message;
        EmptyDeclared();
        EmptyCompiled();
        ExpressionDeclared();
        ExpressionCompiled();
    }

    [Benchmark]
    public void EmptyDeclared() => _empty.Declared(default);

    [Benchmark]
    public void EmptyCompiled() => _empty.Compiled(default);

    [Benchmark]
    public void ExpressionDeclared() => _expression.Declared(_message);

    [Benchmark]
    public void ExpressionCompiled() => _expression.Compiled(_message);
}
These are the results I see in Benchmark.NET:
BenchmarkDotNet=v0.10.14, OS=Windows 10.0.16299.371 (1709/FallCreatorsUpdate/Redstone3)
Intel Core i7-4820K CPU 3.70GHz (Haswell), 1 CPU, 8 logical and 8 physical cores
.NET Core SDK=2.1.300-preview2-008533
[Host] : .NET Core 2.0.7 (CoreCLR 4.6.26328.01, CoreFX 4.6.26403.03), 64bit RyuJIT
Clr : .NET Framework 4.7.1 (CLR 4.0.30319.42000), 64bit RyuJIT-v4.7.2633.0
Core : .NET Core 2.0.7 (CoreCLR 4.6.26328.01, CoreFX 4.6.26403.03), 64bit RyuJIT
Method | Job | Runtime | Mean | Error | StdDev |
------------------- |----- |-------- |----------:|----------:|----------:|
EmptyDeclared | Clr | Clr | 1.3691 ns | 0.0302 ns | 0.0282 ns |
EmptyCompiled | Clr | Clr | 1.1851 ns | 0.0381 ns | 0.0357 ns |
ExpressionDeclared | Clr | Clr | 1.3805 ns | 0.0314 ns | 0.0294 ns |
ExpressionCompiled | Clr | Clr | 1.1431 ns | 0.0396 ns | 0.0371 ns |
EmptyDeclared | Core | Core | 1.5733 ns | 0.0329 ns | 0.0308 ns |
EmptyCompiled | Core | Core | 0.9326 ns | 0.0275 ns | 0.0244 ns |
ExpressionDeclared | Core | Core | 1.6040 ns | 0.0394 ns | 0.0368 ns |
ExpressionCompiled | Core | Core | 0.9380 ns | 0.0485 ns | 0.0631 ns |
Do note that the benchmarks that make use of a compiled delegate are consistently faster.
Finally, here are the results of the disassembly encountered for each benchmark:
<table>
<thead>
<tr><th colspan="2">Delegates.EmptyDeclared</th></tr>
<tr>
<th>.NET Framework 4.7.1 (CLR 4.0.30319.42000), 64bit RyuJIT-v4.7.2633.0</th>
<th>.NET Core 2.0.7 (CoreCLR 4.6.26328.01, CoreFX 4.6.26403.03), 64bit RyuJIT</th>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align:top;"><pre><code>
00007ffd`4f8f0ea0 StackOverflow.Performance.Delegates.Delegates.EmptyDeclared()
public void EmptyDeclared() => _empty.Declared(default);
^^^^^^^^^^^^^^^^^^^^^^^^
00007ffd`4f8f0ea4 4883c110 add rcx,10h
00007ffd`4f8f0ea8 488b01 mov rax,qword ptr [rcx]
00007ffd`4f8f0eab 488b4808 mov rcx,qword ptr [rax+8]
00007ffd`4f8f0eaf 33d2 xor edx,edx
00007ffd`4f8f0eb1 ff5018 call qword ptr [rax+18h]
00007ffd`4f8f0eb4 90 nop
</code></pre></td>
<td style="vertical-align:top;"><pre><code>
00007ffd`39c8d8b0 StackOverflow.Performance.Delegates.Delegates.EmptyDeclared()
public void EmptyDeclared() => _empty.Declared(default);
^^^^^^^^^^^^^^^^^^^^^^^^
00007ffd`39c8d8b4 4883c110 add rcx,10h
00007ffd`39c8d8b8 488b01 mov rax,qword ptr [rcx]
00007ffd`39c8d8bb 488b4808 mov rcx,qword ptr [rax+8]
00007ffd`39c8d8bf 33d2 xor edx,edx
00007ffd`39c8d8c1 ff5018 call qword ptr [rax+18h]
00007ffd`39c8d8c4 90 nop
</code></pre></td>
</tr>
</tbody>
</table>
<table>
<thead>
<tr><th colspan="2">Delegates.EmptyCompiled</th></tr>
<tr>
<th>.NET Framework 4.7.1 (CLR 4.0.30319.42000), 64bit RyuJIT-v4.7.2633.0</th>
<th>.NET Core 2.0.7 (CoreCLR 4.6.26328.01, CoreFX 4.6.26403.03), 64bit RyuJIT</th>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align:top;"><pre><code>
00007ffd`4f8e0ef0 StackOverflow.Performance.Delegates.Delegates.EmptyCompiled()
public void EmptyCompiled() => _empty.Compiled(default);
^^^^^^^^^^^^^^^^^^^^^^^^
00007ffd`4f8e0ef4 4883c110 add rcx,10h
00007ffd`4f8e0ef8 488b4108 mov rax,qword ptr [rcx+8]
00007ffd`4f8e0efc 488b4808 mov rcx,qword ptr [rax+8]
00007ffd`4f8e0f00 33d2 xor edx,edx
00007ffd`4f8e0f02 ff5018 call qword ptr [rax+18h]
00007ffd`4f8e0f05 90 nop
</code></pre></td>
<td style="vertical-align:top;"><pre><code>
00007ffd`39c8d900 StackOverflow.Performance.Delegates.Delegates.EmptyCompiled()
public void EmptyCompiled() => _empty.Compiled(default);
^^^^^^^^^^^^^^^^^^^^^^^^
00007ffd`39c8d904 4883c110 add rcx,10h
00007ffd`39c8d908 488b4108 mov rax,qword ptr [rcx+8]
00007ffd`39c8d90c 488b4808 mov rcx,qword ptr [rax+8]
00007ffd`39c8d910 33d2 xor edx,edx
00007ffd`39c8d912 ff5018 call qword ptr [rax+18h]
00007ffd`39c8d915 90 nop
</code></pre></td>
</tr>
</tbody>
</table>
<table>
<thead>
<tr><th colspan="2">Delegates.ExpressionDeclared</th></tr>
<tr>
<th>.NET Framework 4.7.1 (CLR 4.0.30319.42000), 64bit RyuJIT-v4.7.2633.0</th>
<th>.NET Core 2.0.7 (CoreCLR 4.6.26328.01, CoreFX 4.6.26403.03), 64bit RyuJIT</th>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align:top;"><pre><code>
00007ffd`4f8e0f20 StackOverflow.Performance.Delegates.Delegates.ExpressionDeclared()
public void ExpressionDeclared() => _expression.Declared(_message);
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
00007ffd`4f8e0f24 488d5120 lea rdx,[rcx+20h]
00007ffd`4f8e0f28 488b02 mov rax,qword ptr [rdx]
00007ffd`4f8e0f2b 488b5108 mov rdx,qword ptr [rcx+8]
00007ffd`4f8e0f2f 488b4808 mov rcx,qword ptr [rax+8]
00007ffd`4f8e0f33 ff5018 call qword ptr [rax+18h]
00007ffd`4f8e0f36 90 nop
</code></pre></td>
<td style="vertical-align:top;"><pre><code>
00007ffd`39c9d930 StackOverflow.Performance.Delegates.Delegates.ExpressionDeclared()
public void ExpressionDeclared() => _expression.Declared(_message);
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
00007ffd`39c9d934 488d5120 lea rdx,[rcx+20h]
00007ffd`39c9d938 488b02 mov rax,qword ptr [rdx]
00007ffd`39c9d93b 488b5108 mov rdx,qword ptr [rcx+8]
00007ffd`39c9d93f 488b4808 mov rcx,qword ptr [rax+8]
00007ffd`39c9d943 ff5018 call qword ptr [rax+18h]
00007ffd`39c9d946 90 nop
</code></pre></td>
</tr>
</tbody>
</table>
<table>
<thead>
<tr><th colspan="2">Delegates.ExpressionCompiled</th></tr>
<tr>
<th>.NET Framework 4.7.1 (CLR 4.0.30319.42000), 64bit RyuJIT-v4.7.2633.0</th>
<th>.NET Core 2.0.7 (CoreCLR 4.6.26328.01, CoreFX 4.6.26403.03), 64bit RyuJIT</th>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align:top;"><pre><code>
00007ffd`4f8f0f70 StackOverflow.Performance.Delegates.Delegates.ExpressionCompiled()
public void ExpressionCompiled() => _expression.Compiled(_message);
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
00007ffd`4f8f0f74 488d5120 lea rdx,[rcx+20h]
00007ffd`4f8f0f78 488b4208 mov rax,qword ptr [rdx+8]
00007ffd`4f8f0f7c 488b5108 mov rdx,qword ptr [rcx+8]
00007ffd`4f8f0f80 488b4808 mov rcx,qword ptr [rax+8]
00007ffd`4f8f0f84 ff5018 call qword ptr [rax+18h]
00007ffd`4f8f0f87 90 nop
</code></pre></td>
<td style="vertical-align:top;"><pre><code>
00007ffd`39c9d980 StackOverflow.Performance.Delegates.Delegates.ExpressionCompiled()
public void ExpressionCompiled() => _expression.Compiled(_message);
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
00007ffd`39c9d984 488d5120 lea rdx,[rcx+20h]
00007ffd`39c9d988 488b4208 mov rax,qword ptr [rdx+8]
00007ffd`39c9d98c 488b5108 mov rdx,qword ptr [rcx+8]
00007ffd`39c9d990 488b4808 mov rcx,qword ptr [rax+8]
00007ffd`39c9d994 ff5018 call qword ptr [rax+18h]
00007ffd`39c9d997 90 nop
</code></pre></td>
</tr>
</tbody>
</table>
It would seem that the only difference between the declared and compiled delegate disassembly is the rcx for declared vs. the rcx+8 for compiled used within their respective first mov operations. I am not yet well-versed in disassembly, so any context around this would be greatly appreciated. At first glance, it would not seem that this would cause the difference/improvement, and if it does, the native-declared delegate should feature it as well (so, in other words, a bug).
With all of this stated, the obvious questions to me are:
For completeness, here is all of the code used in the sample here in its entirety:
sealed class Program
{
    static void Main()
    {
        BenchmarkRunner.Run<Delegates>();
    }
}

[CoreJob, ClrJob, DisassemblyDiagnoser(true, printSource: true)]
public class Delegates
{
    readonly DelegatePair<string, string> _empty;
    readonly DelegatePair<string, int> _expression;
    readonly string _message;

    public Delegates() : this(new DelegatePair<string, string>(_ => default, _ => default),
                              new DelegatePair<string, int>(x => x.Length, x => x.Length)) {}

    public Delegates(DelegatePair<string, string> empty, DelegatePair<string, int> expression,
                     string message = "Hello World!")
    {
        _empty = empty;
        _expression = expression;
        _message = message;
        EmptyDeclared();
        EmptyCompiled();
        ExpressionDeclared();
        ExpressionCompiled();
    }

    [Benchmark]
    public void EmptyDeclared() => _empty.Declared(default);

    [Benchmark]
    public void EmptyCompiled() => _empty.Compiled(default);

    [Benchmark]
    public void ExpressionDeclared() => _expression.Declared(_message);

    [Benchmark]
    public void ExpressionCompiled() => _expression.Compiled(_message);
}

public struct DelegatePair<TFrom, TTo>
{
    DelegatePair(Func<TFrom, TTo> declared, Func<TFrom, TTo> compiled)
    {
        Declared = declared;
        Compiled = compiled;
    }

    public DelegatePair(Func<TFrom, TTo> declared, Expression<Func<TFrom, TTo>> expression) :
        this(declared, expression.Compile()) {}

    public Func<TFrom, TTo> Declared { get; }
    public Func<TFrom, TTo> Compiled { get; }
}
Thank you in advance for any assistance that you can provide!
Am I doing something entirely off-base here? (Guess this should be the first question. :))
I'm reasonably certain that the disassembly you're seeing is for the benchmark methods only: the instructions needed to load the delegate and its argument, then invoke the delegate. It does not include the body of each delegate.
That's why the only difference is the relative offset in one of the mov instructions: one of the delegates lives at offset 0 in the struct, and the other lives at offset 8. Swap the declaration order of Compiled and Declared, and see how the disassembly changes.
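The swap suggested here could be tried with a variant of the struct along these lines. This is only a sketch: SwappedDelegatePair is a hypothetical name that is not part of the original project, and the runtime is not obliged to lay fields out in declaration order, so the offsets should be confirmed in the disassembly rather than assumed.

```csharp
using System;
using System.Linq.Expressions;

// Hypothetical variant of DelegatePair with the two properties declared in the
// opposite order, so the Compiled backing field is declared first.
public struct SwappedDelegatePair<TFrom, TTo>
{
    public SwappedDelegatePair(Func<TFrom, TTo> declared, Expression<Func<TFrom, TTo>> expression)
    {
        Compiled = expression.Compile();
        Declared = declared;
    }

    public Func<TFrom, TTo> Compiled { get; }
    public Func<TFrom, TTo> Declared { get; }
}
```

If the mov offset simply tracks field order, the rcx and rcx+8 operands would be expected to trade places between the Declared and Compiled benchmarks.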
I'm not aware of any way to get Benchmark.NET to spit out the disassembly for calls deeper down in the call tree. The documentation suggests that setting recursiveDepth to some value n > 1 on [DisassemblyDiagnoser] should do it, but it doesn't seem to work in this case.
Are you saying there is extra disassembly that we are not seeing?
Correct, you are not seeing the disassembly for the delegate bodies. If there is a difference in how they are being compiled, that's where it would be visible.
Since both bodies are exactly the same (or at least, appear to be the same), I am further unclear on how this would be the case.
The bodies are not necessarily the same. For Expression-based lambdas, the C# compiler does not emit the IL for the described expression; rather, it emits a series of Expression factory calls to construct an expression tree at runtime. That expression tree describes code that should be functionally equivalent to the C# expression from which it was generated, but it is compiled by LambdaCompiler at runtime upon calling Compile(). LINQ expression trees are meant to be language-agnostic, and don't necessarily have exact parity with the expressions generated by the C# compiler. Because lambda expressions are compiled by a different (and less sophisticated) compiler, the resulting IL may be a bit different from what the C# compiler would have emitted. For example, the lambda compiler tends to emit more temporary locals than the C# compiler, or at least it did the last time I poked around in the source code.
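To make the "factory calls" point concrete, here is a hand-written approximation of the tree the C# compiler builds for x => x.Length when the target type is Expression<Func<string, int>>. It is a sketch, not the compiler's literal output (the real generated code references the property getter through metadata rather than by name), and ExpressionTreeSketch, BuildLength, and CompileLength are hypothetical names.

```csharp
using System;
using System.Linq.Expressions;

public static class ExpressionTreeSketch
{
    // Approximation of what is emitted for:
    //   Expression<Func<string, int>> e = x => x.Length;
    // No IL for the lambda body exists yet; this only builds a data structure.
    public static Expression<Func<string, int>> BuildLength()
    {
        ParameterExpression x = Expression.Parameter(typeof(string), "x");
        MemberExpression length = Expression.Property(x, nameof(string.Length));
        return Expression.Lambda<Func<string, int>>(length, x);
    }

    // IL for the body is only generated here, at run time, when Compile() is called.
    public static Func<string, int> CompileLength() => BuildLength().Compile();
}
```

Calling ExpressionTreeSketch.CompileLength() returns a delegate that behaves like the declared x => x.Length lambda, even though its code was generated by a different compiler.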
Your best bet for determining the actual disassembly for each delegate may be to load up SOS.dll in the debugger. I tried to do that myself, but I can't seem to figure out how to get it working in VS2017. I never had trouble in the past. I haven't quite come to terms with the new project model in VS2017 yet, and can't figure out how to enable unmanaged debugging.
OK, I got SOS.dll loaded up with WinDbg, and after a bit of Googling, I'm now able to view the IL and disassembly. First, let's take a look at the method descriptors for the lambda bodies. This is the Declared version:
0:000> !DumpMD 000007fe97686148
Method Name: StackOverflow.Performance.Delegates.Delegates+<>c.<.ctor>b__3_2(System.String)
Class: 000007fe977d14d0
MethodTable: 000007fe97686158
mdToken: 000000000600000e
Module: 000007fe976840c0
IsJitted: yes
CodeAddr: 000007fe977912b0
Transparency: Critical
And this is the Compiled version:
0:000> !DumpMD 000007fe97689390
Method Name: DynamicClass.lambda_method(System.Runtime.CompilerServices.Closure, System.String)
Class: 000007fe97689270
MethodTable: 000007fe976892e8
mdToken: 0000000006000000
Module: 000007fe97688af8
IsJitted: yes
CodeAddr: 000007fe977e0150
Transparency: Transparent
We can dump the IL and see that it is actually the same:
0:000> !DumpIL 000007fe97686148
IL_0000: ldarg.1
IL_0001: callvirt 6000002 System.String.get_Length()
IL_0006: ret
0:000> !DumpIL 000007fe97689390
IL_0000: ldarg.1
IL_0001: callvirt System.String::get_Length
IL_0006: ret
So, too, is the disassembly:
0:000> !U 000007fe977912b0
Normal JIT generated code
StackOverflow.Performance.Delegates.Delegates+<>c.<.ctor>b__3_2(System.String)
Begin 000007fe977912b0, size 4
W:\dump\DelegateBenchmark\StackOverflow.Performance.Delegates\Delegates.cs @ 14:
000007fe`977912b0 8b4208 mov eax,dword ptr [rdx+8]
000007fe`977912b3 c3 ret
0:000> !U 000007fe977e0150
Normal JIT generated code
DynamicClass.lambda_method(System.Runtime.CompilerServices.Closure, System.String)
Begin 000007fe977e0150, size 4
000007fe`977e0150 8b4208 mov eax,dword ptr [rdx+8]
000007fe`977e0153 c3 ret
So, we have the same IL, and the same assembly. Where is the difference coming from? Let's take a look at the actual delegate instances. By that, I don't mean the lambda bodies, but the Delegate objects we use to invoke the lambdas.
0:000> !DumpVC /d 000007fe97686040 0000000002a84410
Name: StackOverflow.Performance.Delegates.DelegatePair`2[[System.String, mscorlib],[System.Int32, mscorlib]]
MethodTable: 000007fe97686040
EEClass: 000007fe977d12d0
Size: 32(0x20) bytes
File: W:\dump\DelegateBenchmark\StackOverflow.Performance.Delegates\bin\Release\net461\StackOverflow.Performance.Delegates.exe
Fields:
MT Field Offset Type VT Attr Value Name
000007fef692e400 4000001 0 ...Int32, mscorlib]] 0 instance 0000000002a8b4d8 <Declared>k__BackingField
000007fef692e400 4000002 8 ...Int32, mscorlib]] 0 instance 0000000002a8d3f8 <Compiled>k__BackingField
We have two delegate values: in my case, Declared lives at 02a8b4d8, while Compiled lives at 02a8d3f8 (these addresses are unique to my process). If we dump each of these addresses with !DumpObj and look for the _methodPtr value, we can see the addresses for the compiled methods. We can then dump the assembly with !U:
0:000> !U 7fe977e0150
Normal JIT generated code
DynamicClass.lambda_method(System.Runtime.CompilerServices.Closure, System.String)
Begin 000007fe977e0150, size 4
000007fe`977e0150 8b4208 mov eax,dword ptr [rdx+8]
000007fe`977e0153 c3 ret
OK, for Compiled, we can see that we're calling directly into the lambda body. Nice. But when we dump the disassembly for the Declared version, we see something different:
0:000> !U 7fe977901d8
Unmanaged code
000007fe`977901d8 e8f326635f call clr!PrecodeFixupThunk (000007fe`f6dc28d0)
000007fe`977901dd 5e pop rsi
000007fe`977901de 0400 add al,0
000007fe`977901e0 286168 sub byte ptr [rcx+68h],ah
000007fe`977901e3 97 xchg eax,edi
000007fe`977901e4 fe07 inc byte ptr [rdi]
000007fe`977901e6 0000 add byte ptr [rax],al
000007fe`977901e8 0000 add byte ptr [rax],al
000007fe`977901ea 0000 add byte ptr [rax],al
000007fe`977901ec 0000 add byte ptr [rax],al
Hello there. I remember seeing references to clr!PrecodeFixupThunk in a blog post by Matt Warren. My understanding is that the entry point for a normal IL method (as opposed to a dynamic method like our LINQ-based method) calls into a fixup method that invokes the JIT on the first invocation, then calls into the JITed method on subsequent invocations. The additional overhead of that 'thunk' when invoking the 'declared' delegate would appear to be the cause. The 'compiled' delegate has no such thunk; the delegate points directly to the compiled lambda body.
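The precode thunk itself is only visible in a native debugger, but the fact that the two delegates wrap different kinds of methods can be observed from managed code through Delegate.Method and Delegate.Target. The following is a minimal sketch; DelegateInspection and Describe are hypothetical names, and the exact strings printed depend on the compiler and runtime version.

```csharp
using System;
using System.Linq.Expressions;
using System.Reflection;

public static class DelegateInspection
{
    // Lambda compiled to IL by the C# compiler.
    public static readonly Func<string, int> Declared = x => x.Length;

    // Lambda compiled at run time from an expression tree.
    public static readonly Func<string, int> Compiled = Compile(x => x.Length);

    static Func<string, int> Compile(Expression<Func<string, int>> e) => e.Compile();

    // Summarizes which method a delegate wraps and what object it is bound to.
    public static string Describe(Delegate d)
    {
        MethodInfo m = d.Method;
        string owner = m.DeclaringType?.FullName ?? "(no declaring type)";
        string target = d.Target?.GetType().FullName ?? "(none)";
        return $"{owner}.{m.Name}, target: {target}";
    }
}
```

Printing DelegateInspection.Describe for each of the two delegates should show a compiler-generated method for Declared and a dynamically generated method for Compiled, in line with the !DumpMD output above.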