
Why Is a Compiled Delegate Faster Than a Declared Delegate?

To start with, this is not the same as "Why is Func<> created from Expression<Func<>> slower than Func<> declared directly?"; surprisingly, it is just the opposite. Additionally, all of the links and questions that I found while researching this issue date from the 2010-2012 period, so I have decided to open a new question here to see whether there is some discussion to be had about the current state of delegate behavior in the .NET ecosystem.

That said, I am using .NET Core 2.0 and .NET 4.7.1 and am seeing some curious performance metrics for delegates that are created from a compiled expression versus delegates that are described and declared as a CLR object.

For some context on how I stumbled upon this issue, I was doing a test involving a selection of data in arrays of 1,000 and 10,000 objects, and noticed that if I used a compiled expression it was getting faster results across the board. I managed to boil this down to a very simple project that reproduces this issue which you can find here:

https://github.com/Mike-EEE/StackOverflow.Performance.Delegates

For the testing, I have two sets of benchmarks, each pairing a compiled delegate with a declared delegate, resulting in four core benchmarks in total.

The first delegate set consists of an empty delegate that returns a null string. The second set is a delegate that has a simple expression in its body. I wanted to demonstrate that this issue occurs with the simplest of delegates as well as with ones that have a defined body.

These tests are then run on the CLR runtime and the .NET Core runtime via the excellent BenchmarkDotNet performance library, resulting in eight total benchmarks. Additionally, I make use of the just-as-excellent BenchmarkDotNet disassembly diagnoser to emit the disassembly produced during the JIT of the benchmark methods. I share the results below.

Here is the code that runs the benchmarks. You can see that it is very straightforward:

[CoreJob, ClrJob, DisassemblyDiagnoser(true, printSource: true)]
public class Delegates
{
    readonly DelegatePair<string, string> _empty;
    readonly DelegatePair<string, int>    _expression;
    readonly string                       _message;

    public Delegates() : this(new DelegatePair<string, string>(_ => default, _ => default),
                              new DelegatePair<string, int>(x => x.Length, x => x.Length)) {}

    public Delegates(DelegatePair<string, string> empty, DelegatePair<string, int> expression,
                     string message = "Hello World!")
    {
        _empty      = empty;
        _expression = expression;
        _message    = message;
        EmptyDeclared();
        EmptyCompiled();
        ExpressionDeclared();
        ExpressionCompiled();
    }

    [Benchmark]
    public void EmptyDeclared() => _empty.Declared(default);

    [Benchmark]
    public void EmptyCompiled() => _empty.Compiled(default);

    [Benchmark]
    public void ExpressionDeclared() => _expression.Declared(_message);

    [Benchmark]
    public void ExpressionCompiled() => _expression.Compiled(_message);
}

These are the results I see in BenchmarkDotNet:

BenchmarkDotNet=v0.10.14, OS=Windows 10.0.16299.371 (1709/FallCreatorsUpdate/Redstone3)
Intel Core i7-4820K CPU 3.70GHz (Haswell), 1 CPU, 8 logical and 8 physical cores
.NET Core SDK=2.1.300-preview2-008533
  [Host] : .NET Core 2.0.7 (CoreCLR 4.6.26328.01, CoreFX 4.6.26403.03), 64bit RyuJIT
  Clr    : .NET Framework 4.7.1 (CLR 4.0.30319.42000), 64bit RyuJIT-v4.7.2633.0
  Core   : .NET Core 2.0.7 (CoreCLR 4.6.26328.01, CoreFX 4.6.26403.03), 64bit RyuJIT


             Method |  Job | Runtime |      Mean |     Error |    StdDev |
------------------- |----- |-------- |----------:|----------:|----------:|
      EmptyDeclared |  Clr |     Clr | 1.3691 ns | 0.0302 ns | 0.0282 ns |
      EmptyCompiled |  Clr |     Clr | 1.1851 ns | 0.0381 ns | 0.0357 ns |
 ExpressionDeclared |  Clr |     Clr | 1.3805 ns | 0.0314 ns | 0.0294 ns |
 ExpressionCompiled |  Clr |     Clr | 1.1431 ns | 0.0396 ns | 0.0371 ns |
      EmptyDeclared | Core |    Core | 1.5733 ns | 0.0329 ns | 0.0308 ns |
      EmptyCompiled | Core |    Core | 0.9326 ns | 0.0275 ns | 0.0244 ns |
 ExpressionDeclared | Core |    Core | 1.6040 ns | 0.0394 ns | 0.0368 ns |
 ExpressionCompiled | Core |    Core | 0.9380 ns | 0.0485 ns | 0.0631 ns |

Do note that the benchmarks that make use of a compiled delegate are consistently faster.

Finally, here are the results of the disassembly encountered for each benchmark:

<table>
<thead>
<tr><th colspan="2">Delegates.EmptyDeclared</th></tr>
<tr>
<th>.NET Framework 4.7.1 (CLR 4.0.30319.42000), 64bit RyuJIT-v4.7.2633.0</th>
<th>.NET Core 2.0.7 (CoreCLR 4.6.26328.01, CoreFX 4.6.26403.03), 64bit RyuJIT</th>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align:top;"><pre><code>
00007ffd`4f8f0ea0 StackOverflow.Performance.Delegates.Delegates.EmptyDeclared()
		public void EmptyDeclared() => _empty.Declared(default);
                                 ^^^^^^^^^^^^^^^^^^^^^^^^
00007ffd`4f8f0ea4 4883c110        add     rcx,10h
00007ffd`4f8f0ea8 488b01          mov     rax,qword ptr [rcx]
00007ffd`4f8f0eab 488b4808        mov     rcx,qword ptr [rax+8]
00007ffd`4f8f0eaf 33d2            xor     edx,edx
00007ffd`4f8f0eb1 ff5018          call    qword ptr [rax+18h]
00007ffd`4f8f0eb4 90              nop

</code></pre></td>
<td style="vertical-align:top;"><pre><code>
00007ffd`39c8d8b0 StackOverflow.Performance.Delegates.Delegates.EmptyDeclared()
		public void EmptyDeclared() => _empty.Declared(default);
                                 ^^^^^^^^^^^^^^^^^^^^^^^^
00007ffd`39c8d8b4 4883c110        add     rcx,10h
00007ffd`39c8d8b8 488b01          mov     rax,qword ptr [rcx]
00007ffd`39c8d8bb 488b4808        mov     rcx,qword ptr [rax+8]
00007ffd`39c8d8bf 33d2            xor     edx,edx
00007ffd`39c8d8c1 ff5018          call    qword ptr [rax+18h]
00007ffd`39c8d8c4 90              nop

</code></pre></td>
</tr>
</tbody>
</table>
<table>
<thead>
<tr><th colspan="2">Delegates.EmptyCompiled</th></tr>
<tr>
<th>.NET Framework 4.7.1 (CLR 4.0.30319.42000), 64bit RyuJIT-v4.7.2633.0</th>
<th>.NET Core 2.0.7 (CoreCLR 4.6.26328.01, CoreFX 4.6.26403.03), 64bit RyuJIT</th>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align:top;"><pre><code>
00007ffd`4f8e0ef0 StackOverflow.Performance.Delegates.Delegates.EmptyCompiled()
		public void EmptyCompiled() => _empty.Compiled(default);
                                 ^^^^^^^^^^^^^^^^^^^^^^^^
00007ffd`4f8e0ef4 4883c110        add     rcx,10h
00007ffd`4f8e0ef8 488b4108        mov     rax,qword ptr [rcx+8]
00007ffd`4f8e0efc 488b4808        mov     rcx,qword ptr [rax+8]
00007ffd`4f8e0f00 33d2            xor     edx,edx
00007ffd`4f8e0f02 ff5018          call    qword ptr [rax+18h]
00007ffd`4f8e0f05 90              nop

</code></pre></td>
<td style="vertical-align:top;"><pre><code>
00007ffd`39c8d900 StackOverflow.Performance.Delegates.Delegates.EmptyCompiled()
		public void EmptyCompiled() => _empty.Compiled(default);
                                 ^^^^^^^^^^^^^^^^^^^^^^^^
00007ffd`39c8d904 4883c110        add     rcx,10h
00007ffd`39c8d908 488b4108        mov     rax,qword ptr [rcx+8]
00007ffd`39c8d90c 488b4808        mov     rcx,qword ptr [rax+8]
00007ffd`39c8d910 33d2            xor     edx,edx
00007ffd`39c8d912 ff5018          call    qword ptr [rax+18h]
00007ffd`39c8d915 90              nop

</code></pre></td>
</tr>
</tbody>
</table>
<table>
<thead>
<tr><th colspan="2">Delegates.ExpressionDeclared</th></tr>
<tr>
<th>.NET Framework 4.7.1 (CLR 4.0.30319.42000), 64bit RyuJIT-v4.7.2633.0</th>
<th>.NET Core 2.0.7 (CoreCLR 4.6.26328.01, CoreFX 4.6.26403.03), 64bit RyuJIT</th>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align:top;"><pre><code>
00007ffd`4f8e0f20 StackOverflow.Performance.Delegates.Delegates.ExpressionDeclared()
		public void ExpressionDeclared() => _expression.Declared(_message);
                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
00007ffd`4f8e0f24 488d5120        lea     rdx,[rcx+20h]
00007ffd`4f8e0f28 488b02          mov     rax,qword ptr [rdx]
00007ffd`4f8e0f2b 488b5108        mov     rdx,qword ptr [rcx+8]
00007ffd`4f8e0f2f 488b4808        mov     rcx,qword ptr [rax+8]
00007ffd`4f8e0f33 ff5018          call    qword ptr [rax+18h]
00007ffd`4f8e0f36 90              nop

</code></pre></td>
<td style="vertical-align:top;"><pre><code>
00007ffd`39c9d930 StackOverflow.Performance.Delegates.Delegates.ExpressionDeclared()
		public void ExpressionDeclared() => _expression.Declared(_message);
                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
00007ffd`39c9d934 488d5120        lea     rdx,[rcx+20h]
00007ffd`39c9d938 488b02          mov     rax,qword ptr [rdx]
00007ffd`39c9d93b 488b5108        mov     rdx,qword ptr [rcx+8]
00007ffd`39c9d93f 488b4808        mov     rcx,qword ptr [rax+8]
00007ffd`39c9d943 ff5018          call    qword ptr [rax+18h]
00007ffd`39c9d946 90              nop

</code></pre></td>
</tr>
</tbody>
</table>
<table>
<thead>
<tr><th colspan="2">Delegates.ExpressionCompiled</th></tr>
<tr>
<th>.NET Framework 4.7.1 (CLR 4.0.30319.42000), 64bit RyuJIT-v4.7.2633.0</th>
<th>.NET Core 2.0.7 (CoreCLR 4.6.26328.01, CoreFX 4.6.26403.03), 64bit RyuJIT</th>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align:top;"><pre><code>
00007ffd`4f8f0f70 StackOverflow.Performance.Delegates.Delegates.ExpressionCompiled()
		public void ExpressionCompiled() => _expression.Compiled(_message);
                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
00007ffd`4f8f0f74 488d5120        lea     rdx,[rcx+20h]
00007ffd`4f8f0f78 488b4208        mov     rax,qword ptr [rdx+8]
00007ffd`4f8f0f7c 488b5108        mov     rdx,qword ptr [rcx+8]
00007ffd`4f8f0f80 488b4808        mov     rcx,qword ptr [rax+8]
00007ffd`4f8f0f84 ff5018          call    qword ptr [rax+18h]
00007ffd`4f8f0f87 90              nop

</code></pre></td>
<td style="vertical-align:top;"><pre><code>
00007ffd`39c9d980 StackOverflow.Performance.Delegates.Delegates.ExpressionCompiled()
		public void ExpressionCompiled() => _expression.Compiled(_message);
                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
00007ffd`39c9d984 488d5120        lea     rdx,[rcx+20h]
00007ffd`39c9d988 488b4208        mov     rax,qword ptr [rdx+8]
00007ffd`39c9d98c 488b5108        mov     rdx,qword ptr [rcx+8]
00007ffd`39c9d990 488b4808        mov     rcx,qword ptr [rax+8]
00007ffd`39c9d994 ff5018          call    qword ptr [rax+18h]
00007ffd`39c9d997 90              nop

</code></pre></td>
</tr>
</tbody>
</table>

It would seem that the only difference between the declared and compiled delegate disassembly is the address used in their respective first mov operations: [rcx] for declared vs. [rcx+8] for compiled in the Empty benchmarks, and [rdx] vs. [rdx+8] in the Expression benchmarks. I am not yet that well-versed in disassembly, so any context around this would be greatly appreciated. At first glance, it would not seem that this would cause the difference/improvement, and if it does, the natively declared delegate should feature it as well (in other words, a bug).

With all of this stated, the obvious questions to me are:

  1. Is this a known issue and/or bug?
  2. Am I doing something entirely off-base here? (Guess this should be the first question. :))
  3. Is the guidance then to use compiled delegates always wherever possible? As I mentioned earlier, it would seem that the magic that happens in compiled delegates would already be baked into declared delegates, so this is a bit confusing.

For completeness, here is all of the code used in the sample here in its entirety:

sealed class Program
{
    static void Main()
    {
        BenchmarkRunner.Run<Delegates>();
    }
}

[CoreJob, ClrJob, DisassemblyDiagnoser(true, printSource: true)]
public class Delegates
{
    readonly DelegatePair<string, string> _empty;
    readonly DelegatePair<string, int>    _expression;
    readonly string                       _message;

    public Delegates() : this(new DelegatePair<string, string>(_ => default, _ => default),
                              new DelegatePair<string, int>(x => x.Length, x => x.Length)) {}

    public Delegates(DelegatePair<string, string> empty, DelegatePair<string, int> expression,
                     string message = "Hello World!")
    {
        _empty      = empty;
        _expression = expression;
        _message    = message;
        EmptyDeclared();
        EmptyCompiled();
        ExpressionDeclared();
        ExpressionCompiled();
    }

    [Benchmark]
    public void EmptyDeclared() => _empty.Declared(default);

    [Benchmark]
    public void EmptyCompiled() => _empty.Compiled(default);

    [Benchmark]
    public void ExpressionDeclared() => _expression.Declared(_message);

    [Benchmark]
    public void ExpressionCompiled() => _expression.Compiled(_message);
}

public struct DelegatePair<TFrom, TTo>
{
    DelegatePair(Func<TFrom, TTo> declared, Func<TFrom, TTo> compiled)
    {
        Declared = declared;
        Compiled = compiled;
    }

    public DelegatePair(Func<TFrom, TTo> declared, Expression<Func<TFrom, TTo>> expression) :
        this(declared, expression.Compile()) {}

    public Func<TFrom, TTo> Declared { get; }

    public Func<TFrom, TTo> Compiled { get; }
}

Thank you in advance for any assistance that you can provide!

Asked by Mike-E on May 03 '18 at 07:05



1 Answer

Am I doing something entirely off-base here? (Guess this should be the first question. :))

I'm reasonably certain that the disassembly you're seeing is for the benchmark methods only: the instructions needed to load the delegate and its argument, then invoke the delegate. It does not include the body of each delegate.

That's why the only difference is the relative offset in one of the mov instructions: one of the delegates lives at offset 0 in the struct, and the other lives at offset 8. Swap the declaration order of Compiled and Declared, and see how the disassembly changes.
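The swap suggested above could be tried with a variant along these lines. This is only a sketch: SwappedDelegatePair is a hypothetical name (it is not in the original repository), and the in-memory order of a struct's fields is ultimately a runtime decision, so the offsets should be confirmed in the disassembly rather than assumed.

```csharp
using System;
using System.Linq.Expressions;

// Hypothetical variant of DelegatePair with the two auto-properties
// declared in the opposite order (Compiled first, then Declared).
// Assumption: the runtime keeps the backing fields in source order,
// as the original layout suggests; verify this in the disassembly.
public struct SwappedDelegatePair<TFrom, TTo>
{
    public SwappedDelegatePair(Func<TFrom, TTo> declared,
                               Expression<Func<TFrom, TTo>> expression)
    {
        Compiled = expression.Compile();
        Declared = declared;
    }

    // Expected to land at offset 0 after the swap.
    public Func<TFrom, TTo> Compiled { get; }

    // Expected to land at offset 8 after the swap.
    public Func<TFrom, TTo> Declared { get; }
}
```

If the benchmark fields use this type instead, the [rcx] and [rcx+8] loads in the benchmark methods would be expected to trade places, while the timing difference should stay with the delegates themselves.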

I'm not aware of any way to get BenchmarkDotNet to spit out the disassembly for calls deeper down in the call tree. The documentation suggests that setting recursiveDepth to some value n > 1 on [DisassemblyDiagnoser] should do it, but it doesn't seem to work in this case.


Are you saying there is extra disassembly that we are not seeing?

Correct, you are not seeing the disassembly for the delegate bodies. If there is a difference in how they are being compiled, that's where it would be visible.

Are you saying there is extra disassembly that we are not seeing? Since both bodies are exactly the same (or at least, appear to be the same), I am further unclear on how this would be the case.

The bodies are not necessarily the same. For Expression-based lambdas, the C# compiler does not emit the IL for the described expression; rather, it emits a series of Expression factory calls to construct an expression tree at runtime. That expression tree describes code that should be functionally equivalent to the C# expression from which it was generated, but it is compiled by LambdaCompiler at runtime upon calling Compile(). LINQ expression trees are meant to be language-agnostic, and don't necessarily have exact parity with the expressions generated by the C# compiler. Because lambda expressions are compiled by a different (and less sophisticated) compiler, the resulting IL may be a bit different from what the C# compiler would have emitted. For example, the lambda compiler tends to emit more temporary locals than the C# compiler, or at least it did the last time I poked around in the source code.
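To make the "series of Expression factory calls" concrete, here is a rough hand-written equivalent of what is built for x => x.Length when the target type is Expression<Func<string, int>>. The class and method names are made up for illustration, and the code the C# compiler actually generates differs in detail (it references the property getter through metadata rather than by name).

```csharp
using System;
using System.Linq.Expressions;

public static class ExpressionTreeSketch
{
    // Hand-built approximation of:
    //   Expression<Func<string, int>> e = x => x.Length;
    public static Expression<Func<string, int>> BuildLengthExpression()
    {
        // The lambda parameter "x" of type string.
        ParameterExpression x = Expression.Parameter(typeof(string), "x");

        // The body: a property access, x.Length.
        MemberExpression length = Expression.Property(x, nameof(string.Length));

        // Wrap body and parameter into a typed lambda expression tree.
        return Expression.Lambda<Func<string, int>>(length, x);
    }

    // Compile() turns the tree into IL at runtime and returns a delegate
    // that points at the resulting dynamic method.
    public static Func<string, int> CompileLength() =>
        BuildLengthExpression().Compile();
}
```

Calling ExpressionTreeSketch.CompileLength()("Hello World!") returns the string's length, the same as the declared x => x.Length delegate; only the route by which the method body came into existence differs.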

Your best bet for determining the actual disassembly for each delegate may be to load up SOS.dll in the debugger. I tried to do that myself, but I can't seem to figure out how to get it working in VS2017. I never had trouble in the past. I haven't quite come to terms with the new project model in VS2017 yet, and can't figure out how to enable unmanaged debugging.


OK, I got SOS.dll loaded up with WinDbg, and after a bit of Googling, I'm now able to view the IL and disassembly. First, let's take a look at the method descriptors for the lambda bodies. This is the Declared version:

0:000> !DumpMD 000007fe97686148

Method Name:  StackOverflow.Performance.Delegates.Delegates+<>c.<.ctor>b__3_2(System.String)
Class:        000007fe977d14d0
MethodTable:  000007fe97686158
mdToken:      000000000600000e
Module:       000007fe976840c0
IsJitted:     yes
CodeAddr:     000007fe977912b0
Transparency: Critical

And this is the Compiled version:

0:000> !DumpMD 000007fe97689390

Method Name:  DynamicClass.lambda_method(System.Runtime.CompilerServices.Closure, System.String)
Class:        000007fe97689270
MethodTable:  000007fe976892e8
mdToken:      0000000006000000
Module:       000007fe97688af8
IsJitted:     yes
CodeAddr:     000007fe977e0150
Transparency: Transparent

We can dump the IL and see that it is actually the same:

0:000> !DumpIL 000007fe97686148

IL_0000: ldarg.1 
IL_0001: callvirt 6000002 System.String.get_Length()
IL_0006: ret 

0:000> !DumpIL 000007fe97689390

IL_0000: ldarg.1 
IL_0001: callvirt System.String::get_Length 
IL_0006: ret

So, too, is the disassembly:

0:000> !U 000007fe977912b0

Normal JIT generated code
StackOverflow.Performance.Delegates.Delegates+<>c.<.ctor>b__3_2(System.String)
Begin 000007fe977912b0, size 4
W:\dump\DelegateBenchmark\StackOverflow.Performance.Delegates\Delegates.cs @ 14:

000007fe`977912b0 8b4208          mov     eax,dword ptr [rdx+8]
000007fe`977912b3 c3              ret

0:000> !U 000007fe977e0150

Normal JIT generated code
DynamicClass.lambda_method(System.Runtime.CompilerServices.Closure, System.String)
Begin 000007fe977e0150, size 4

000007fe`977e0150 8b4208          mov     eax,dword ptr [rdx+8]
000007fe`977e0153 c3              ret

So, we have the same IL, and the same assembly. Where is the difference coming from? Let's take a look at the actual delegate instances. By that, I don't mean the lambda bodies, but the Delegate objects we use to invoke the lambdas.

0:000> !DumpVC /d 000007fe97686040 0000000002a84410

Name:        StackOverflow.Performance.Delegates.DelegatePair`2[[System.String, mscorlib],[System.Int32, mscorlib]]
MethodTable: 000007fe97686040
EEClass:     000007fe977d12d0
Size:        32(0x20) bytes
File:        W:\dump\DelegateBenchmark\StackOverflow.Performance.Delegates\bin\Release\net461\StackOverflow.Performance.Delegates.exe
Fields:
              MT    Field   Offset                 Type VT     Attr            Value Name
000007fef692e400  4000001        0 ...Int32, mscorlib]]  0 instance 0000000002a8b4d8 <Declared>k__BackingField
000007fef692e400  4000002        8 ...Int32, mscorlib]]  0 instance 0000000002a8d3f8 <Compiled>k__BackingField

We have two delegate values: in my case, Declared lives at 02a8b4d8, while Compiled lives at 02a8d3f8 (these addresses are unique to my process). If we dump each of these addresses with !DumpObj and look for the _methodPtr value, we can see the addresses for the compiled methods. We can then dump the assembly with !U:

0:000> !U 7fe977e0150 

Normal JIT generated code
DynamicClass.lambda_method(System.Runtime.CompilerServices.Closure, System.String)
Begin 000007fe977e0150, size 4

000007fe`977e0150 8b4208          mov     eax,dword ptr [rdx+8]
000007fe`977e0153 c3              ret

Ok, for Compiled, we can see that we're calling directly into the lambda body. Nice. But when we dump the disassembly for the Declared version, we see something different:

0:000> !U 7fe977901d8 

Unmanaged code

000007fe`977901d8 e8f326635f      call    clr!PrecodeFixupThunk (000007fe`f6dc28d0)
000007fe`977901dd 5e              pop     rsi
000007fe`977901de 0400            add     al,0
000007fe`977901e0 286168          sub     byte ptr [rcx+68h],ah
000007fe`977901e3 97              xchg    eax,edi
000007fe`977901e4 fe07            inc     byte ptr [rdi]
000007fe`977901e6 0000            add     byte ptr [rax],al
000007fe`977901e8 0000            add     byte ptr [rax],al
000007fe`977901ea 0000            add     byte ptr [rax],al
000007fe`977901ec 0000            add     byte ptr [rax],al

Hello there. I remember seeing references to clr!PrecodeFixupThunk in a blog post by Matt Warren. My understanding is that the entry point for a normal IL method (as opposed to a dynamic method like our LINQ-based method) calls into a fixup method that invokes the JIT on the first invocation, then calls into the JITed method on subsequent invocations. The additional overhead of that 'thunk' when invoking the 'declared' delegate would appear to be the cause. The 'compiled' delegate has no such thunk; the delegate points directly to the compiled lambda body.
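Some of this difference can be seen from managed code without a debugger. The sketch below (the DelegateInspection class and its members are hypothetical helpers, not part of the benchmark project) prints what each kind of delegate is bound to: the declared lambda is an ordinary method on a compiler-generated class, while the compiled one is a dynamic method with no declaring type whose first argument is a Closure object. It does not show the precode thunk itself; that still needs SOS as above.

```csharp
using System;
using System.Linq.Expressions;

public static class DelegateInspection
{
    // Summarizes the method and target object a delegate is bound to.
    public static string Describe(Delegate d)
    {
        // Target is the instance the method is invoked on (null for static methods).
        string target = d.Target?.GetType().FullName ?? "(null)";

        // Dynamic methods produced by Expression.Compile() have no declaring type.
        string declaring = d.Method.DeclaringType?.FullName ?? "(none)";

        return $"Method={d.Method.Name}, DeclaringType={declaring}, Target={target}";
    }

    // A lambda compiled ahead of time by the C# compiler.
    public static Func<string, int> Declared() => x => x.Length;

    // The same lambda, compiled at runtime from an expression tree.
    public static Func<string, int> Compiled()
    {
        Expression<Func<string, int>> expression = x => x.Length;
        return expression.Compile();
    }
}
```

Printing DelegateInspection.Describe(DelegateInspection.Declared()) and DelegateInspection.Describe(DelegateInspection.Compiled()) side by side should show the two different bindings; the exact names vary by compiler and runtime version.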

Answered by Mike Strobel on Nov 11 '22 at 18:11