I am working on some very performance-critical code and have discovered that calling an anonymous method through a custom delegate type performs worse than calling the same code through a Func delegate.
using System;
using BenchmarkDotNet.Attributes;

public class DelegateTests
{
    // Custom delegate type with the same signature as Func<string, int>.
    public delegate int GetValueDelegate(string test);

    private Func<string, int> getValueFunc;
    private GetValueDelegate getValueDelegate;

    public DelegateTests()
    {
        // Both fields wrap an identical lambda body.
        getValueDelegate = (s) => 42;
        getValueFunc = (s) => 42;
    }

    [Benchmark]
    public int CallWithDelegate()
    {
        return getValueDelegate.Invoke("TEST");
    }

    [Benchmark]
    public int CallWithFunc()
    {
        return getValueFunc.Invoke("TEST");
    }
}
BenchmarkDotNet gives:
// * Summary *
BenchmarkDotNet=v0.10.4, OS=Windows 10.0.14393
Processor=Intel Core i7-4770HQ CPU 2.20GHz (Haswell), ProcessorCount=2
Frequency=10000000 Hz, Resolution=100.0000 ns, Timer=UNKNOWN
[Host] : Clr 4.0.30319.42000, 32bit LegacyJIT-v4.6.1637.0
RyuJitX64 : Clr 4.0.30319.42000, 64bit RyuJIT-v4.6.1637.0
Job=RyuJitX64 Jit=RyuJit Platform=X64
Method | Mean | Error | StdDev |
----------------- |----------:|----------:|----------:|
CallWithDelegate | 0.9926 ns | 0.0559 ns | 0.0783 ns |
CallWithFunc | 0.8763 ns | 0.0168 ns | 0.0131 ns |
// * Hints *
Outliers
DelegateTests.CallWithFunc: RyuJitX64 -> 3 outliers were removed
// * Legends *
Mean : Arithmetic mean of all measurements
Error : Half of 99.9% confidence interval
StdDev : Standard deviation of all measurements
// ***** BenchmarkRunner: End *****
As we can see, calling the function using a Func delegate is faster than invoking the function using the GetValueDelegate.
I'm trying to find evidence as to why it behaves this way.
Looking at the JIT-optimized machine code for the GetValueDelegate call:
26: return getValueDelegate.Invoke("TEST");
00E105C0 8B 49 08 mov ecx,dword ptr [ecx+8]
00E105C3 8B 15 C4 22 71 03 mov edx,dword ptr ds:[37122C4h]
00E105C9 8B 41 0C mov eax,dword ptr [ecx+0Ch]
00E105CC 8B 49 04 mov ecx,dword ptr [ecx+4]
00E105CF FF D0 call eax
00E105D1 C3 ret
compared to the code for the Func call:
32: return getValueFunc.Invoke("TEST");
00E10608 8B 49 04 mov ecx,dword ptr [ecx+4]
00E1060B 8B 15 C4 22 71 03 mov edx,dword ptr ds:[37122C4h]
00E10611 8B 41 0C mov eax,dword ptr [ecx+0Ch]
00E10614 8B 49 04 mov ecx,dword ptr [ecx+4]
00E10617 FF D0 call eax
00E10619 C3 ret
They look pretty much alike: the only difference is the field offset in the first instruction ([ecx+8] versus [ecx+4]). I'm starting to think that the difference could be inside the Invoke method of the two delegates. They both derive from MulticastDelegate, which is a requirement for all delegates on the CLR. Why is one faster than the other?
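To compare the two delegate types beyond the call site, a small reflection check can help. This is only a minimal sketch: DelegateShape and its helper methods are hypothetical names, and GetValueDelegate is declared at top level here for brevity. It shows that both types share the same base type and that Invoke is marked as runtime-implemented in both cases; it does not by itself explain any timing difference.

```csharp
using System;
using System.Reflection;

public delegate int GetValueDelegate(string test);

// Hypothetical helper class, for illustration only.
public static class DelegateShape
{
    // True when the delegate type derives directly from MulticastDelegate.
    public static bool DerivesFromMulticastDelegate(Type delegateType) =>
        delegateType.BaseType == typeof(MulticastDelegate);

    // True when Invoke has no IL body and is supplied by the runtime.
    public static bool InvokeIsRuntimeImplemented(Type delegateType)
    {
        MethodInfo invoke = delegateType.GetMethod("Invoke");
        MethodImplAttributes flags = invoke.GetMethodImplementationFlags();
        return (flags & MethodImplAttributes.CodeTypeMask) == MethodImplAttributes.Runtime;
    }

    public static void Main()
    {
        foreach (Type t in new[] { typeof(GetValueDelegate), typeof(Func<string, int>) })
        {
            Console.WriteLine(t.Name
                + ": derives from MulticastDelegate = " + DerivesFromMulticastDelegate(t)
                + ", Invoke is runtime-implemented = " + InvokeIsRuntimeImplemented(t));
        }
    }
}
```

If both types report the same values, the two Invoke methods have the same shape in metadata, so any remaining difference would have to come from somewhere else.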
Here are the numbers using LegacyJitX86. Please note that I am only interested in WHY there is a difference. BTW, swapping the order of the benchmark methods or of the fields does not affect the result.
// * Summary *
BenchmarkDotNet=v0.10.4, OS=Windows 10.0.14393
Processor=Intel Core i7-4770HQ CPU 2.20GHz (Haswell), ProcessorCount=2
Frequency=10000000 Hz, Resolution=100.0000 ns, Timer=UNKNOWN
[Host] : Clr 4.0.30319.42000, 32bit LegacyJIT-v4.6.1637.0
LegacyJitX86 : Clr 4.0.30319.42000, 32bit LegacyJIT-v4.6.1637.0
Job=LegacyJitX86 Jit=LegacyJit Platform=X86
Runtime=Clr
Method | Mean | Error | StdDev |
----------------- |----------:|----------:|----------:|
CallWithDelegate | 2.3385 ns | 0.0361 ns | 0.0320 ns |
CallWithFunc | 2.0144 ns | 0.0410 ns | 0.0384 ns |
// * Hints *
Outliers
DelegateTests.CallWithDelegate: LegacyJitX86 -> 1 outlier was removed
// * Legends *
Mean : Arithmetic mean of all measurements
Error : Half of 99.9% confidence interval
StdDev : Standard deviation of all measurements
// ***** BenchmarkRunner: End *****
Running them with current versions of everything on my machine, I find no consistent winner.
In this run, CallWithFunc's mean time plus error was 99% of CallWithDelegate's mean time minus error, so I felt like there might be some remnant of the old behavior...
BenchmarkDotNet=v0.12.0, OS=Windows 10.0.18363
Intel Core i7-6850K CPU 3.60GHz (Skylake), 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=3.1.101
[Host] : .NET Core 3.1.1 (CoreCLR 4.700.19.60701, CoreFX 4.700.19.60801), X64 RyuJIT
DefaultJob : .NET Core 3.1.1 (CoreCLR 4.700.19.60701, CoreFX 4.700.19.60801), X64 RyuJIT
| Method | Mean | Error | StdDev |
|----------------- |---------:|----------:|----------:|
| CallWithDelegate | 1.103 ns | 0.0024 ns | 0.0019 ns |
| CallWithFunc | 1.090 ns | 0.0050 ns | 0.0044 ns |
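The 99% figure compares the upper bound of one interval with the lower bound of the other. A minimal sketch of that arithmetic, using the Mean and Error values from the table above (IntervalCheck and its methods are hypothetical names, not part of BenchmarkDotNet):

```csharp
using System;

// Hypothetical helper class, for illustration only.
public static class IntervalCheck
{
    // Ratio of the faster method's upper bound to the slower method's lower bound.
    public static double UpperToLowerRatio(double fastMean, double fastErr,
                                           double slowMean, double slowErr) =>
        (fastMean + fastErr) / (slowMean - slowErr);

    // True when [meanA - errA, meanA + errA] and [meanB - errB, meanB + errB] share a point.
    public static bool Overlaps(double meanA, double errA, double meanB, double errB) =>
        meanA - errA <= meanB + errB && meanB - errB <= meanA + errA;

    public static void Main()
    {
        // CallWithFunc: 1.090 ns +/- 0.0050 ns; CallWithDelegate: 1.103 ns +/- 0.0024 ns.
        double ratio = UpperToLowerRatio(1.090, 0.0050, 1.103, 0.0024);
        Console.WriteLine("ratio = " + ratio);
        Console.WriteLine("intervals overlap = " + Overlaps(1.090, 0.0050, 1.103, 0.0024));
    }
}
```

A ratio just below 1 means the two intervals do not overlap, but only barely.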
But when I simply ran it again, the winner flipped, so my guess is that if some factor favors one or the other, it is probably something very specific. For example, maybe the faster callback happens to live on the same cache line as some other important thing, and the slower callback might be the only thing keeping its cache line in the CPU cache (in which case, changing the order of the fields and marking the class with [StructLayout(LayoutKind.Sequential)] might reveal something).
BenchmarkDotNet=v0.12.0, OS=Windows 10.0.18363
Intel Core i7-6850K CPU 3.60GHz (Skylake), 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=3.1.101
[Host] : .NET Core 3.1.1 (CoreCLR 4.700.19.60701, CoreFX 4.700.19.60801), X64 RyuJIT
DefaultJob : .NET Core 3.1.1 (CoreCLR 4.700.19.60701, CoreFX 4.700.19.60801), X64 RyuJIT
| Method | Mean | Error | StdDev |
|----------------- |---------:|----------:|----------:|
| CallWithDelegate | 1.062 ns | 0.0036 ns | 0.0030 ns |
| CallWithFunc | 1.094 ns | 0.0039 ns | 0.0034 ns |
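The field-layout experiment suggested above could be set up roughly as follows. This is a sketch under assumptions: ReorderedDelegateTests is a hypothetical name, the [Benchmark] attributes are left out so the snippet compiles without the BenchmarkDotNet package, and I have not verified that the runtime honors the requested field order for a class that holds reference-type fields.

```csharp
using System;
using System.Runtime.InteropServices;

public delegate int GetValueDelegate(string test);

// Requests declaration-order layout. Assumption: the runtime may still pick its
// own layout for a class containing reference-type fields, so check the result.
[StructLayout(LayoutKind.Sequential)]
public class ReorderedDelegateTests
{
    // Field order swapped relative to the original DelegateTests class.
    private GetValueDelegate getValueDelegate;
    private Func<string, int> getValueFunc;

    public ReorderedDelegateTests()
    {
        getValueDelegate = s => 42;
        getValueFunc = s => 42;
    }

    // In a real run, both methods would carry [Benchmark] as in the question.
    public int CallWithDelegate() => getValueDelegate.Invoke("TEST");

    public int CallWithFunc() => getValueFunc.Invoke("TEST");
}
```

If the timings swap along with the fields, that would point at layout rather than at the delegate types themselves.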