I am currently optimizing a low-level library and have found a counter-intuitive case. The commit that caused this question is here.
There is a delegate
public delegate void FragmentHandler(UnsafeBuffer buffer, int offset, int length, Header header);
and an instance method
public void OnFragment(IDirectBuffer buffer, int offset, int length, Header header)
{
    _totalBytes.Set(_totalBytes.Get() + length);
}
On the following line, if I use the method group directly as a delegate, the program makes many Gen 0 allocations for the temporary delegate wrapper, but performance is about 10% faster (though not stable).
var fragmentsRead = image.Poll(OnFragment, MessageCountLimit);
If I instead cache the method in a delegate outside the loop like this:
FragmentHandler onFragmentHandler = OnFragment;
then the program does not allocate at all and the numbers are very stable, but it is much slower.
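For reference, a common zero-allocation variant of the same idea (my sketch, not from the referenced commit; the delegate signature and member names are simplified stand-ins for the Aeron types) caches the delegate in a readonly field so the method-group conversion happens exactly once per instance:

```csharp
using System;

// Simplified stand-in for the Aeron FragmentHandler signature.
public delegate void FragmentHandler(byte[] buffer, int offset, int length);

public class Subscriber
{
    // The conversion OnFragment -> FragmentHandler allocates a delegate
    // object; doing it once in the constructor and reusing the field
    // avoids the per-call Gen 0 garbage.
    private readonly FragmentHandler _onFragmentHandler;

    private long _totalBytes;

    public Subscriber()
    {
        _onFragmentHandler = OnFragment; // single allocation, here
    }

    private void OnFragment(byte[] buffer, int offset, int length)
    {
        _totalBytes += length;
    }

    // `poll` stands in for Image.Poll; the same delegate instance is
    // passed on every call, so the loop itself allocates nothing.
    public int PollOnce(Func<FragmentHandler, int, int> poll, int limit)
    {
        return poll(_onFragmentHandler, limit);
    }
}
```

This is behaviorally the same as the local-variable caching above; the field form just makes the single-allocation lifetime explicit.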
I looked through the generated IL and it is doing the same thing in both cases, except that in the latter case newobj
is called only once and afterwards the local variable is loaded.
With the cached delegate (newobj at IL_0034):
IL_002d: ldarg.0
IL_002e: ldftn instance void Adaptive.Aeron.Samples.IpcThroughput.IpcThroughput/Subscriber::OnFragment(class [Adaptive.Agrona]Adaptive.Agrona.IDirectBuffer, int32, int32, class [Adaptive.Aeron]Adaptive.Aeron.LogBuffer.Header)
IL_0034: newobj instance void [Adaptive.Aeron]Adaptive.Aeron.LogBuffer.FragmentHandler::.ctor(object, native int)
IL_0039: stloc.3
IL_003a: br.s IL_005a
// loop start (head: IL_005a)
IL_003c: ldloc.0
IL_003d: ldloc.3
IL_003e: ldsfld int32 Adaptive.Aeron.Samples.IpcThroughput.IpcThroughput::MessageCountLimit
IL_0043: callvirt instance int32 [Adaptive.Aeron]Adaptive.Aeron.Image::Poll(class [Adaptive.Aeron]Adaptive.Aeron.LogBuffer.FragmentHandler, int32)
IL_0048: stloc.s fragmentsRead
With temporary allocations (newobj at IL_0037):
IL_002c: stloc.2
IL_002d: br.s IL_0058
// loop start (head: IL_0058)
IL_002f: ldloc.0
IL_0030: ldarg.0
IL_0031: ldftn instance void Adaptive.Aeron.Samples.IpcThroughput.IpcThroughput/Subscriber::OnFragment(class [Adaptive.Agrona]Adaptive.Agrona.IDirectBuffer, int32, int32, class [Adaptive.Aeron]Adaptive.Aeron.LogBuffer.Header)
IL_0037: newobj instance void [Adaptive.Aeron]Adaptive.Aeron.LogBuffer.FragmentHandler::.ctor(object, native int)
IL_003c: ldsfld int32 Adaptive.Aeron.Samples.IpcThroughput.IpcThroughput::MessageCountLimit
IL_0041: callvirt instance int32 [Adaptive.Aeron]Adaptive.Aeron.Image::Poll(class [Adaptive.Aeron]Adaptive.Aeron.LogBuffer.FragmentHandler, int32)
IL_0046: stloc.s fragmentsRead
Why is the code with allocations faster here? What is needed to avoid the allocations but keep the performance?
(testing on .NET 4.5.2/4.6.1, x64, Release, on two different machines)
Update
Here is a standalone example that behaves as expected: the cached delegate performs more than 2x faster, 4 sec vs 11 sec. So the question is specific to the referenced project: what subtle issue with the JIT compiler, or something else, could cause the unexpected result?
using System;
using System.Diagnostics;

namespace TestCachedDelegate
{
    public delegate int TestDelegate(int first, int second);

    public static class Program
    {
        static void Main(string[] args)
        {
            var tc = new TestClass();
            tc.Run();
        }

        public class TestClass
        {
            public void Run()
            {
                var sw = new Stopwatch();

                sw.Restart();
                for (int i = 0; i < 1000000000; i++)
                {
                    CallDelegate(Add, i, i);
                }
                sw.Stop();
                Console.WriteLine("Non-cached: " + sw.ElapsedMilliseconds);

                sw.Restart();
                TestDelegate dlgCached = Add;
                for (int i = 0; i < 1000000000; i++)
                {
                    CallDelegate(dlgCached, i, i);
                }
                sw.Stop();
                Console.WriteLine("Cached: " + sw.ElapsedMilliseconds);

                Console.ReadLine();
            }

            public int CallDelegate(TestDelegate dlg, int first, int second)
            {
                return dlg(first, second);
            }

            public int Add(int first, int second)
            {
                return first + second;
            }
        }
    }
}
So after reading the question much too quickly and thinking it was asking something else, I've finally had some time to sit down and play with the Aeron test in question.
I tried a few things. First of all, I compared the IL and assembler produced and found that there was basically no difference, either at the site where we call Poll()
or at the site where the handler is actually called.
Secondly, I tried commenting out the code in the Poll()
method to confirm that the cached version did actually run faster (which it did).
Thirdly, I tried looking at the CPU counters (cache misses, instructions retired, and branch mis-predictions) in the VS profiler, but could not see any difference between the two versions other than the fact that the delegate constructor was obviously called more times.
This made me think about a similar case that we ran across in porting Disruptor-net, where we had a test that was running slower than the Java version even though we were sure we weren't doing anything more costly. The reason for the "slowness" of the test was that we were actually faster, and therefore batched less, and therefore our throughput was lower.
If you insert a Thread.SpinWait(5) just before the call to Poll(),
you will see the same or better performance than with the non-cached version.
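In the sample's poll loop that would look roughly like this (a sketch only; `running`, `image`, and `onFragmentHandler` are names assumed from the description above, not copied from the referenced project):

```csharp
// Hypothetical sketch of the subscriber loop: the brief spin before Poll()
// gives the publisher time to enqueue more messages, so each Poll() call
// processes a larger batch and overall throughput goes up, even though
// each individual iteration is slightly slower.
while (running)
{
    Thread.SpinWait(5); // throttle slightly to increase batch size
    var fragmentsRead = image.Poll(onFragmentHandler, MessageCountLimit);
}
```

This supports the batching explanation: the "faster" allocating version was effectively self-throttling via its allocations, which let batches grow.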
Original answer to the question, which I thought at the time was "why is using an instance method delegate slower than caching the delegate manually":
The clue is in the question. It's an instance method, and it therefore implicitly captures the this
reference; the fact that this is captured means the compiler cannot cache the delegate. Given that this
will never change during the lifetime of the cached delegate, it should be cacheable.
If you expand the method group to (first, second) => this.Add(first, second),
the capture becomes more obvious.
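Spelled out in the standalone example's terms (a sketch of what the compiler effectively does, not its literal output):

```csharp
// Inside TestClass.Run(): the method group `Add` refers to an instance
// method, so each conversion closes over `this` and allocates:
CallDelegate(Add, i, i);
// is effectively
CallDelegate(new TestDelegate(this.Add), i, i);

// The equivalent lambda makes the capture of `this` visible:
TestDelegate dlg = (first, second) => this.Add(first, second);

// Because `this` is fixed for a given instance, the delegate could in
// principle be hoisted out of the loop and reused, which is exactly what
// the manual caching in the question does.
```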
Note that the Roslyn team is working on fixing this: https://github.com/dotnet/roslyn/issues/5835