
C#: Why does using an instance method as a delegate allocate Gen0 temp objects but run 10% faster than a cached delegate?

I am currently optimizing a low-level library and have found a counter-intuitive case. The commit that prompted this question is here.

There is a delegate

public delegate void FragmentHandler(UnsafeBuffer buffer, int offset, int length, Header header);

and an instance method

public void OnFragment(IDirectBuffer buffer, int offset, int length, Header header)
{
    _totalBytes.Set(_totalBytes.Get() + length);
}

On this line, if I use the method as a delegate, the program allocates many Gen0 objects for the temporary delegate wrapper, but the performance is 10% better (though not stable):

var fragmentsRead = image.Poll(OnFragment, MessageCountLimit);

If I instead cache the method in a delegate outside the loop like this:

FragmentHandler onFragmentHandler = OnFragment;

then the program does not allocate at all and the numbers are very stable, but it is much slower.
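For context, a minimal sketch of how the cached variant is used in the polling loop (the loop structure and variable names here are illustrative, not the exact sample code):

FragmentHandler onFragmentHandler = OnFragment; // single allocation, outside the loop

while (running)
{
    var fragmentsRead = image.Poll(onFragmentHandler, MessageCountLimit);
    // ...
}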

I looked through the generated IL and it is doing the same thing, but in the latter case newobj is called only once and then the local variable is loaded.

With the cached delegate (newobj at IL_0034, outside the loop):

IL_002d: ldarg.0
IL_002e: ldftn instance void Adaptive.Aeron.Samples.IpcThroughput.IpcThroughput/Subscriber::OnFragment(class [Adaptive.Agrona]Adaptive.Agrona.IDirectBuffer, int32, int32, class [Adaptive.Aeron]Adaptive.Aeron.LogBuffer.Header)
IL_0034: newobj instance void [Adaptive.Aeron]Adaptive.Aeron.LogBuffer.FragmentHandler::.ctor(object, native int)
IL_0039: stloc.3
IL_003a: br.s IL_005a
// loop start (head: IL_005a)
    IL_003c: ldloc.0
    IL_003d: ldloc.3
    IL_003e: ldsfld int32 Adaptive.Aeron.Samples.IpcThroughput.IpcThroughput::MessageCountLimit
    IL_0043: callvirt instance int32 [Adaptive.Aeron]Adaptive.Aeron.Image::Poll(class [Adaptive.Aeron]Adaptive.Aeron.LogBuffer.FragmentHandler, int32)
    IL_0048: stloc.s fragmentsRead

With temporary allocations (newobj at IL_0037, inside the loop):

IL_002c: stloc.2
IL_002d: br.s IL_0058
// loop start (head: IL_0058)
    IL_002f: ldloc.0
    IL_0030: ldarg.0
    IL_0031: ldftn instance void Adaptive.Aeron.Samples.IpcThroughput.IpcThroughput/Subscriber::OnFragment(class [Adaptive.Agrona]Adaptive.Agrona.IDirectBuffer, int32, int32, class [Adaptive.Aeron]Adaptive.Aeron.LogBuffer.Header)
    IL_0037: newobj instance void [Adaptive.Aeron]Adaptive.Aeron.LogBuffer.FragmentHandler::.ctor(object, native int)
    IL_003c: ldsfld int32 Adaptive.Aeron.Samples.IpcThroughput.IpcThroughput::MessageCountLimit
    IL_0041: callvirt instance int32 [Adaptive.Aeron]Adaptive.Aeron.Image::Poll(class [Adaptive.Aeron]Adaptive.Aeron.LogBuffer.FragmentHandler, int32)
    IL_0046: stloc.s fragmentsRead

Why is the code with allocations faster here? What is needed to avoid the allocations while keeping the performance?

(testing on .NET 4.5.2/4.6.1, x64, Release, on two different machines)

Update

Here is a standalone example that behaves as expected: the cached delegate performs more than 2x faster (4 seconds vs. 11 seconds). So the question is specific to the referenced project - what subtle issue with the JIT compiler, or something else, could cause the unexpected result?

using System;
using System.Diagnostics;

namespace TestCachedDelegate {

    public delegate int TestDelegate(int first, int second);

    public static class Program {
        static void Main(string[] args)
        {
            var tc = new TestClass();
            tc.Run();
        }

        public class TestClass {

            public void Run() {
                var sw = new Stopwatch();
                sw.Restart();
                for (int i = 0; i < 1000000000; i++) {
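                    // Method group argument: a new TestDelegate wrapping this.Add is allocated on every iteration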
                    CallDelegate(Add, i, i);
                }
                sw.Stop();
                Console.WriteLine("Non-cached: " + sw.ElapsedMilliseconds);
                sw.Restart();
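                // Allocate the delegate once; the same instance is reused for every call in the loop below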
                TestDelegate dlgCached = Add;
                for (int i = 0; i < 1000000000; i++) {
                    CallDelegate(dlgCached, i, i);
                }
                sw.Stop();
                Console.WriteLine("Cached: " + sw.ElapsedMilliseconds);
                Console.ReadLine();
            }

            public int CallDelegate(TestDelegate dlg, int first, int second) {
                return dlg(first, second);
            }

            public int Add(int first, int second) {
                return first + second;
            }

        }
    }
}
asked May 22 '16 by V.B.



1 Answer

So after reading the question much too quickly and thinking it was asking something else, I've finally had some time to sit down and play with the Aeron test in question.

I tried a few things. First of all, I compared the IL and assembly produced and found that there was basically no difference either at the site where we call Poll() or at the site where the handler is actually called.

Secondly, I tried commenting out the code in the Poll() method to confirm that the cached version did actually run faster (which it did).

Thirdly, I tried looking at the CPU counters (cache misses, instructions retired, and branch mispredictions) in the VS profiler but could not see any difference between the two versions other than the fact that the delegate constructor was obviously called more times.

This made me think of a similar case that we ran across when porting Disruptor-net, where we had a test that was running slower than the Java version even though we were sure we weren't doing anything more costly. The reason for the "slowness" of that test was that we were actually faster, therefore batched less, and therefore our throughput was lower.

If you insert a Thread.SpinWait(5) just before the call to Poll(), you will see the same or better performance than with the non-cached version.
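A sketch of that experiment (the surrounding loop is omitted; variable names follow the sample, so treat this as illustrative rather than exact code):

Thread.SpinWait(5); // slow the subscriber slightly so each Poll sees a larger batch
var fragmentsRead = image.Poll(onFragmentHandler, MessageCountLimit);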

Original answer to the question, which I thought at the time was "why is using an instance method delegate slower than caching the delegate manually?":

The clue is in the question. It's an instance method, so it implicitly captures the this member, and the fact that this is captured means the compiler cannot cache the delegate. Given that this will never change during the lifetime of the cached delegate, it should be cacheable.

If you expand the method group to (first, second) => this.Add(first, second), the capture becomes more obvious.
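Concretely, in the standalone example above, the per-call version is compiled as if it were written like this (an illustrative desugaring, not literal compiler output):

// inside the first loop: a new delegate wrapping this.Add on every call
CallDelegate(new TestDelegate(this.Add), i, i);

// cached version: the single new TestDelegate(this.Add) happens once, outside the loop
CallDelegate(dlgCached, i, i);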

Note that the Roslyn team is working on fixing this: https://github.com/dotnet/roslyn/issues/5835

answered Oct 01 '22 by Slugart