Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Avoiding the overhead of C# virtual calls

I have a few heavily optimized math functions that take 1-2 nanoseconds to complete. These functions are called hundreds of millions of times per second, so call overhead is a concern, despite the already-excellent performance.

In order to keep the program maintainable, the classes that provide these methods inherit an IMathFunction interface, so that other objects can directly store a specific math function and use it when needed.

public interface IMathFunction {   double Calculate(double input);   double Derivate(double input); }  public SomeObject {   // Note: There are cases where this is mutable   private readonly IMathFunction mathFunction_;     public double SomeWork(double input, double step)   {     var f = mathFunction_.Calculate(input);     var dv = mathFunction_.Derivate(input);     return f - (dv * step);   } } 

This interface is causing an enormous overhead compared to a direct call due to how the consuming code uses it. A direct call takes 1-2ns, whereas the virtual interface call takes 8-9ns. Evidently, the presence of the interface and its subsequent translation of the virtual call is the bottleneck for this scenario.

I would like to retain both maintainability and performance if possible. Is there a way I can resolve the virtual function to a direct call when the object is instantiated so that all subsequent calls are able to avoid the overhead? I assume this would involve creating delegates with IL, but I wouldn't know where to start with that.

like image 362
Haus Avatar asked Dec 14 '18 19:12

Haus


2 Answers

So this has obvious limitations and should not be used all the time anywhere you have an interface, but if you have a place where perf really needs to be maximized you can use generics:

public SomeObject<TMathFunction> where TMathFunction: struct, IMathFunction  {   private readonly TMathFunction mathFunction_;    public double SomeWork(double input, double step)   {     var f = mathFunction_.Calculate(input);     var dv = mathFunction_.Derivate(input);     return f - (dv * step);   } } 

And instead of passing an interface, pass your implementation as TMathFunction. This will avoid vtable lookups due to an interface and also allow inlining.

Note the use of struct is important here, as generics will otherwise access the class via the interface.

Some implementation:

I made a simple implementation of IMathFunction for testing:

class SomeImplementationByRef : IMathFunction {     public double Calculate(double input)     {         return input + input;     }      public double Derivate(double input)     {         return input * input;     } } 

... as well as a struct version and an abstract version.

So, here's what happens with the interface version. You can see it is relatively inefficient because it performs two levels of indirection:

    return obj.SomeWork(input, step); sub         esp,40h   vzeroupper   vmovaps     xmmword ptr [rsp+30h],xmm6   vmovaps     xmmword ptr [rsp+20h],xmm7   mov         rsi,rcx vmovsd      qword ptr [rsp+60h],xmm2   vmovaps     xmm6,xmm1 mov         rcx,qword ptr [rsi+8]          ; load mathFunction_ into rcx. vmovaps     xmm1,xmm6   mov         r11,7FFED7980020h              ; load vtable address of the IMathFunction.Calculate function. cmp         dword ptr [rcx],ecx   call        qword ptr [r11]                ; call IMathFunction.Calculate function which will call the actual Calculate via vtable. vmovaps     xmm7,xmm0 mov         rcx,qword ptr [rsi+8]          ; load mathFunction_ into rcx. vmovaps     xmm1,xmm6   mov         r11,7FFED7980028h              ; load vtable address of the IMathFunction.Derivate function. cmp         dword ptr [rcx],ecx   call        qword ptr [r11]                ; call IMathFunction.Derivate function which will call the actual Derivate via vtable. vmulsd      xmm0,xmm0,mmword ptr [rsp+60h] ; dv * step vsubsd      xmm7,xmm7,xmm0                 ; f - (dv * step) vmovaps     xmm0,xmm7   vmovaps     xmm6,xmmword ptr [rsp+30h]   vmovaps     xmm7,xmmword ptr [rsp+20h]   add         rsp,40h   pop         rsi   ret   

Here's an abstract class. It's a little more efficient but only negligibly:

        return obj.SomeWork(input, step);  sub         esp,40h    vzeroupper    vmovaps     xmmword ptr [rsp+30h],xmm6    vmovaps     xmmword ptr [rsp+20h],xmm7    mov         rsi,rcx    vmovsd      qword ptr [rsp+60h],xmm2    vmovaps     xmm6,xmm1    mov         rcx,qword ptr [rsi+8]           ; load mathFunction_ into rcx.  vmovaps     xmm1,xmm6    mov         rax,qword ptr [rcx]             ; load object type data from mathFunction_.  mov         rax,qword ptr [rax+40h]         ; load address of vtable into rax.  call        qword ptr [rax+20h]             ; call Calculate via offset 0x20 of vtable.  vmovaps     xmm7,xmm0    mov         rcx,qword ptr [rsi+8]           ; load mathFunction_ into rcx.  vmovaps     xmm1,xmm6    mov         rax,qword ptr [rcx]             ; load object type data from mathFunction_.  mov         rax,qword ptr [rax+40h]         ; load address of vtable into rax.  call        qword ptr [rax+28h]             ; call Derivate via offset 0x28 of vtable.  vmulsd      xmm0,xmm0,mmword ptr [rsp+60h]  ; dv * step  vsubsd      xmm7,xmm7,xmm0                  ; f - (dv * step)  vmovaps     xmm0,xmm7  vmovaps     xmm6,xmmword ptr [rsp+30h]    vmovaps     xmm7,xmmword ptr [rsp+20h]    add         rsp,40h    pop         rsi    ret   

So both an interface and an abstract class rely heavily on branch target prediction to have acceptable performance. Even then, you can see there's quite a lot more going into it, so the best-case is still relatively slow while the worst-case is a stalled pipeline due to a mispredict.

And finally here's the generic version with a struct. You can see it's massively more efficient because everything has been fully inlined so there's no branch prediction involved. It also has the nice side effect of removing most of the stack/parameter management that was in there too, so the code becomes very compact:

    return obj.SomeWork(input, step); push        rax   vzeroupper   movsx       rax,byte ptr [rcx+8]   vmovaps     xmm0,xmm1   vaddsd      xmm0,xmm0,xmm1  ; Calculate - got inlined vmulsd      xmm1,xmm1,xmm1  ; Derivate - got inlined vmulsd      xmm1,xmm1,xmm2  ; dv * step vsubsd      xmm0,xmm0,xmm1  ; f -  add         rsp,8   ret   
like image 82
Cory Nelson Avatar answered Oct 11 '22 02:10

Cory Nelson


I would assign the methods to delegates. This allows you to still program against the interface, while avoiding the interface method resolution.

public SomeObject {     private readonly Func<double, double> _calculate;     private readonly Func<double, double> _derivate;      public SomeObject(IMathFunction mathFunction)     {         _calculate = mathFunction.Calculate;         _derivate = mathFunction.Derivate;     }      public double SomeWork(double input, double step)     {         var f = _calculate(input);         var dv = _derivate(input);         return f - (dv * step);     } } 

In response to @CoryNelson's comment I made tests so see what the impact really is. I have sealed the function class, but this seems to make absolutely no difference since my methods are not virtual.

Test Results (mean time of 100 million iterations in ns) with the empty method time subtracted in braces:

Empty Work method: 1.48
Interface: 5.69 (4.21)
Delegates: 5.78 (4.30)
Sealed Class: 2.10 (0.62)
Class: 2.12 (0.64)

The delegate version time is about the same as for the interface version (the exact times vary from test execution to test execution). While working against the class is about 6.8 x faster (comparing times minus the empty work method time)! This means that my suggestion to work with delegates was not helpful!

What surprised me was, that I expected a much longer execution time for the interface version. Since this kind of test does not represent the exact context of the OP's code, its validity is limited.

static class TimingInterfaceVsDelegateCalls {     const int N = 100_000_000;     const double msToNs = 1e6 / N;      static SquareFunctionSealed _mathFunctionClassSealed;     static SquareFunction _mathFunctionClass;     static IMathFunction _mathFunctionInterface;     static Func<double, double> _calculate;     static Func<double, double> _derivate;      static TimingInterfaceVsDelegateCalls()     {         _mathFunctionClass = new SquareFunction();         _mathFunctionClassSealed = new SquareFunctionSealed();         _mathFunctionInterface = _mathFunctionClassSealed;         _calculate = _mathFunctionInterface.Calculate;         _derivate = _mathFunctionInterface.Derivate;     }      interface IMathFunction     {         double Calculate(double input);         double Derivate(double input);     }      sealed class SquareFunctionSealed : IMathFunction     {         public double Calculate(double input)         {             return input * input;         }          public double Derivate(double input)         {             return 2 * input;         }     }      class SquareFunction : IMathFunction     {         public double Calculate(double input)         {             return input * input;         }          public double Derivate(double input)         {             return 2 * input;         }     }      public static void Test()     {         var stopWatch = new Stopwatch();          stopWatch.Start();         for (int i = 0; i < N; i++) {             double result = SomeWorkEmpty(i);         }         stopWatch.Stop();         double emptyTime = stopWatch.ElapsedMilliseconds * msToNs;         Console.WriteLine($"Empty Work method: {emptyTime:n2}");          stopWatch.Restart();         for (int i = 0; i < N; i++) {             double result = SomeWorkInterface(i);         }         stopWatch.Stop();         PrintResult("Interface", stopWatch.ElapsedMilliseconds, emptyTime);          stopWatch.Restart();         for (int i = 0; i < N; i++) {             double result = SomeWorkDelegate(i);         }         stopWatch.Stop();         PrintResult("Delegates", stopWatch.ElapsedMilliseconds, emptyTime);          stopWatch.Restart();         for (int i = 0; i < N; i++) {             double result = SomeWorkClassSealed(i);         }         stopWatch.Stop();         PrintResult("Sealed Class", stopWatch.ElapsedMilliseconds, emptyTime);          stopWatch.Restart();         for (int i = 0; i < N; i++) {             double result = SomeWorkClass(i);         }         stopWatch.Stop();         PrintResult("Class", stopWatch.ElapsedMilliseconds, emptyTime);     }      private static void PrintResult(string text, long elapsed, double emptyTime)     {         Console.WriteLine($"{text}: {elapsed * msToNs:n2} ({elapsed * msToNs - emptyTime:n2})");     }      [MethodImpl(MethodImplOptions.NoInlining)]     private static double SomeWorkEmpty(int i)     {         return 0.0;     }      [MethodImpl(MethodImplOptions.NoInlining)]     private static double SomeWorkInterface(int i)     {         double f = _mathFunctionInterface.Calculate(i);         double dv = _mathFunctionInterface.Derivate(i);         return f - (dv * 12.34534);     }      [MethodImpl(MethodImplOptions.NoInlining)]     private static double SomeWorkDelegate(int i)     {         double f = _calculate(i);         double dv = _derivate(i);         return f - (dv * 12.34534);     }      [MethodImpl(MethodImplOptions.NoInlining)]     private static double SomeWorkClassSealed(int i)     {         double f = _mathFunctionClassSealed.Calculate(i);         double dv = _mathFunctionClassSealed.Derivate(i);         return f - (dv * 12.34534);     }      [MethodImpl(MethodImplOptions.NoInlining)]     private static double SomeWorkClass(int i)     {         double f = _mathFunctionClass.Calculate(i);         double dv = _mathFunctionClass.Derivate(i);         return f - (dv * 12.34534);     } } 

The idea of [MethodImpl(MethodImplOptions.NoInlining)] is to prevent the compiler from calculating the addresses of the methods before the loop if the method was inlined.

like image 28
Olivier Jacot-Descombes Avatar answered Oct 11 '22 01:10

Olivier Jacot-Descombes