The following F# program defines a function that returns the component-wise minimum of two pairs of ints represented as struct tuples, and it takes 1.4s to run:
let [<EntryPoint>] main _ =
    let min a b : int = if a < b then a else b
    let min (struct(a1, b1)) (struct(a2, b2)) = struct(min a1 a2, min b1 b2)
    let mutable x = struct(0, 0)
    for i in 1..100000000 do
        x <- min x (struct(i, i))
    0
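For reference, the timings quoted here come from simply measuring the loop. A minimal sketch of such a harness using System.Diagnostics.Stopwatch (the harness itself and the name minPair are my assumptions, not from the original post):

```fsharp
// Sketch: timing the benchmark loop with Stopwatch.
// The harness is an assumption; the original post only quotes the numbers.
open System.Diagnostics

[<EntryPoint>]
let main _ =
    let min a b : int = if a < b then a else b
    let minPair (struct(a1, b1)) (struct(a2, b2)) =
        struct(min a1 a2, min b1 b2)
    let sw = Stopwatch.StartNew()
    let mutable x = struct(0, 0)
    for i in 1 .. 100000000 do
        x <- minPair x (struct(i, i))
    sw.Stop()
    printfn "%A in %d ms" x sw.ElapsedMilliseconds
    0
```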
If I decompile the CIL to C# I get this code:
public static int MinInt(int a, int b)
{
    if (a < b)
    {
        return a;
    }
    return b;
}

public static System.ValueTuple<int, int> MinPair(System.ValueTuple<int, int> _arg2, System.ValueTuple<int, int> _arg1)
{
    int b = _arg2.Item2;
    int a = _arg2.Item1;
    int b2 = _arg1.Item2;
    int a2 = _arg1.Item1;
    return new System.ValueTuple<int, int>(MinInt(a, a2), MinInt(b, b2));
}

public static void Main(string[] args)
{
    System.ValueTuple<int, int> x = new System.ValueTuple<int, int>(0, 0);
    for (int i = 1; i <= 100000000; i++)
    {
        x = MinPair(x, new System.ValueTuple<int, int>(i, i));
    }
}
Recompiling that with the C# compiler, it takes just 0.3s, which is over 4x faster than the original F#.
I cannot see why one program is much faster than the other. I've even decompiled both versions to CIL and cannot see any obvious reason. Calling the C# Min function from F# gives the same (poor) performance. The CIL of the inner loop of the caller is literally identical.
Can anyone explain this substantial performance difference?
Are you running both examples on the same architecture? I get ~1.4 sec on x64 for both the F# and C# code, and on x86 ~0.6 sec for F# and ~0.3 sec for C#.
As you say, when decompiling the assemblies the code looks awfully similar, but some dissimilarities appear when examining the IL code:
F# - let min (struct(a1, b1)) (struct(a2, b2)) ...
.maxstack 5
.locals init (
[0] int32 b1,
[1] int32 a1,
[2] int32 b2,
[3] int32 a2
)
IL_0000: ldarga.s _arg2
IL_0002: ldfld !1 valuetype [System.ValueTuple]System.ValueTuple`2<int32, int32>::Item2
IL_0007: stloc.0
IL_0008: ldarga.s _arg2
IL_000a: ldfld !0 valuetype [System.ValueTuple]System.ValueTuple`2<int32, int32>::Item1
IL_000f: stloc.1
IL_0010: ldarga.s _arg1
IL_0012: ldfld !1 valuetype [System.ValueTuple]System.ValueTuple`2<int32, int32>::Item2
IL_0017: stloc.2
IL_0018: ldarga.s _arg1
IL_001a: ldfld !0 valuetype [System.ValueTuple]System.ValueTuple`2<int32, int32>::Item1
IL_001f: stloc.3
IL_0020: nop
IL_0021: ldloc.1
IL_0022: ldloc.3
IL_0023: call int32 Program::min@8(int32, int32)
IL_0028: ldloc.0
IL_0029: ldloc.2
IL_002a: call int32 Program::min@8(int32, int32)
IL_002f: newobj instance void valuetype [System.ValueTuple]System.ValueTuple`2<int32, int32>::.ctor(!0, !1)
IL_0034: ret
C# - MinPair
.maxstack 3
.locals init (
[0] int32 b,
[1] int32 b2,
[2] int32 a2
)
IL_0000: ldarg.0
IL_0001: ldfld !1 valuetype [System.ValueTuple]System.ValueTuple`2<int32, int32>::Item2
IL_0006: stloc.0
IL_0007: ldarg.0
IL_0008: ldfld !0 valuetype [System.ValueTuple]System.ValueTuple`2<int32, int32>::Item1
IL_000d: ldarg.1
IL_000e: ldfld !1 valuetype [System.ValueTuple]System.ValueTuple`2<int32, int32>::Item2
IL_0013: stloc.1
IL_0014: ldarg.1
IL_0015: ldfld !0 valuetype [System.ValueTuple]System.ValueTuple`2<int32, int32>::Item1
IL_001a: stloc.2
IL_001b: ldloc.2
IL_001c: call int32 PerfItCs.Program::MinInt(int32, int32)
IL_0021: ldloc.0
IL_0022: ldloc.1
IL_0023: call int32 PerfItCs.Program::MinInt(int32, int32)
IL_0028: newobj instance void valuetype [System.ValueTuple]System.ValueTuple`2<int32, int32>::.ctor(!0, !1)
IL_002d: ret
The difference here is that the C# compiler avoids introducing some local variables by pushing the intermediate results onto the stack. As local variables are allocated on the stack anyway, it's hard to see why this should lead to more efficient code.
The other functions are very similar.
Disassembling the x86 code yields this:
F# - the loop
; F#
; struct (i, i)
01690a7e 8bce mov ecx,esi
01690a80 8bd6 mov edx,esi
; Loads x (pair) onto stack
01690a82 8d45f0 lea eax,[ebp-10h]
01690a85 83ec08 sub esp,8
01690a88 f30f7e00 movq xmm0,mmword ptr [eax]
01690a8c 660fd60424 movq mmword ptr [esp],xmm0
; Push new tuple on stack
01690a91 52 push edx
01690a92 51 push ecx
; Loads pointer to x into ecx (result will be written here)
01690a93 8d4df0 lea ecx,[ebp-10h]
; Call min
01690a96 ff15744dfe00 call dword ptr ds:[0FE4D74h]
; Increase i
01690a9c 46 inc esi
01690a9d 81fe01e1f505 cmp esi,offset FSharp_Core_ni+0x6be101 (05f5e101)
; Reached the end?
01690aa3 7cd9 jl 01690a7e
C# - the loop
; C#
; Loads x (pair) into ecx, eax
02c2057b 8d55ec lea edx,[ebp-14h]
02c2057e 8b0a mov ecx,dword ptr [edx]
02c20580 8b4204 mov eax,dword ptr [edx+4]
; new System.ValueTuple<int, int>(i, i)
02c20583 8bfe mov edi,esi
02c20585 8bd6 mov edx,esi
; Push x on stack
02c20587 50 push eax
02c20588 51 push ecx
; Push new tuple on stack
02c20589 52 push edx
02c2058a 57 push edi
; Loads pointer to x into ecx (result will be written here)
02c2058b 8d4dec lea ecx,[ebp-14h]
; Call MinPair
02c2058e ff15104d2401 call dword ptr ds:[1244D10h]
; Increase i
02c20594 46 inc esi
; Reached the end?
02c20595 81fe00e1f505 cmp esi,5F5E100h
02c2059b 7ede jle 02c2057b
It's hard to fathom why the F# code should perform significantly worse here. The code looks roughly equivalent, with the exception of how x is loaded onto the stack. Until someone comes up with a good explanation, I am going to speculate that it's because movq has worse latency than push, and since all instructions manipulate the stack, the CPU can't reorder them to mitigate the latency of movq.
Why the jitter chose movq for the F# code and not for the C# code, I currently don't know.
For x64 the performance seems to worsen because of more overhead in the method preludes and more stalling because of aliasing. This is mainly speculation on my part, but it's hard to see from the assembly code what, except stalling, could lower the performance of x64 by a factor of 4x.
By marking min as inline, both the x64 and x86 versions run in ~0.15 sec. Not surprising, as that eliminates all overhead from the method preludes and a lot of reading and writing to the stack.
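For concreteness, the inlined version might look like this (a sketch; minInt and minPair are my names for the two helpers):

```fsharp
// Sketch: marking both helpers inline so the F# compiler expands them
// at each call site, eliminating the call and prelude overhead entirely.
let inline minInt a b : int = if a < b then a else b
let inline minPair (struct(a1, b1)) (struct(a2, b2)) =
    struct(minInt a1 a2, minInt b1 b2)

[<EntryPoint>]
let main _ =
    let mutable x = struct(0, 0)
    for i in 1 .. 100000000 do
        x <- minPair x (struct(i, i))
    0
```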
Marking F# methods for aggressive inlining (with [MethodImpl (MethodImplOptions.AggressiveInlining)]) doesn't work, as the F# compiler removes all such attributes, meaning the jitter never sees them; but marking the C# methods for aggressive inlining makes the C# code run in ~0.15 sec.
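The attempted annotation would look something like this (a sketch; per the paragraph above, the F# compiler of that era stripped the attribute before the jitter could see it, so it had no effect):

```fsharp
open System.Runtime.CompilerServices

// Sketch: asking the jitter to inline this method aggressively.
// As noted in the text, the F# compiler removed this attribute,
// so the request never reached the jitter.
[<MethodImpl(MethodImplOptions.AggressiveInlining)>]
let minPair (struct(a1, b1)) (struct(a2, b2)) =
    struct(min a1 a2, min b1 b2)
```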
So in the end, the x86 jitter chose for some reason to jit the code differently, even though the IL code looks very similar. Possibly the attributes on the methods affect the jitter, as they are a bit different.
The x64 jitter probably could do a better job of pushing the parameters onto the stack in a more efficient manner. I guess using push, as the x86 jitter does, is preferable to mov, as the semantics of push are more restricted, but that is just speculation on my part.
In cases like this, when the methods are cheap, marking them as inline can be good.
To be honest, I am not sure this helps the OP, but hopefully it was somewhat interesting.
PS. I ran the code on .NET 4.6.2 on an i5 3570K.