Why is it that, in C#, referencing a variable from a function argument is faster than from a private field?

It may be the case that my hardware is the culprit, but during testing, I've found that:

void SomeFunction(AType ofThing) {
    DoSomething(ofThing);
}

...is faster than:

private AType _ofThing;
void SomeFunction() {
    DoSomething(_ofThing);
}

I believe it has to do with how the compiler translates this to CIL. Could anyone please explain, specifically, why this happens?

Here's some code where it happens:

public void TestMethod1()
{

    var stopwatch = new Stopwatch();

    var r = new int[] { 1, 2, 3, 4, 5 };
    var i = 0;
    stopwatch.Start();
    while (i < 1000000)
    {
        DoSomething(r);
        i++;
    }

    stopwatch.Stop();

    Console.WriteLine(stopwatch.ElapsedMilliseconds);

    i = 0;
    stopwatch.Restart();
    while (i < 1000000)
    {
        DoSomething();
        i++;
    }

    stopwatch.Stop();

    Console.WriteLine(stopwatch.ElapsedMilliseconds);
}

private void DoSomething(int[] arg1)
{
    var r = arg1[0] * arg1[1] * arg1[2] * arg1[3] * arg1[4];
}

private int[] _arg1 = new [] { 1, 2, 3, 4, 5 };
private void DoSomething()
{
    var r = _arg1[0] * _arg1[1] * _arg1[2] * _arg1[3] * _arg1[4];
}

In my case it is about 2.5x slower to use the private field.

asked by AStackOverflowUser

2 Answers

I believe it has to do with how the compiler translates this to CIL.

Not really. Performance doesn't directly depend on the CIL code, because that's not what's actually executed. What's executed is the JITed native code, so you should look at that when you're interested in performance.

So, let's look at the code generated for the DoSomething(int[]) loop:

mov         eax,dword ptr [ebx+4] ; get the length of the array
cmp         eax,0       ; if it's 0
jbe         0000018C    ; jump to code that throws IndexOutOfRangeException
cmp         eax,1       ; if it's 1, etc.
jbe         0000018C 
cmp         eax,2 
jbe         0000018C 
cmp         eax,3 
jbe         0000018C 
cmp         eax,4 
jbe         0000018C 
inc         esi         ; i++
cmp         esi,0F4240h ; if i < 1000000
jl          000000B7    ; loop again

What's interesting about this code is that no useful work is done at all: most of it is array bounds checking. (Why the code hasn't been optimized to perform this checking only once, before the loop, I have no idea.)

Also notice that the code is inlined, so you're not paying the cost of a function call.

This code takes around 1.7 ms on my computer.
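
As an aside, those per-access checks aren't inevitable. When the index provably stays below the array's length, as in the canonical for loop over arg1.Length, the JIT can elide the bounds check entirely. A minimal sketch (the method name here is illustrative, not from the code above):

private static int Product(int[] arg1)
{
    // With the loop bound tied to arg1.Length, the JIT can prove the index is
    // in range and elide the bounds check on arg1[j]. With five separate
    // constant indices, as above, it emits one check per access instead.
    var r = 1;
    for (var j = 0; j < arg1.Length; j++)
    {
        r *= arg1[j];
    }
    return r;
}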

So, what does the loop for DoSomething() look like?

mov         ecx,dword ptr [ebp-10h]  ; access this
call        dword ptr ds:[001637F4h] ; call DoSomething()
inc         esi                      ; i++
cmp         esi,0F4240h              ; if i < 1000000
jl          00000120                 ; loop again

Okay, so this actually calls the method; no inlining this time. What does the method itself look like?

mov         eax,dword ptr [ecx+4] ; access this._arg1
cmp         dword ptr [eax+4],0   ; if its length is 0
jbe         00000022 ; jump to code that throws IndexOutOfRangeException
cmp         dword ptr [eax+4],1   ; etc.
jbe         00000022 
cmp         dword ptr [eax+4],2 
jbe         00000022 
cmp         dword ptr [eax+4],3 
jbe         00000022 
cmp         dword ptr [eax+4],4 
jbe         00000022 
ret                               ; bounds checks successful, return

Comparing with the previous version (and ignoring the overhead of the function call for now), this does three different memory accesses instead of just one, which could explain some of the performance difference. (I think the five accesses to eax+4 should be counted only as one, because otherwise the compiler would optimize them.)

This code runs in about 3.0 ms for me.
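
One way to cut down those extra dereferences (a sketch, not something measured here) is to read the field into a local once, so the array reference can stay in a register for the rest of the method:

private void DoSomething()
{
    var a = _arg1;  // single field read; the JIT can keep 'a' in a register
    var r = a[0] * a[1] * a[2] * a[3] * a[4];
}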

How much overhead does the method call take? We can check that by adding [MethodImpl(MethodImplOptions.NoInlining)] to the previously inlined DoSomething(int[]). The assembly now looks like this:

mov         ecx,dword ptr [ebp-10h]  ; access this
mov         edx,dword ptr [ebp-14h]  ; access r
call        dword ptr ds:[002937E8h] ; call DoSomething(int[])
inc         esi                      ; i++
cmp         esi,0F4240h              ; if i < 1000000
jl          000000A0                 ; loop again

Notice that r is no longer kept in a register; it's now on the stack, which adds another slowdown.

Now DoSomething(int[]):

push        ebp                   ; save ebp from caller to stack
mov         ebp,esp               ; write our own ebp
mov         eax,dword ptr [edx+4] ; read the length of the array
cmp         eax,0    ; if it's 0
jbe         00000021 ; jump to code that throws IndexOutOfRangeException
cmp         eax,1    ; etc.
jbe         00000021 
cmp         eax,2 
jbe         00000021 
cmp         eax,3 
jbe         00000021 
cmp         eax,4 
jbe         00000021 
pop         ebp      ; restore ebp
ret                  ; return

This code runs in about 3.2 ms for me. That's even slower than DoSomething(). What's going on?

It turns out that [MethodImpl(MethodImplOptions.NoInlining)] seems to cause those unnecessary ebp instructions; if I add that attribute to DoSomething() as well, it runs in 3.3 ms.

This means the difference between stack access and heap access is pretty small (but still measurable). The fact that the array pointer could be kept in a register when the method was inlined was probably more significant.
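
For reference, the attribute used in these experiments is applied directly to the method; a minimal sketch, assuming a using System.Runtime.CompilerServices; directive:

[MethodImpl(MethodImplOptions.NoInlining)]  // tells the JIT not to inline this method
private void DoSomething()
{
    var r = _arg1[0] * _arg1[1] * _arg1[2] * _arg1[3] * _arg1[4];
}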


So, the conclusion is that the big difference you're seeing comes from inlining. The JIT compiler decided to inline the code for DoSomething(int[]), but not for DoSomething(), which allowed the code for DoSomething(int[]) to be very efficient. The most likely reason is that the IL for DoSomething() is much longer (46 bytes, vs. 21 bytes for DoSomething(int[])).
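
There is also a hint for the opposite direction: MethodImplOptions.AggressiveInlining asks the JIT to relax its size heuristic. A sketch (assuming .NET 4.5 or later; it is only a request, and whether it actually helps would need to be measured):

[MethodImpl(MethodImplOptions.AggressiveInlining)]  // a hint, not a guarantee
private void DoSomething()
{
    var r = _arg1[0] * _arg1[1] * _arg1[2] * _arg1[3] * _arg1[4];
}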

Also, you're not really measuring what you wrote (the array accesses and multiplications), because the unused result could be optimized out. So be careful when devising microbenchmarks: make sure the compiler can't simply ignore the code you actually want to measure.
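
A sketch of one way to do that (the method and variable names here are illustrative): return the result and accumulate it into a value the program actually prints, so the JIT can't treat the work as dead code.

private int DoSomethingReturning(int[] arg1)
{
    return arg1[0] * arg1[1] * arg1[2] * arg1[3] * arg1[4];
}

public void TestMethod2()
{
    var r = new[] { 1, 2, 3, 4, 5 };
    long sink = 0;

    var stopwatch = Stopwatch.StartNew();
    for (var i = 0; i < 1000000; i++)
    {
        sink += DoSomethingReturning(r);  // result is consumed, so it can't be discarded
    }
    stopwatch.Stop();

    Console.WriteLine(sink);              // printing the sink keeps the work observable
    Console.WriteLine(stopwatch.ElapsedMilliseconds);
}

A dedicated harness such as BenchmarkDotNet also guards against this and similar pitfalls for you.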

answered by svick


Several people have made a stack/heap distinction, but this is a false dichotomy; when the IL is compiled to machine code there are additional possibilities, such as passing arguments in registers, which is potentially even faster than getting them off of the stack. See Eric Lippert's great blog post The Truth About Value Types for more thoughts along these lines. In any case, a proper analysis of the performance difference will almost certainly require looking at the generated machine code, not at the IL, and will potentially depend on the version of the JIT compiler, etc.

answered by kvb