
Understanding a specific CIL / CLR optimization

EDIT: I have added the ASM at the end.

I believe the best way to learn how to write good code on a platform is to experiment with it and thereby come to understand it. This question therefore seeks a better understanding of the CLR; it is not an attempt at nano-optimization.

That said, it occurred to me that fusing the two operations of setting and evaluating a variable might be faster. As it turns out, it is: in the code below, the second loop executes in about 60% of the time of the first loop:

private sealed class Temp
{
    public int val;
}

private void button13_Click(object sender, EventArgs e)
{
    Temp t = new Temp();
    Temp t1;

    int T1 = Environment.TickCount;

    for (int i = 0; i < 1000000000; i++)
    {
        t1 = t;

        if (t1.val++ == 1000)
        {
            t1.val = 0;
        }
    }

    int T2 = Environment.TickCount;

    for (int i = 0; i < 1000000000; i++)
    {
        if ((t1 = t).val++ == 1000)
        {
            t1.val = 0;
        }
    }

    int T3 = Environment.TickCount;

    MessageBox.Show((T2 - T1).ToString() + Environment.NewLine + 
       (T3 - T2).ToString() + Environment.NewLine + 
       t.val.ToString());
}

In most cases like this, the compiler duplicates the assigned value on the evaluation stack when it emits the CIL, so the usual store followed by a fetch is not needed. This would account for the apparently significant speed increase.

However, the decompiled C# and the IL for this particular piece of code do not do this; they add overhead instead. Yet it is almost twice as fast.

EDIT2: I physically swapped the two loops and discovered that whichever loop comes second is always about twice as fast. Why? I then added a "warm-up" loop, which resulted in the first loop being about twice as fast. The code is basically the same (ASM-wise). What is happening behind the scenes?

{
    Temp t1;
    Temp t = new Temp();
    int T1 = Environment.TickCount;
    for (int i = 0; i < 0x3b9aca00; i++)
    {
        t1 = t;
        if (t1.val++ == 0x3e8)
        {
            t1.val = 0;
        }
    }
    int T2 = Environment.TickCount;
    for (int i = 0; i < 0x3b9aca00; i++)
    {
        Temp temp1 = t1 = t;
        if (temp1.val++ == 0x3e8)
        {
            t1.val = 0;
        }
    }
    int T3 = Environment.TickCount;
    string[] CS$0$0002 = new string[] { (T2 - T1).ToString(), Environment.NewLine, (T3 - T2).ToString(), Environment.NewLine, t.val.ToString() };
    MessageBox.Show(string.Concat(CS$0$0002));
}

EDIT: Compiled as 64-bit, .NET 4, Release mode.

L_0000: newobj instance void DIRECT_UI.Form1/Temp::.ctor()
L_0005: stloc.0 
L_0006: call int32 [mscorlib]System.Environment::get_TickCount()
L_000b: stloc.2 
L_000c: ldc.i4.0 
L_000d: stloc.3 
L_000e: br.s L_0037
L_0010: ldloc.0 
L_0011: stloc.1 
L_0012: ldloc.1 
L_0013: dup 
L_0014: ldfld int32 DIRECT_UI.Form1/Temp::val
L_0019: dup 
L_001a: stloc.s CS$0$0000
L_001c: ldc.i4.1 
L_001d: add 
L_001e: stfld int32 DIRECT_UI.Form1/Temp::val
L_0023: ldloc.s CS$0$0000
L_0025: ldc.i4 0x3e8
L_002a: bne.un.s L_0033
L_002c: ldloc.1 
L_002d: ldc.i4.0 
L_002e: stfld int32 DIRECT_UI.Form1/Temp::val
L_0033: ldloc.3 
L_0034: ldc.i4.1 
L_0035: add 
L_0036: stloc.3 
L_0037: ldloc.3 
L_0038: ldc.i4 0x3b9aca00
L_003d: blt.s L_0010
L_003f: call int32 [mscorlib]System.Environment::get_TickCount()
L_0044: stloc.s T2
L_0046: ldc.i4.0 
L_0047: stloc.s V_5
L_0049: br.s L_0074
L_004b: ldloc.0 
L_004c: dup 
L_004d: stloc.1 
L_004e: dup 
L_004f: ldfld int32 DIRECT_UI.Form1/Temp::val
L_0054: dup 
L_0055: stloc.s CS$0$0001
L_0057: ldc.i4.1 
L_0058: add 
L_0059: stfld int32 DIRECT_UI.Form1/Temp::val
L_005e: ldloc.s CS$0$0001
L_0060: ldc.i4 0x3e8
L_0065: bne.un.s L_006e
L_0067: ldloc.1 
L_0068: ldc.i4.0 
L_0069: stfld int32 DIRECT_UI.Form1/Temp::val
L_006e: ldloc.s V_5
L_0070: ldc.i4.1 
L_0071: add 
L_0072: stloc.s V_5
L_0074: ldloc.s V_5
L_0076: ldc.i4 0x3b9aca00
L_007b: blt.s L_004b
L_007d: call int32 [mscorlib]System.Environment::get_TickCount()
L_0082: stloc.s T3
L_0084: ldc.i4.5 
L_0085: newarr string
L_008a: stloc.s CS$0$0002
L_008c: ldloc.s CS$0$0002
L_008e: ldc.i4.0 
L_008f: ldloc.s T2
L_0091: ldloc.2 
L_0092: sub 
L_0093: stloc.s CS$0$0003
L_0095: ldloca.s CS$0$0003
L_0097: call instance string [mscorlib]System.Int32::ToString()
L_009c: stelem.ref 
L_009d: ldloc.s CS$0$0002
L_009f: ldc.i4.1 
L_00a0: call string [mscorlib]System.Environment::get_NewLine()
L_00a5: stelem.ref 
L_00a6: ldloc.s CS$0$0002
L_00a8: ldc.i4.2 
L_00a9: ldloc.s T3
L_00ab: ldloc.s T2
L_00ad: sub 
L_00ae: stloc.s CS$0$0004
L_00b0: ldloca.s CS$0$0004
L_00b2: call instance string [mscorlib]System.Int32::ToString()
L_00b7: stelem.ref 
L_00b8: ldloc.s CS$0$0002
L_00ba: ldc.i4.3 
L_00bb: call string [mscorlib]System.Environment::get_NewLine()
L_00c0: stelem.ref 
L_00c1: ldloc.s CS$0$0002
L_00c3: ldc.i4.4 
L_00c4: ldloc.0 
L_00c5: ldflda int32 DIRECT_UI.Form1/Temp::val
L_00ca: call instance string [mscorlib]System.Int32::ToString()
L_00cf: stelem.ref 
L_00d0: ldloc.s CS$0$0002
L_00d2: call string [mscorlib]System.String::Concat(string[])
L_00d7: call valuetype [System.Windows.Forms]System.Windows.Forms.DialogResult [System.Windows.Forms]System.Windows.Forms.MessageBox::Show(string)
L_00dc: pop 
L_00dd: ret 

This doesn't make sense to me. It looks like reverse optimization, but runs faster. Can anyone shed some light on this?

ASM:

                t1 = t;
000000ac  mov         rax,qword ptr [rsp+20h] 
000000b1  mov         qword ptr [rsp+28h],rax 

                if (t1.val++ == 1000)
000000b6  mov         rax,qword ptr [rsp+28h] 
000000bb  mov         eax,dword ptr [rax+8] 
000000be  mov         dword ptr [rsp+74h],eax 
000000c2  mov         eax,dword ptr [rsp+74h] 
000000c6  mov         dword ptr [rsp+44h],eax 
000000ca  mov         ecx,dword ptr [rsp+74h] 
000000ce  inc         ecx 
000000d0  mov         rax,qword ptr [rsp+28h] 
000000d5  mov         dword ptr [rax+8],ecx 
000000d8  cmp         dword ptr [rsp+44h],3E8h 
000000e0  jne         00000000000000EE

                if ((t1 = t).val++ == 1000)
0000011d  mov         rax,qword ptr [rsp+20h] 
00000122  mov         qword ptr [rsp+28h],rax 
00000127  mov         rax,qword ptr [rsp+20h] 
0000012c  mov         eax,dword ptr [rax+8] 
0000012f  mov         dword ptr [rsp+7Ch],eax 
00000133  mov         eax,dword ptr [rsp+7Ch] 
00000137  mov         dword ptr [rsp+48h],eax 
0000013b  mov         ecx,dword ptr [rsp+7Ch] 
0000013f  inc         ecx 
00000141  mov         rax,qword ptr [rsp+20h] 
00000146  mov         dword ptr [rax+8],ecx 
00000149  cmp         dword ptr [rsp+48h],3E8h 
00000151  jne         000000000000015F
IamIC asked Sep 11 '25 15:09


1 Answer

The generated IL has only an indirect impact on code efficiency; what matters is the machine code the JIT compiler produces from it. To look at that code in Visual Studio, go to Tools + Options, Debugging, General and untick the "Suppress JIT optimization on module load" option. This enables the JIT optimizer even when you debug the program. Make sure you have the Release configuration selected.

Set a breakpoint on button13_Click. Run and click the button. Right-click in the source code editor window and select "Go To Assembly".

Note how both loops generate the exact same machine code, for both the x86 and the x64 jitter. This is the way it should be, of course: code that performs the same logical operation should produce the same machine code. All is well here.

This doesn't necessarily mean it will run at the exact same speed, although it often does. Code alignment is critical.
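As a rough illustration of how the two loop forms could be timed with less interference from loop order and JIT start-up, here is a minimal sketch. It is not the code from the question: the class name LoopTiming, the helper TimeMs, the split into two methods and the reduced iteration count are all assumptions made for this example, and moving the loops into separate methods changes code placement, so the absolute numbers will differ from the original test.

```csharp
using System;
using System.Diagnostics;

// Hypothetical harness (names are illustrative, not from the question).
public static class LoopTiming
{
    private sealed class Temp
    {
        public int val;
    }

    // First form from the question: assign, then test.
    public static int SeparateAssign(int iterations)
    {
        Temp t = new Temp();
        Temp t1;
        for (int i = 0; i < iterations; i++)
        {
            t1 = t;
            if (t1.val++ == 1000)
            {
                t1.val = 0;
            }
        }
        return t.val; // returned so the loop has an observable result
    }

    // Second form: the assignment is fused into the condition.
    public static int FusedAssign(int iterations)
    {
        Temp t = new Temp();
        Temp t1;
        for (int i = 0; i < iterations; i++)
        {
            if ((t1 = t).val++ == 1000)
            {
                t1.val = 0;
            }
        }
        return t.val;
    }

    // Times one call with Stopwatch, which has a finer resolution
    // than Environment.TickCount.
    public static long TimeMs(Func<int, int> body, int iterations)
    {
        Stopwatch sw = Stopwatch.StartNew();
        body(iterations);
        sw.Stop();
        return sw.ElapsedMilliseconds;
    }

    public static void Main()
    {
        // Assumption: fewer iterations than the question's 1000000000,
        // only to keep the run short.
        const int N = 100000000;

        // Warm-up calls so that JIT compilation cost is not part of the timings.
        SeparateAssign(1000);
        FusedAssign(1000);

        // Several rounds, so a one-off effect on the first measurement stands out.
        for (int round = 0; round < 3; round++)
        {
            Console.WriteLine("separate: {0} ms", TimeMs(SeparateAssign, N));
            Console.WriteLine("fused:    {0} ms", TimeMs(FusedAssign, N));
        }
    }
}
```

If the two forms really compile to the same machine code, the timings in the later rounds should be close to each other; a gap that follows position rather than loop form points at measurement or alignment effects rather than at the IL.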

Hans Passant answered Sep 13 '25 04:09