Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why does JIT order affect performance?

Why does the order in which C# methods in .NET 4.0 are just-in-time compiled affect how quickly they execute? For example, consider two equivalent methods:

public static void SingleLineTest() {     Stopwatch stopwatch = new Stopwatch();     stopwatch.Start();     int count = 0;     for (uint i = 0; i < 1000000000; ++i) {         count += i % 16 == 0 ? 1 : 0;     }     stopwatch.Stop();     Console.WriteLine("Single-line test --> Count: {0}, Time: {1}", count, stopwatch.ElapsedMilliseconds); }  public static void MultiLineTest() {     Stopwatch stopwatch = new Stopwatch();     stopwatch.Start();     int count = 0;     for (uint i = 0; i < 1000000000; ++i) {         var isMultipleOf16 = i % 16 == 0;         count += isMultipleOf16 ? 1 : 0;     }     stopwatch.Stop();     Console.WriteLine("Multi-line test  --> Count: {0}, Time: {1}", count, stopwatch.ElapsedMilliseconds); } 

The only difference is the introduction of a local variable, which affects the assembly code generated and the loop performance. Why that is the case is a question in its own right.

Possibly even stranger is that on x86 (but not x64), the order that the methods are invoked has around a 20% impact on performance. Invoke the methods like this...

static void Main() {     SingleLineTest();     MultiLineTest(); } 

...and SingleLineTest is faster. (Compile using the x86 Release configuration, ensuring that "Optimize code" setting is enabled, and run the test from outside VS2010.) But reverse the order...

static void Main() {     MultiLineTest();     SingleLineTest(); } 

...and both methods take the same time (almost, but not quite, as long as MultiLineTest before). (When running this test, it's useful to add some additional calls to SingleLineTest and MultiLineTest to get additional samples. How many and what order doesn't matter, except for which method is called first.)

Finally, to demonstrate that JIT order is important, leave MultiLineTest first, but force SingleLineTest to be JITed first...

static void Main() {     RuntimeHelpers.PrepareMethod(typeof(Program).GetMethod("SingleLineTest").MethodHandle);     MultiLineTest();     SingleLineTest(); } 

Now, SingleLineTest is faster again.

If you turn off "Suppress JIT optimization on module load" in VS2010, you can put a breakpoint in SingleLineTest and see that the assembly code in the loop is the same regardless of JIT order; however, the assembly code at the beginning of the method varies. But how this matters when the bulk of the time is spent in the loop is perplexing.

A sample project demonstrating this behavior is on github.

It's not clear how this behavior affects real-world applications. One concern is that it can make performance tuning volatile, depending on the order methods happen to be first called. Problems of this sort would be difficult to detect with a profiler. Once you found the hotspots and optimized their algorithms, it would be hard to know without a lot of guess and check whether additional speedup is possible by JITing methods early.

Update: See also the Microsoft Connect entry for this issue.

like image 546
Edward Brey Avatar asked May 02 '12 02:05

Edward Brey


People also ask

What is the main problem with JIT production strategy?

Because JIT production is based entirely on existing orders, it is not the most efficient system for dealing with the unexpected. A company that uses this strategy may be ill-equipped to handle a sudden surge in demand for a product.

How does JIT influence efficiency?

Reduction in storage and waiting time Relatedly, the JIT system encourages faster setups of production machinery. Producing small lots of product only as required demands a faster set up of machinery, resulting in more efficiency and reduced costs. The result is higher quality raw materials and finished products.

What is the disadvantages of JIT purchasing?

Risk of Running Out of Stock - With JIT manufacturing, you do not carry as much stock. This is because you base your stock off of demand forecasts, and if those are incorrect, then you will not have the correct amount of stock readily available for your consumers.

How does JIT purchasing affect lead time?

Shorter Production Cycles: JIT shortens manufacturing time, which decreases lead times for customers. Reduce Product Defects: Production mistakes can be spotted faster and corrected, which results in fewer defective products.


1 Answers

Please note that I do not trust the "Suppress JIT optimization on module load" option, I spawn the process without debugging and attach my debugger after the JIT has run.

In the version where single-line runs faster, this is Main:

        SingleLineTest(); 00000000  push        ebp  00000001  mov         ebp,esp  00000003  call        dword ptr ds:[0019380Ch]              MultiLineTest(); 00000009  call        dword ptr ds:[00193818h]              SingleLineTest(); 0000000f  call        dword ptr ds:[0019380Ch]              MultiLineTest(); 00000015  call        dword ptr ds:[00193818h]              SingleLineTest(); 0000001b  call        dword ptr ds:[0019380Ch]              MultiLineTest(); 00000021  call        dword ptr ds:[00193818h]  00000027  pop         ebp          } 00000028  ret  

Note that MultiLineTest has been placed on an 8 byte boundary, and SingleLineTest on a 4 byte boundary.

Here's Main for the version where both run at the same speed:

            MultiLineTest(); 00000000  push        ebp  00000001  mov         ebp,esp  00000003  call        dword ptr ds:[00153818h]               SingleLineTest(); 00000009  call        dword ptr ds:[0015380Ch]              MultiLineTest(); 0000000f  call        dword ptr ds:[00153818h]              SingleLineTest(); 00000015  call        dword ptr ds:[0015380Ch]              MultiLineTest(); 0000001b  call        dword ptr ds:[00153818h]              SingleLineTest(); 00000021  call        dword ptr ds:[0015380Ch]              MultiLineTest(); 00000027  call        dword ptr ds:[00153818h]  0000002d  pop         ebp          } 0000002e  ret  

Amazingly, the addresses chosen by the JIT are identical in the last 4 digits, even though it allegedly processed them in the opposite order. Not sure I believe that any more.

More digging is necessary. I think it was mentioned that the code before the loop wasn't exactly the same in both versions? Going to investigate.

Here's the "slow" version of SingleLineTest (and I checked, the last digits of the function address haven't changed).

            Stopwatch stopwatch = new Stopwatch(); 00000000  push        ebp  00000001  mov         ebp,esp  00000003  push        edi  00000004  push        esi  00000005  push        ebx  00000006  mov         ecx,7A5A2C68h  0000000b  call        FFF91EA0  00000010  mov         esi,eax  00000012  mov         dword ptr [esi+4],0  00000019  mov         dword ptr [esi+8],0  00000020  mov         byte ptr [esi+14h],0  00000024  mov         dword ptr [esi+0Ch],0  0000002b  mov         dword ptr [esi+10h],0              stopwatch.Start(); 00000032  cmp         byte ptr [esi+14h],0  00000036  jne         00000047  00000038  call        7A22B314  0000003d  mov         dword ptr [esi+0Ch],eax  00000040  mov         dword ptr [esi+10h],edx  00000043  mov         byte ptr [esi+14h],1              int count = 0; 00000047  xor         edi,edi              for (uint i = 0; i < 1000000000; ++i) { 00000049  xor         edx,edx                  count += i % 16 == 0 ? 1 : 0; 0000004b  mov         eax,edx  0000004d  and         eax,0Fh  00000050  test        eax,eax  00000052  je          00000058  00000054  xor         eax,eax  00000056  jmp         0000005D  00000058  mov         eax,1  0000005d  add         edi,eax              for (uint i = 0; i < 1000000000; ++i) { 0000005f  inc         edx  00000060  cmp         edx,3B9ACA00h  00000066  jb          0000004B              }             stopwatch.Stop(); 00000068  mov         ecx,esi  0000006a  call        7A23F2C0              Console.WriteLine("Single-line test --> Count: {0}, Time: {1}", count, stopwatch.ElapsedMilliseconds); 0000006f  mov         ecx,797C29B4h  00000074  call        FFF91EA0  00000079  mov         ecx,eax  0000007b  mov         dword ptr [ecx+4],edi  0000007e  mov         ebx,ecx  00000080  mov         ecx,797BA240h  00000085  call        FFF91EA0  0000008a  mov         edi,eax  0000008c  mov         ecx,esi  0000008e  call        7A23ABE8  00000093  push        edx  00000094  push        eax  00000095  push        0  00000097  push        2710h  0000009c  call        783247EC  000000a1  mov         dword ptr [edi+4],eax  000000a4  mov         dword ptr [edi+8],edx  000000a7  mov         esi,edi  000000a9  call        793C6F40  000000ae  push        ebx  000000af  push        esi  000000b0  mov         ecx,eax  000000b2  mov         edx,dword ptr ds:[03392034h]  000000b8  mov         eax,dword ptr [ecx]  000000ba  mov         eax,dword ptr [eax+3Ch]  000000bd  call        dword ptr [eax+1Ch]  000000c0  pop         ebx          } 000000c1  pop         esi  000000c2  pop         edi  000000c3  pop         ebp  000000c4  ret  

And the "fast" version:

            Stopwatch stopwatch = new Stopwatch(); 00000000  push        ebp  00000001  mov         ebp,esp  00000003  push        edi  00000004  push        esi  00000005  push        ebx  00000006  mov         ecx,7A5A2C68h  0000000b  call        FFE11F70  00000010  mov         esi,eax  00000012  mov         ecx,esi  00000014  call        7A1068BC              stopwatch.Start(); 00000019  cmp         byte ptr [esi+14h],0  0000001d  jne         0000002E  0000001f  call        7A12B3E4  00000024  mov         dword ptr [esi+0Ch],eax  00000027  mov         dword ptr [esi+10h],edx  0000002a  mov         byte ptr [esi+14h],1              int count = 0; 0000002e  xor         edi,edi              for (uint i = 0; i < 1000000000; ++i) { 00000030  xor         edx,edx                  count += i % 16 == 0 ? 1 : 0; 00000032  mov         eax,edx  00000034  and         eax,0Fh  00000037  test        eax,eax  00000039  je          0000003F  0000003b  xor         eax,eax  0000003d  jmp         00000044  0000003f  mov         eax,1  00000044  add         edi,eax              for (uint i = 0; i < 1000000000; ++i) { 00000046  inc         edx  00000047  cmp         edx,3B9ACA00h  0000004d  jb          00000032              }             stopwatch.Stop(); 0000004f  mov         ecx,esi  00000051  call        7A13F390              Console.WriteLine("Single-line test --> Count: {0}, Time: {1}", count, stopwatch.ElapsedMilliseconds); 00000056  mov         ecx,797C29B4h  0000005b  call        FFE11F70  00000060  mov         ecx,eax  00000062  mov         dword ptr [ecx+4],edi  00000065  mov         ebx,ecx  00000067  mov         ecx,797BA240h  0000006c  call        FFE11F70  00000071  mov         edi,eax  00000073  mov         ecx,esi  00000075  call        7A13ACB8  0000007a  push        edx  0000007b  push        eax  0000007c  push        0  0000007e  push        2710h  00000083  call        782248BC  00000088  mov         dword ptr [edi+4],eax  0000008b  mov         dword ptr [edi+8],edx  0000008e  mov         esi,edi  00000090  call        792C7010  00000095  push        ebx  00000096  push        esi  00000097  mov         ecx,eax  00000099  mov         edx,dword ptr ds:[03562030h]  0000009f  mov         eax,dword ptr [ecx]  000000a1  mov         eax,dword ptr [eax+3Ch]  000000a4  call        dword ptr [eax+1Ch]  000000a7  pop         ebx          } 000000a8  pop         esi  000000a9  pop         edi  000000aa  pop         ebp  000000ab  ret  

Just the loops, fast on the left, slow on the right:

00000030  xor         edx,edx                 00000049  xor         edx,edx  00000032  mov         eax,edx                 0000004b  mov         eax,edx  00000034  and         eax,0Fh                 0000004d  and         eax,0Fh  00000037  test        eax,eax                 00000050  test        eax,eax  00000039  je          0000003F                00000052  je          00000058  0000003b  xor         eax,eax                 00000054  xor         eax,eax  0000003d  jmp         00000044                00000056  jmp         0000005D  0000003f  mov         eax,1                   00000058  mov         eax,1  00000044  add         edi,eax                 0000005d  add         edi,eax  00000046  inc         edx                     0000005f  inc         edx  00000047  cmp         edx,3B9ACA00h           00000060  cmp         edx,3B9ACA00h  0000004d  jb          00000032                00000066  jb          0000004B  

The instructions are identical (being relative jumps, the machine code is identical even though the disassembly shows different addresses), but the alignment is different. There are three jumps. the je loading a constant 1 is aligned in the slow version and not in the fast version, but it hardly matters, since that jump is only taken 1/16 of the time. The other two jumps ( jmp after loading a constant zero, and jb repeating the entire loop) are taken millions more times, and are aligned in the "fast" version.

I think this is the smoking gun.

like image 166
Ben Voigt Avatar answered Oct 09 '22 09:10

Ben Voigt