Why does the order in which C# methods in .NET 4.0 are just-in-time compiled affect how quickly they execute? For example, consider two equivalent methods:
public static void SingleLineTest() { Stopwatch stopwatch = new Stopwatch(); stopwatch.Start(); int count = 0; for (uint i = 0; i < 1000000000; ++i) { count += i % 16 == 0 ? 1 : 0; } stopwatch.Stop(); Console.WriteLine("Single-line test --> Count: {0}, Time: {1}", count, stopwatch.ElapsedMilliseconds); } public static void MultiLineTest() { Stopwatch stopwatch = new Stopwatch(); stopwatch.Start(); int count = 0; for (uint i = 0; i < 1000000000; ++i) { var isMultipleOf16 = i % 16 == 0; count += isMultipleOf16 ? 1 : 0; } stopwatch.Stop(); Console.WriteLine("Multi-line test --> Count: {0}, Time: {1}", count, stopwatch.ElapsedMilliseconds); }
The only difference is the introduction of a local variable, which affects the assembly code generated and the loop performance. Why that is the case is a question in its own right.
Possibly even stranger is that on x86 (but not x64), the order that the methods are invoked has around a 20% impact on performance. Invoke the methods like this...
static void Main() { SingleLineTest(); MultiLineTest(); }
...and SingleLineTest
is faster. (Compile using the x86 Release configuration, ensuring that "Optimize code" setting is enabled, and run the test from outside VS2010.) But reverse the order...
static void Main() { MultiLineTest(); SingleLineTest(); }
...and both methods take the same time (almost, but not quite, as long as MultiLineTest
before). (When running this test, it's useful to add some additional calls to SingleLineTest
and MultiLineTest
to get additional samples. How many and what order doesn't matter, except for which method is called first.)
Finally, to demonstrate that JIT order is important, leave MultiLineTest
first, but force SingleLineTest
to be JITed first...
static void Main() { RuntimeHelpers.PrepareMethod(typeof(Program).GetMethod("SingleLineTest").MethodHandle); MultiLineTest(); SingleLineTest(); }
Now, SingleLineTest
is faster again.
If you turn off "Suppress JIT optimization on module load" in VS2010, you can put a breakpoint in SingleLineTest
and see that the assembly code in the loop is the same regardless of JIT order; however, the assembly code at the beginning of the method varies. But how this matters when the bulk of the time is spent in the loop is perplexing.
A sample project demonstrating this behavior is on github.
It's not clear how this behavior affects real-world applications. One concern is that it can make performance tuning volatile, depending on the order methods happen to be first called. Problems of this sort would be difficult to detect with a profiler. Once you found the hotspots and optimized their algorithms, it would be hard to know without a lot of guess and check whether additional speedup is possible by JITing methods early.
Update: See also the Microsoft Connect entry for this issue.
Because JIT production is based entirely on existing orders, it is not the most efficient system for dealing with the unexpected. A company that uses this strategy may be ill-equipped to handle a sudden surge in demand for a product.
Reduction in storage and waiting time Relatedly, the JIT system encourages faster setups of production machinery. Producing small lots of product only as required demands a faster set up of machinery, resulting in more efficiency and reduced costs. The result is higher quality raw materials and finished products.
Risk of Running Out of Stock - With JIT manufacturing, you do not carry as much stock. This is because you base your stock off of demand forecasts, and if those are incorrect, then you will not have the correct amount of stock readily available for your consumers.
Shorter Production Cycles: JIT shortens manufacturing time, which decreases lead times for customers. Reduce Product Defects: Production mistakes can be spotted faster and corrected, which results in fewer defective products.
Please note that I do not trust the "Suppress JIT optimization on module load" option, I spawn the process without debugging and attach my debugger after the JIT has run.
In the version where single-line runs faster, this is Main
:
SingleLineTest(); 00000000 push ebp 00000001 mov ebp,esp 00000003 call dword ptr ds:[0019380Ch] MultiLineTest(); 00000009 call dword ptr ds:[00193818h] SingleLineTest(); 0000000f call dword ptr ds:[0019380Ch] MultiLineTest(); 00000015 call dword ptr ds:[00193818h] SingleLineTest(); 0000001b call dword ptr ds:[0019380Ch] MultiLineTest(); 00000021 call dword ptr ds:[00193818h] 00000027 pop ebp } 00000028 ret
Note that MultiLineTest
has been placed on an 8 byte boundary, and SingleLineTest
on a 4 byte boundary.
Here's Main
for the version where both run at the same speed:
MultiLineTest(); 00000000 push ebp 00000001 mov ebp,esp 00000003 call dword ptr ds:[00153818h] SingleLineTest(); 00000009 call dword ptr ds:[0015380Ch] MultiLineTest(); 0000000f call dword ptr ds:[00153818h] SingleLineTest(); 00000015 call dword ptr ds:[0015380Ch] MultiLineTest(); 0000001b call dword ptr ds:[00153818h] SingleLineTest(); 00000021 call dword ptr ds:[0015380Ch] MultiLineTest(); 00000027 call dword ptr ds:[00153818h] 0000002d pop ebp } 0000002e ret
Amazingly, the addresses chosen by the JIT are identical in the last 4 digits, even though it allegedly processed them in the opposite order. Not sure I believe that any more.
More digging is necessary. I think it was mentioned that the code before the loop wasn't exactly the same in both versions? Going to investigate.
Here's the "slow" version of SingleLineTest
(and I checked, the last digits of the function address haven't changed).
Stopwatch stopwatch = new Stopwatch(); 00000000 push ebp 00000001 mov ebp,esp 00000003 push edi 00000004 push esi 00000005 push ebx 00000006 mov ecx,7A5A2C68h 0000000b call FFF91EA0 00000010 mov esi,eax 00000012 mov dword ptr [esi+4],0 00000019 mov dword ptr [esi+8],0 00000020 mov byte ptr [esi+14h],0 00000024 mov dword ptr [esi+0Ch],0 0000002b mov dword ptr [esi+10h],0 stopwatch.Start(); 00000032 cmp byte ptr [esi+14h],0 00000036 jne 00000047 00000038 call 7A22B314 0000003d mov dword ptr [esi+0Ch],eax 00000040 mov dword ptr [esi+10h],edx 00000043 mov byte ptr [esi+14h],1 int count = 0; 00000047 xor edi,edi for (uint i = 0; i < 1000000000; ++i) { 00000049 xor edx,edx count += i % 16 == 0 ? 1 : 0; 0000004b mov eax,edx 0000004d and eax,0Fh 00000050 test eax,eax 00000052 je 00000058 00000054 xor eax,eax 00000056 jmp 0000005D 00000058 mov eax,1 0000005d add edi,eax for (uint i = 0; i < 1000000000; ++i) { 0000005f inc edx 00000060 cmp edx,3B9ACA00h 00000066 jb 0000004B } stopwatch.Stop(); 00000068 mov ecx,esi 0000006a call 7A23F2C0 Console.WriteLine("Single-line test --> Count: {0}, Time: {1}", count, stopwatch.ElapsedMilliseconds); 0000006f mov ecx,797C29B4h 00000074 call FFF91EA0 00000079 mov ecx,eax 0000007b mov dword ptr [ecx+4],edi 0000007e mov ebx,ecx 00000080 mov ecx,797BA240h 00000085 call FFF91EA0 0000008a mov edi,eax 0000008c mov ecx,esi 0000008e call 7A23ABE8 00000093 push edx 00000094 push eax 00000095 push 0 00000097 push 2710h 0000009c call 783247EC 000000a1 mov dword ptr [edi+4],eax 000000a4 mov dword ptr [edi+8],edx 000000a7 mov esi,edi 000000a9 call 793C6F40 000000ae push ebx 000000af push esi 000000b0 mov ecx,eax 000000b2 mov edx,dword ptr ds:[03392034h] 000000b8 mov eax,dword ptr [ecx] 000000ba mov eax,dword ptr [eax+3Ch] 000000bd call dword ptr [eax+1Ch] 000000c0 pop ebx } 000000c1 pop esi 000000c2 pop edi 000000c3 pop ebp 000000c4 ret
And the "fast" version:
Stopwatch stopwatch = new Stopwatch(); 00000000 push ebp 00000001 mov ebp,esp 00000003 push edi 00000004 push esi 00000005 push ebx 00000006 mov ecx,7A5A2C68h 0000000b call FFE11F70 00000010 mov esi,eax 00000012 mov ecx,esi 00000014 call 7A1068BC stopwatch.Start(); 00000019 cmp byte ptr [esi+14h],0 0000001d jne 0000002E 0000001f call 7A12B3E4 00000024 mov dword ptr [esi+0Ch],eax 00000027 mov dword ptr [esi+10h],edx 0000002a mov byte ptr [esi+14h],1 int count = 0; 0000002e xor edi,edi for (uint i = 0; i < 1000000000; ++i) { 00000030 xor edx,edx count += i % 16 == 0 ? 1 : 0; 00000032 mov eax,edx 00000034 and eax,0Fh 00000037 test eax,eax 00000039 je 0000003F 0000003b xor eax,eax 0000003d jmp 00000044 0000003f mov eax,1 00000044 add edi,eax for (uint i = 0; i < 1000000000; ++i) { 00000046 inc edx 00000047 cmp edx,3B9ACA00h 0000004d jb 00000032 } stopwatch.Stop(); 0000004f mov ecx,esi 00000051 call 7A13F390 Console.WriteLine("Single-line test --> Count: {0}, Time: {1}", count, stopwatch.ElapsedMilliseconds); 00000056 mov ecx,797C29B4h 0000005b call FFE11F70 00000060 mov ecx,eax 00000062 mov dword ptr [ecx+4],edi 00000065 mov ebx,ecx 00000067 mov ecx,797BA240h 0000006c call FFE11F70 00000071 mov edi,eax 00000073 mov ecx,esi 00000075 call 7A13ACB8 0000007a push edx 0000007b push eax 0000007c push 0 0000007e push 2710h 00000083 call 782248BC 00000088 mov dword ptr [edi+4],eax 0000008b mov dword ptr [edi+8],edx 0000008e mov esi,edi 00000090 call 792C7010 00000095 push ebx 00000096 push esi 00000097 mov ecx,eax 00000099 mov edx,dword ptr ds:[03562030h] 0000009f mov eax,dword ptr [ecx] 000000a1 mov eax,dword ptr [eax+3Ch] 000000a4 call dword ptr [eax+1Ch] 000000a7 pop ebx } 000000a8 pop esi 000000a9 pop edi 000000aa pop ebp 000000ab ret
Just the loops, fast on the left, slow on the right:
00000030 xor edx,edx 00000049 xor edx,edx 00000032 mov eax,edx 0000004b mov eax,edx 00000034 and eax,0Fh 0000004d and eax,0Fh 00000037 test eax,eax 00000050 test eax,eax 00000039 je 0000003F 00000052 je 00000058 0000003b xor eax,eax 00000054 xor eax,eax 0000003d jmp 00000044 00000056 jmp 0000005D 0000003f mov eax,1 00000058 mov eax,1 00000044 add edi,eax 0000005d add edi,eax 00000046 inc edx 0000005f inc edx 00000047 cmp edx,3B9ACA00h 00000060 cmp edx,3B9ACA00h 0000004d jb 00000032 00000066 jb 0000004B
The instructions are identical (being relative jumps, the machine code is identical even though the disassembly shows different addresses), but the alignment is different. There are three jumps. the je
loading a constant 1
is aligned in the slow version and not in the fast version, but it hardly matters, since that jump is only taken 1/16 of the time. The other two jumps ( jmp
after loading a constant zero, and jb
repeating the entire loop) are taken millions more times, and are aligned in the "fast" version.
I think this is the smoking gun.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With