
Why does adding local variables make .NET code slower?

Why does commenting out the first two lines of this for loop and uncommenting the third result in a 42% speedup?

    int count = 0;
    for (uint i = 0; i < 1000000000; ++i) {
        var isMultipleOf16 = i % 16 == 0;
        count += isMultipleOf16 ? 1 : 0;
        //count += i % 16 == 0 ? 1 : 0;
    }

Behind the timing is vastly different assembly code: 13 vs. 7 instructions in the loop. The platform is Windows 7 running .NET 4.0 x64. Code optimization is enabled, and the test app was run outside VS2010. [Update: Repro project, useful for verifying project settings.]

Eliminating the intermediate boolean is a fundamental optimization, one of the simplest in my 1980s-era Dragon Book. How did the optimization not get applied when generating the CIL or JIT-compiling the x64 machine code?

Is there a "Really compiler, I would like you to optimize this code, please" switch? While I sympathize with the sentiment that premature optimization is akin to the love of money, I could see the frustration in trying to profile a complex algorithm that had problems like this scattered throughout its routines. You'd work through the hotspots but have no hint of the broader warm region that could be vastly improved by hand tweaking what we normally take for granted from the compiler. I sure hope I'm missing something here.

Update: Speed differences also occur for x86, but depend on the order that methods are just-in-time compiled. See Why does JIT order affect performance?

Assembly code (as requested):

    var isMultipleOf16 = i % 16 == 0;
    00000037  mov         eax,edx
    00000039  and         eax,0Fh
    0000003c  xor         ecx,ecx
    0000003e  test        eax,eax
    00000040  sete        cl
    count += isMultipleOf16 ? 1 : 0;
    00000043  movzx       eax,cl
    00000046  test        eax,eax
    00000048  jne         0000000000000050
    0000004a  xor         eax,eax
    0000004c  jmp         0000000000000055
    0000004e  xchg        ax,ax
    00000050  mov         eax,1
    00000055  lea         r8d,[rbx+rax]

    count += i % 16 == 0 ? 1 : 0;
    00000037  mov         eax,ecx
    00000039  and         eax,0Fh
    0000003c  je          0000000000000042
    0000003e  xor         eax,eax
    00000040  jmp         0000000000000047
    00000042  mov         eax,1
    00000047  lea         edx,[rbx+rax]
Edward Brey asked Apr 29 '12 03:04



2 Answers

The question should be "Why do I see such a difference on my machine?" I cannot reproduce such a large speed difference and suspect that something specific to your environment is responsible, though it is very difficult to tell what. It could be some (compiler) options that you set a while ago and forgot about.

I created a console application, rebuilt it in Release mode (x86), and ran it outside VS. The results are virtually identical: 1.77 seconds for both methods. Here is the exact code:

    static void Main(string[] args)
    {
        Stopwatch sw = new Stopwatch();
        sw.Start();
        int count = 0;

        for (uint i = 0; i < 1000000000; ++i)
        {
            // 1st method
            var isMultipleOf16 = i % 16 == 0;
            count += isMultipleOf16 ? 1 : 0;

            // 2nd method
            //count += i % 16 == 0 ? 1 : 0;
        }

        sw.Stop();
        Console.WriteLine(string.Format("Ellapsed {0}, count {1}", sw.Elapsed, count));
        Console.ReadKey();
    }

Please, anyone who has 5 minutes: copy the code, rebuild, run it outside VS, and post the results in comments to this answer. I'd like to avoid saying "it works on my machine".

EDIT

To be sure, I created a 64-bit WinForms application, and the results are similar to those in the question: the first method is slower (1.57 sec) than the second one (1.05 sec). The difference I observe is 33%, which is still a lot. It seems there is a bug in the .NET 4 64-bit JIT compiler.
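For anyone who wants to compare the two forms in a single run, a harness along the following lines might help. It is only a sketch and not the code used for the timings above: the class and method names are hypothetical, and moving each loop into its own method may change how the JIT allocates registers, so the numbers need not match those reported here. It assumes .NET 4.0 or later for `Environment.Is64BitProcess` and `Stopwatch.Restart`.

```csharp
using System;
using System.Diagnostics;
using System.Runtime.CompilerServices;

// Hypothetical harness; the names are illustrative and not from the original post.
public static class LocalVariableBenchmark
{
    // First form: the comparison result is stored in a local boolean.
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static int CountWithLocal(uint limit)
    {
        int count = 0;
        for (uint i = 0; i < limit; ++i)
        {
            var isMultipleOf16 = i % 16 == 0;
            count += isMultipleOf16 ? 1 : 0;
        }
        return count;
    }

    // Second form: the comparison is used directly in the conditional expression.
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static int CountInline(uint limit)
    {
        int count = 0;
        for (uint i = 0; i < limit; ++i)
        {
            count += i % 16 == 0 ? 1 : 0;
        }
        return count;
    }

    public static void Main()
    {
        const uint Limit = 1000000000;

        // Record the conditions that the thread suggests matter.
        Console.WriteLine("64-bit process: {0}", Environment.Is64BitProcess);
        Console.WriteLine("Debugger attached: {0}", Debugger.IsAttached);

        // Call each method once so JIT compilation is not part of the timing.
        CountWithLocal(16);
        CountInline(16);

        Stopwatch sw = Stopwatch.StartNew();
        int withLocal = CountWithLocal(Limit);
        sw.Stop();
        Console.WriteLine("With local: {0}, count {1}", sw.Elapsed, withLocal);

        sw.Restart();
        int inline = CountInline(Limit);
        sw.Stop();
        Console.WriteLine("Inline:     {0}, count {1}", sw.Elapsed, inline);
    }
}
```

Build it in Release mode for both x86 and x64 and run it outside Visual Studio. Because the question's update notes that JIT order can affect x86 results, it may also be worth swapping the order of the two warm-up calls.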

Maciej answered Sep 20 '22 22:09



I can't speak to the .NET compiler, or its optimizations, or even WHEN it performs its optimizations.

But in this specific case, if the compiler folded that boolean variable into the actual statement and you then tried to debug this code, the optimized code would not match the code as written. You would not be able to single-step over the isMultipleOf16 assignment and check its value.

That's just one example of where the optimization may well be turned off. There could be others. The optimization may also happen during the load phase of the code, rather than the code generation phase from the CLR.
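One way to check whether debugging support is what is holding the optimizer back is to inspect the running program itself. The sketch below is an illustration rather than anything from the question: the class and method names are hypothetical. It relies on two standard members, `Debugger.IsAttached` and `DebuggableAttribute.IsJITOptimizerDisabled`. Debug builds normally mark the assembly so that the JIT skips optimizations, and launching under the Visual Studio debugger suppresses them by default as well.

```csharp
using System;
using System.Diagnostics;
using System.Reflection;

// Hypothetical helper; the names are illustrative and not from the original post.
public static class OptimizationCheck
{
    // Returns true when the assembly carries a DebuggableAttribute that asks
    // the JIT compiler not to optimize (the usual state of a Debug build).
    public static bool IsJitOptimizerDisabled(Assembly assembly)
    {
        object[] attributes = assembly.GetCustomAttributes(typeof(DebuggableAttribute), false);
        foreach (DebuggableAttribute attribute in attributes)
        {
            if (attribute.IsJITOptimizerDisabled)
            {
                return true;
            }
        }
        return false;
    }

    public static void Main()
    {
        Assembly assembly = typeof(OptimizationCheck).Assembly;
        Console.WriteLine("Debugger attached: {0}", Debugger.IsAttached);
        Console.WriteLine("JIT optimizer disabled by attribute: {0}", IsJitOptimizerDisabled(assembly));
    }
}
```

If both values print `False`, the timing difference is less likely to come from debugging support, which fits the question's statement that optimization was enabled and the test ran outside VS2010.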

The modern runtimes are pretty complicated, especially if you throw in JIT and dynamic optimization over run time. I feel grateful the code does what it says at all sometimes.

Will Hartung answered Sep 19 '22 22:09
