
Why does performance differ when the only difference is in code paths that are not executed?

Test1 below is consistently 10% faster than Test2, although I'm always calling the method with a 0 argument, so the code inside the switch cases (the only difference between the two) is never executed.

Incidentally, after copy-and-pasting the code into a brand-new project and only changing the test function's name to Main, the results are reversed: every time I run that project, Test2 is faster.

So what are the factors that make this slower or faster? And can I deliberately influence performance in .NET to my benefit?

These methods of course do almost nothing, so a 10% performance difference in a test consisting mainly of identical virtual method calls (which, as far as I know, the CLR is not capable of inlining; hats off to Sun, by the way, for implementing that in the JVM) seems huge.

N.B. This is in fact a minimal version of a real program, where multiple nested switch statements result in a huge performance difference of not just 10%, but 100% and more, apparently from the mere existence of code inside nested switch branches that the tests never enter. (So maybe this minimal version omits some other aspects of the real program that are involved, but it does replicate a significant and consistent performance difference.)

Edit: In the real program, what may matter more than the phenomenon in this question is whether the actual case statements in the switch lend themselves to implementation via a branch table, which depends on the gaps between the case values (something I could verify by looking at the generated IL code).
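To illustrate the branch-table point, here is a minimal sketch (hypothetical method names, not from the program above; the exact lowering depends on the compiler version) of the two switch shapes: contiguous case values typically compile to an IL switch opcode, i.e. a jump table, while values with large gaps fall back to a chain of compares.

    static int Dense(int a)
    {
        switch (a)                // case values 1..4 are contiguous:
        {                         // typically lowered to an IL 'switch'
            case 1: return 10;    // opcode, i.e. a branch table
            case 2: return 20;
            case 3: return 30;
            case 4: return 40;
        }
        return 0;
    }

    static int Sparse(int a)
    {
        switch (a)                // case values with large gaps:
        {                         // typically lowered to a chain of
            case 1: return 10;    // compare-and-branch instructions
            case 100: return 20;  // instead of a branch table
            case 10000: return 30;
        }
        return 0;
    }

Running ildasm (or a decompiler such as ILSpy) on the release build shows which form was emitted.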

Test run on

  • .NET 4.5.1
  • Release build, run via Ctrl-F5
  • Intel i7 CPU

The code:

using System;
using System.Diagnostics;

class Test1 : ITest
{
    public int Test(int a)
    {
        switch (a)
        {
            case 1: return a + a + a == 1234 ? 1 : 2;
            case 2: return 2;
        }
        return 0;
    }
}

class Test2 : ITest
{
    public int Test(int a)
    {
        switch (a)
        {
            case 1: return 1;
            case 2: return 2;
        }
        return 0;
    }
}

class Program
{
    static void Main(string[] args)
    {
        const long iterations = 200000000;
        var test1 = new Test1();
        var test2 = new Test2();

        while (true)
        {
            var sw1 = Stopwatch.StartNew();
            for (long i = 0; i < iterations; i++)
                test1.Test(0);
            sw1.Stop();


            var sw2 = Stopwatch.StartNew();
            for (long i = 0; i < iterations; i++)
                test2.Test(0);
            sw2.Stop();

            var iterPerUsec1 = iterations / sw1.Elapsed.TotalMilliseconds / 1000;
            var iterPerUsec2 = iterations / sw2.Elapsed.TotalMilliseconds / 1000;
            Console.WriteLine("iterations per usec: " + (int) iterPerUsec1 + " / " + (int) iterPerUsec2 + " ratio: " + iterPerUsec1/iterPerUsec2);
        }
    }
}

interface ITest
{
    int Test(int a);
}

Here's the output of a typical run, in which Test1 is on average more than 12% faster:

iterations per usec: 369 / 342 ratio: 1.07656329512607
iterations per usec: 367 / 314 ratio: 1.16820632522335
iterations per usec: 372 / 337 ratio: 1.10255744679504
iterations per usec: 374 / 342 ratio: 1.09248387354978
iterations per usec: 367 / 329 ratio: 1.11451205881061
iterations per usec: 375 / 340 ratio: 1.10041698470293
iterations per usec: 373 / 314 ratio: 1.19033461920118
iterations per usec: 366 / 334 ratio: 1.09808424282708
iterations per usec: 372 / 314 ratio: 1.18497411681768
iterations per usec: 377 / 342 ratio: 1.10482425370152
iterations per usec: 380 / 346 ratio: 1.09794853154766
iterations per usec: 385 / 342 ratio: 1.12737583603649
iterations per usec: 376 / 327 ratio: 1.15024393718844
iterations per usec: 374 / 332 ratio: 1.12400483908544
iterations per usec: 383 / 341 ratio: 1.12106159857722
iterations per usec: 380 / 345 ratio: 1.10267634674555
iterations per usec: 375 / 344 ratio: 1.09211401775982
iterations per usec: 384 / 334 ratio: 1.14958454236246
iterations per usec: 368 / 321 ratio: 1.14575850263002
iterations per usec: 378 / 335 ratio: 1.12732301818235
iterations per usec: 380 / 338 ratio: 1.12375853123099
iterations per usec: 386 / 344 ratio: 1.12213818994067
iterations per usec: 385 / 336 ratio: 1.14346447712043
iterations per usec: 374 / 345 ratio: 1.08448615249764
...
Asked Jul 09 '15 by Evgeniy Berezovsky

2 Answers

Benchmarking is a fine art; it is very difficult to reliably measure very fast code like this. In general, differences of 15% or less are not statistically significant results. I can only comment on the flaw in the posted code, which is a pretty common one. It is typically Heisenbergian: the test itself affects the outcome.

The second for() loop is not optimized as well as the first for() loop. The problem is caused by the optimizer's choice of which local variables to store in CPU registers. It is a problem in particular when you use long in a 32-bit program, since a long burns up two CPU registers. In all likelihood, the commenters who mentioned not being able to reproduce it ran the test with the x64 jitter.
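(A related sketch, not part of the original answer: if register pressure in the 32-bit jitter is indeed the cause, another way to side-step it is an int loop counter, which fits in a single register, instead of a long, which needs a register pair. This assumes the iteration count fits in an int, as 200,000,000 does here.)

    // Same loop as in the question, but with an int counter: on the
    // 32-bit jitter an int fits a single register, whereas a long
    // occupies a register pair and increases register pressure.
    static Stopwatch TimeTest(ITest test, int iterations)
    {
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)  // int, not long
            test.Test(0);
        sw.Stop();
        return sw;
    }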

You take the pressure off the CPU register allocator by moving the tests into separate methods:

static Stopwatch runTest1(Test1 test1, long iterations) {
    var sw1 = Stopwatch.StartNew();
    for (long i = 0; i < iterations; i++)
        test1.Test(0);
    sw1.Stop();
    return sw1;
}

static Stopwatch runTest2(Test2 test2, long iterations) {
    var sw2 = Stopwatch.StartNew();
    for (long i = 0; i < iterations; i++)
        test2.Test(0);
    sw2.Stop();
    return sw2;
}

static void Main(string[] args) {
    const long iterations = 200000000;
    var test1 = new Test1();
    var test2 = new Test2();

    while (true) {
        var sw1 = runTest1(test1, iterations);
        var sw2 = runTest2(test2, iterations);
        // etc..
    }
}

And you'll now get what you expect.

This change was cued by looking at the generated machine code: Tools > Options > Debugging > General > untick the Suppress JIT Optimization option, then inspect the optimized machine code with Debug > Windows > Disassembly. That shows the first for() loop:

            for (long i = 0; i < iterations; i++)
00000089  xor         ebx,ebx 
0000008b  xor         esi,esi 
0000008d  mov         ecx,dword ptr [esp+4] 
00000091  xor         edx,edx 
00000093  call        dword ptr ds:[04732844h] 
00000099  add         ebx,1 
0000009c  adc         esi,0 
0000009f  test        esi,esi 
000000a1  jg          000000AD 
000000a3  jl          0000008D 
000000a5  cmp         ebx,0BEBC200h 
000000ab  jb          0000008D 

And the second loop:

            for (long i = 0; i < iterations; i++)
000000f5  mov         dword ptr [esp+2Ch],0 
000000fd  xor         ebx,ebx 
000000ff  mov         ecx,dword ptr [esp] 
00000102  xor         edx,edx 
00000104  call        dword ptr ds:[056E28B8h] 
0000010a  mov         eax,ebx 
0000010c  mov         edx,dword ptr [esp+2Ch]        ; <== here
00000110  add         eax,1 
00000113  adc         edx,0 
00000116  mov         dword ptr [esp+2Ch],edx        ; <== here
0000011a  mov         ebx,eax 
0000011c  cmp         dword ptr [esp+2Ch],0 
00000121  jg          0000012D 
00000123  jl          000000FF 
00000125  cmp         ebx,0BEBC200h 
0000012b  jb          000000FF 

I marked the instructions that make it slower. The first for() loop could store the loop variable in the esi:ebx register pair and keep it there. That didn't work for the second for() loop; the jitter ran out of available CPU registers, and the top 32 bits of the loop variable had to be stored in the stack frame. That's slow.

A change like this is not generic advice and not a universal solution; it just happened to be effective in this specific case. Only looking at the machine code can give you a cue that such a change might be useful, which is about what it takes to hand-tune code. There's probably more to what you are actually trying to optimize; this benchmark is very unlikely to be representative of your actual code.

Answered Nov 07 '22 by Hans Passant

To investigate, I took your code and modified it slightly. My main loop iterates 20 times, and I have removed the part where you print the speed of the two tests. Instead I store the difference in ticks (sw2.ElapsedTicks - sw1.ElapsedTicks) in a preallocated array. This is just to keep the main loop as simple as possible.

At the end of the main loop I average the tick differences. A positive number means that the second test executed slower (used more ticks) than the first test.

I did this test with all permutations of possible test sequences: Test1 followed by Test2, Test2 followed by Test1, Test1 followed by Test1 and Test2 followed by Test2. These are my results with optimizations turned on:

 Tests | Average
-------+---------
 1 2   | 248.747
 2 1   | 313.372
 1 1   | 234.812
 2 2   | 210.533

So no matter how I run the tests, the second test in the loop is always slower than the first.

My conclusion is that, for some reason, the first test in the main loop executes slightly faster. I do not know why, and I do not always get consistent results, but at least this does not seem to be related to the code in Test1 and Test2.
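One way to guard against this ordering effect (a sketch of my own, reusing Test1 and Test2 from the question; the Time helper is a hypothetical name): warm both methods up before measuring, then alternate which test runs first on each pass so that any "first test in the loop is faster" bias averages out.

    static TimeSpan Time(Action body)
    {
        var sw = Stopwatch.StartNew();
        body();
        sw.Stop();
        return sw.Elapsed;
    }

    static void MeasureAlternating(Test1 test1, Test2 test2, long iterations)
    {
        // Warm-up: make sure both methods are jitted before timing anything.
        test1.Test(0);
        test2.Test(0);

        for (int pass = 0; pass < 20; pass++)
        {
            TimeSpan t1, t2;
            if (pass % 2 == 0)
            {
                // Even passes: Test1 first.
                t1 = Time(() => { for (long i = 0; i < iterations; i++) test1.Test(0); });
                t2 = Time(() => { for (long i = 0; i < iterations; i++) test2.Test(0); });
            }
            else
            {
                // Odd passes: Test2 first.
                t2 = Time(() => { for (long i = 0; i < iterations; i++) test2.Test(0); });
                t1 = Time(() => { for (long i = 0; i < iterations; i++) test1.Test(0); });
            }
            Console.WriteLine("tick difference (t2 - t1): " + (t2.Ticks - t1.Ticks));
        }
    }

Note that the lambdas introduce one delegate call per measurement, which is negligible next to 200 million interface calls inside each loop.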

Answered Nov 07 '22 by Martin Liversage