Test1
below is consistently 10% faster than Test2
, although I'm always calling the method with a 0
argument, so the stuff that's inside the switch case - the only difference - is never being executed.
Btw after copy-and-pasting the code into a brand new project, only changing the test function's name to Main
, the results are reversed. Every time I run that project, Test2
is faster.
So what are the factors involved in making this slower and faster? And: Can I influence performance deliberately in .net to my benefit?
These methods of course are doing almost nothing, so a 10% performance difference for a test that involves mainly identical virtual method calls (which, afaik the CLR is not capable of inlining - chapeau to Sun btw for implementing that in the JVM) seems huge.
N.B. This is in fact a minimal version of a real program, where multiple nested switch statements result in a huge performance difference of not just 10%, but 100% and more, apparently by the mere existence of code inside the nested switch branches which the tests never enter. (So maybe this minimal version omits some other aspects of the real program that might be involved, but it does replicate a significant and consistent performance difference)
Edit In the real program, what may matter more than the phenomenon in this question, is whether the actual case statements in the switch statement lend themselves to implementation via branch table or not - (which depends on the gaps in the case values - something I could verify this by looking at the generated IL code)
Test run on
The code:
using System;
using System.Diagnostics;
class Test1 : ITest
{
public int Test(int a)
{
switch (a)
{
case 1: return a + a + a == 1234 ? 1 : 2;
case 2: return 2;
}
return 0;
}
}
class Test2 : ITest
{
public int Test(int a)
{
switch (a)
{
case 1: return 1;
case 2: return 2;
}
return 0;
}
}
class Program
{
static void Main(string[] args)
{
const long iterations = 200000000;
var test1 = new Test1();
var test2 = new Test2();
while (true)
{
var sw1 = Stopwatch.StartNew();
for (long i = 0; i < iterations; i++)
test1.Test(0);
sw1.Stop();
var sw2 = Stopwatch.StartNew();
for (long i = 0; i < iterations; i++)
test2.Test(0);
sw2.Stop();
var iterPerUsec1 = iterations / sw1.Elapsed.TotalMilliseconds / 1000;
var iterPerUsec2 = iterations / sw2.Elapsed.TotalMilliseconds / 1000;
Console.WriteLine("iterations per usec: " + (int) iterPerUsec1 + " / " + (int) iterPerUsec2 + " ratio: " + iterPerUsec1/iterPerUsec2);
}
}
}
interface ITest
{
int Test(int a);
}
Here's the output of a typical run where Test1 is consistently in fact more than 12% faster:
iterations per usec: 369 / 342 ratio: 1.07656329512607
iterations per usec: 367 / 314 ratio: 1.16820632522335
iterations per usec: 372 / 337 ratio: 1.10255744679504
iterations per usec: 374 / 342 ratio: 1.09248387354978
iterations per usec: 367 / 329 ratio: 1.11451205881061
iterations per usec: 375 / 340 ratio: 1.10041698470293
iterations per usec: 373 / 314 ratio: 1.19033461920118
iterations per usec: 366 / 334 ratio: 1.09808424282708
iterations per usec: 372 / 314 ratio: 1.18497411681768
iterations per usec: 377 / 342 ratio: 1.10482425370152
iterations per usec: 380 / 346 ratio: 1.09794853154766
iterations per usec: 385 / 342 ratio: 1.12737583603649
iterations per usec: 376 / 327 ratio: 1.15024393718844
iterations per usec: 374 / 332 ratio: 1.12400483908544
iterations per usec: 383 / 341 ratio: 1.12106159857722
iterations per usec: 380 / 345 ratio: 1.10267634674555
iterations per usec: 375 / 344 ratio: 1.09211401775982
iterations per usec: 384 / 334 ratio: 1.14958454236246
iterations per usec: 368 / 321 ratio: 1.14575850263002
iterations per usec: 378 / 335 ratio: 1.12732301818235
iterations per usec: 380 / 338 ratio: 1.12375853123099
iterations per usec: 386 / 344 ratio: 1.12213818994067
iterations per usec: 385 / 336 ratio: 1.14346447712043
iterations per usec: 374 / 345 ratio: 1.08448615249764
...
Benchmarking is a fine art, it is very difficult to reliably measure very fast code like this. In general, differences of 15% or less are not statistically significant results. I can only comment on the flaw in the posted code, it is a pretty common one. Typical Heisenburgian, it is the test itself that affects the outcome.
The second for() loop is not as optimized as the first for() loop. A problem caused by the optimizer's choices on which local variables to store in a CPU register. In particular a problem when you use long in a 32-bit program, it burns up two CPU registers. In all likelihood, the commenters that mentioned not being able to reproduce it ran the test with the x64 jitter.
You take the pressure off the CPU register allocator by moving the tests into a separate method:
static Stopwatch runTest1(Test1 test1, long iterations) {
var sw1 = Stopwatch.StartNew();
for (long i = 0; i < iterations; i++)
test1.Test(0);
sw1.Stop();
return sw1;
}
static Stopwatch runTest2(Test2 test2, long iterations) {
var sw2 = Stopwatch.StartNew();
for (long i = 0; i < iterations; i++)
test2.Test(0);
sw2.Stop();
return sw2;
}
static void Main(string[] args) {
const long iterations = 200000000;
var test1 = new Test1();
var test2 = new Test2();
while (true) {
var sw1 = runTest1(test1, iterations);
var sw2 = runTest2(test2, iterations);
// etc..
}
}
And you'll now get what you expect.
This change was cued by looking at the generated machine code. Tools > Options > Debugging > General > untick the Suppress JIT Optimization option. Then you can look at the optimized machine code with Debug > Windows > Disassembly. Which shows the first for() loop:
for (long i = 0; i < iterations; i++)
00000089 xor ebx,ebx
0000008b xor esi,esi
0000008d mov ecx,dword ptr [esp+4]
00000091 xor edx,edx
00000093 call dword ptr ds:[04732844h]
00000099 add ebx,1
0000009c adc esi,0
0000009f test esi,esi
000000a1 jg 000000AD
000000a3 jl 0000008D
000000a5 cmp ebx,0BEBC200h
000000ab jb 0000008D
And the second loop:
for (long i = 0; i < iterations; i++)
000000f5 mov dword ptr [esp+2Ch],0
000000fd xor ebx,ebx
000000ff mov ecx,dword ptr [esp]
00000102 xor edx,edx
00000104 call dword ptr ds:[056E28B8h]
0000010a mov eax,ebx
0000010c mov edx,dword ptr [esp+2Ch] ; <== here
00000110 add eax,1
00000113 adc edx,0
00000116 mov dword ptr [esp+2Ch],edx ; <== here
0000011a mov ebx,eax
0000011c cmp dword ptr [esp+2Ch],0
00000121 jg 0000012D
00000123 jl 000000FF
00000125 cmp ebx,0BEBC200h
0000012b jb 000000FF
I marked the instructions that made it slower. The first for() loop could store the loop variable in the esi:ebx registers and keep them there. That didn't work in the second for() loop, it ran out of available CPU registers and the top 32-bits of the loop variable had to be stored in the stack frame. That's slow.
Making a change in a program like this is not generic advice and not a universal solution. It just happened to be effective in this specific case. Only looking at the machine code can give you a cue that such a change might be useful. Which is about what it takes to hand-tune code. There's probably more to what you actually try to optimize, this benchmark is very unlikely to be representative of your actual code.
To investigate I took your code and modified it slightly. My main loop iterates 20 times and I have removed the part where you print the speed of the two test. Instead I store the difference in ticks (sw2.ElapsedTicks - sw1.ElapsedTicks
) in a preallocated array. This is just to keep the main loop as simple as possible.
At the end of the main loop I average the tick differences. A positive number means that the second test executed slower using more ticks than the first test.
I did this test with all permutations of possible test sequences: Test1
followed by Test2
, Test2
followed by Test1
, Test1
followed by Test1
and Test2
followed by Test2
. These are my results with optimizations turned on:
Tests | Average -------+--------- 1 2 | 248,747 2 1 | 313,372 1 1 | 234.812 2 2 | 210.533
So no matter how I run the tests the second test in the loop is always slower than the first.
My conclusion is that for some reason the first test in the main loop will execute slightly faster. I do not know why and I do not always get consistent results but at least this does not seem to be related to the code in Test1
and Test2
.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With