 

What does the VS debugger do so that the increment operator performs faster than doing nothing?

Here is the self-explanatory code (performing an operation a billion times):

int k = 0;

Stopwatch sw = new Stopwatch();
sw.Start();
for (int a = 0; a < 1000; a++)
    for (int b = 0; b < 1000; b++)
        for (int c = 0; c < 1000; c++)
            k++;

sw.Stop();
Console.WriteLine(sw.ElapsedMilliseconds);

sw = new Stopwatch();
sw.Start();

for (int a = 0; a < 1000; a++)
    for (int b = 0; b < 1000; b++)
        for (int c = 0; c < 1000; c++)
            ; // NO-OP

sw.Stop();
Console.WriteLine(sw.ElapsedMilliseconds);

The results are (at least on my computer) somewhere around (in milliseconds)

2168
2564

The second is always about half a second longer.

How is it possible that incrementing a variable a billion times runs longer than doing a no-op the same number of times?

EDIT: This happens only in DEBUG builds. A Release build behaves correctly: the first loop takes longer, at least on my computer. As pointed out in the comments, someone experienced this problem even in a RELEASE build. But what happens in DEBUG that creates this effect?
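As a baseline for comparing answers, a minimal sketch of how this measurement might be made more reliably (the `TimeMs` helper and the warm-up pass are my additions, not from the question; compile in Release and run without a debugger):

```csharp
using System;
using System.Diagnostics;

// Hypothetical helper: run the action once untimed to warm up (so JIT
// compilation is excluded), then time a second run.
static long TimeMs(Action action)
{
    action(); // warm-up pass
    var sw = Stopwatch.StartNew();
    action();
    sw.Stop();
    return sw.ElapsedMilliseconds;
}

if (Debugger.IsAttached)
    Console.WriteLine("Warning: a debugger is attached; timings will not be meaningful.");

int n = 1000; // loop bound from the question
long k = 0;

long incrementMs = TimeMs(() =>
{
    k = 0; // reset so the final value reflects a single timed pass
    for (int a = 0; a < n; a++)
        for (int b = 0; b < n; b++)
            for (int c = 0; c < n; c++)
                k++;
});

long noopMs = TimeMs(() =>
{
    // Note: in a Release build the JIT may remove this empty loop entirely.
    for (int a = 0; a < n; a++)
        for (int b = 0; b < n; b++)
            for (int c = 0; c < n; c++)
                ;
});

Console.WriteLine($"increment: {incrementMs} ms, no-op: {noopMs} ms, k = {k}");
```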

asked Sep 18 '12 by Kornelije Petak



2 Answers

The problem is, as Azodious mentioned, that you can't use debug mode to measure the time, because it will not be accurate.

With release mode on, I get the following numbers:

Incrementing k: 445

NOP: 402

There are 4 more IL instructions in the incrementing version:

IL_0001:  ldc.i4.0    
IL_0002:  stloc.0     
IL_0003:  ldc.i4.0    
IL_0004:  stloc.1     
IL_0005:  br.s        IL_003B
IL_0007:  ldc.i4.0    
IL_0008:  stloc.2     
IL_0009:  br.s        IL_0029
IL_000B:  ldc.i4.0    
IL_000C:  stloc.3     
IL_000D:  br.s        IL_0017
IL_000F:  ldloc.0     
IL_0010:  ldc.i4.1    
IL_0011:  add         
IL_0012:  stloc.0     
IL_0013:  ldloc.3     
IL_0014:  ldc.i4.1    
IL_0015:  add         
IL_0016:  stloc.3     
IL_0017:  ldloc.3     
IL_0018:  ldc.i4      E8 03 00 00 
IL_001D:  clt         
IL_001F:  stloc.s     04 
IL_0021:  ldloc.s     04 
IL_0023:  brtrue.s    IL_000F
IL_0025:  ldloc.2     
IL_0026:  ldc.i4.1    
IL_0027:  add         
IL_0028:  stloc.2     
IL_0029:  ldloc.2     
IL_002A:  ldc.i4      E8 03 00 00 
IL_002F:  clt         
IL_0031:  stloc.s     04 
IL_0033:  ldloc.s     04 
IL_0035:  brtrue.s    IL_000B
IL_0037:  ldloc.1     
IL_0038:  ldc.i4.1    
IL_0039:  add         
IL_003A:  stloc.1     
IL_003B:  ldloc.1     
IL_003C:  ldc.i4      E8 03 00 00 
IL_0041:  clt         
IL_0043:  stloc.s     04 
IL_0045:  ldloc.s     04 
IL_0047:  brtrue.s    IL_0007

The NOP version has an equal number of branches but one fewer add:

IL_0001:  ldc.i4.0    
IL_0002:  stloc.0     
IL_0003:  ldc.i4.0    
IL_0004:  stloc.1     
IL_0005:  br.s        IL_0037
IL_0007:  ldc.i4.0    
IL_0008:  stloc.2     
IL_0009:  br.s        IL_0025
IL_000B:  ldc.i4.0    
IL_000C:  stloc.3     
IL_000D:  br.s        IL_0013
IL_000F:  ldloc.3     
IL_0010:  ldc.i4.1    
IL_0011:  add         
IL_0012:  stloc.3     
IL_0013:  ldloc.3     
IL_0014:  ldc.i4      E8 03 00 00 
IL_0019:  clt         
IL_001B:  stloc.s     04 
IL_001D:  ldloc.s     04 
IL_001F:  brtrue.s    IL_000F
IL_0021:  ldloc.2     
IL_0022:  ldc.i4.1    
IL_0023:  add         
IL_0024:  stloc.2     
IL_0025:  ldloc.2     
IL_0026:  ldc.i4      E8 03 00 00 
IL_002B:  clt         
IL_002D:  stloc.s     04 
IL_002F:  ldloc.s     04 
IL_0031:  brtrue.s    IL_000B
IL_0033:  ldloc.1     
IL_0034:  ldc.i4.1    
IL_0035:  add         
IL_0036:  stloc.1     
IL_0037:  ldloc.1     
IL_0038:  ldc.i4      E8 03 00 00 
IL_003D:  clt         
IL_003F:  stloc.s     04 
IL_0041:  ldloc.s     04 
IL_0043:  brtrue.s    IL_0007

These were compiled with optimizations off, because I want to see exactly what's going on.

The only real difference between them is the four instructions that implement `k++` (load `k`, push 1, add, store `k` back):

IL_000F:  ldloc.0     
IL_0010:  ldc.i4.1    
IL_0011:  add         
IL_0012:  stloc.0

Simply put: You're getting weird numbers because you're in debug mode.

answered Oct 22 '22 by Filip Ekberg

Beyond testing the wrong code, a core mistake you made was assuming that you measured the cost of the increment operator. You didn't; you measured the cost of the for() loops, which take many more CPU cycles than the increment.

An issue with a for() loop is that the CPU is forced to branch, jumping back to the beginning of the loop. Modern CPUs do not like branching much; they are optimized to execute code sequentially. That is a side effect of the pipeline, a core architectural detail designed to make a processor execute code fast. A branch may force the processor to flush the pipeline, throwing away a lot of work that turned out to be useless. A lot of resources in a CPU design go toward reducing the cost of having to flush the pipeline. A core part is the branch predictor: it tries to guess up front which way a branch will go, so it can fill the pipeline with instructions that are likely to be executed. Guessing wrong is very expensive. You don't have too much to fear from this if your for() loop is long enough.
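The branch predictor's effect can be made visible directly by running the same data-dependent branch over random versus sorted input. This sketch is my illustration, not from the answer; the array size, value range, threshold, and pass count are arbitrary:

```csharp
using System;
using System.Diagnostics;

// Sum the elements at or above a threshold. The `if` inside the loop is the
// branch whose direction the CPU's predictor must guess each iteration.
static long SumAbove(int[] data, int threshold)
{
    long sum = 0;
    foreach (int v in data)
        if (v >= threshold) // erratic on random data, highly regular on sorted data
            sum += v;
    return sum;
}

var rng = new Random(42);
int[] data = new int[1_000_000];
for (int i = 0; i < data.Length; i++)
    data[i] = rng.Next(256);

SumAbove(data, 128); // warm-up so JIT compilation is not timed

var sw = Stopwatch.StartNew();
long unsortedSum = 0;
for (int pass = 0; pass < 50; pass++)
    unsortedSum = SumAbove(data, 128);
sw.Stop();
long unsortedMs = sw.ElapsedMilliseconds;

Array.Sort(data); // same values, but now the branch outcome changes only once

sw.Restart();
long sortedSum = 0;
for (int pass = 0; pass < 50; pass++)
    sortedSum = SumAbove(data, 128);
sw.Stop();

// Identical work, identical sums; the sorted pass is typically faster
// because the predictor is almost always right.
Console.WriteLine($"unsorted: {unsortedMs} ms, sorted: {sw.ElapsedMilliseconds} ms, equal sums: {unsortedSum == sortedSum}");
```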

Another issue with modern processors is that they are very sensitive to the alignment of the branch target, in other words the address of the first instruction of the loop. If it is misaligned, not at an address divisible by 4 or 8, then the prefetch unit needs extra cycles to start decoding the correct instruction. That's an implementation detail the jitter needs to take care of; it may have to insert extra NOP instructions to get the instruction aligned. The x86 jitter does not perform that optimization; the x64 jitter does.

An observable side-effect of alignment problems is that swapping the two pieces of code may affect your measurement.

Benchmarking code is a perilous adventure on modern CPUs; the odds are not good that real code will behave the way a synthetic, profiled version of it did. Differences of 15% or less are not statistically meaningful.
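One common mitigation for that noise, under the usual caveats, is to repeat the measurement many times and report the median rather than trusting a single run. The `MedianMs` helper and the loop bound below are illustrative assumptions, not part of the answer:

```csharp
using System;
using System.Diagnostics;

// Hypothetical helper: time `action` `runs` times and return the median,
// which is less sensitive to one-off outliers (GC pauses, scheduler
// hiccups) than a single measurement or the mean.
static double MedianMs(Action action, int runs)
{
    action(); // warm-up so JIT compilation is excluded from the samples
    var samples = new double[runs];
    var sw = new Stopwatch();
    for (int i = 0; i < runs; i++)
    {
        sw.Restart();
        action();
        sw.Stop();
        samples[i] = sw.Elapsed.TotalMilliseconds;
    }
    Array.Sort(samples);
    return runs % 2 == 1
        ? samples[runs / 2]                                  // odd count: middle sample
        : (samples[runs / 2 - 1] + samples[runs / 2]) / 2.0; // even count: mean of the two middle samples
}

long k = 0;
double ms = MedianMs(() => { for (int i = 0; i < 10_000_000; i++) k++; }, runs: 21);
Console.WriteLine($"median: {ms:F2} ms over 21 runs");
```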

answered Oct 22 '22 by Hans Passant