 

gcc simple arithmetic loop performance

Tags: c, gcc

The problem: one apparently redundant line of code speeds up a program nearly twofold.

The original problem is hard to state concisely; it comes from a bounds-check elimination algorithm. So here is just a simple test that I cannot explain: one apparently redundant line of code makes the program run nearly twice as fast.

Here is the source:

#include <stdlib.h>
#include <stdio.h>

int main(void)
{
   long i = 0, a = 0, x = 0;
   int  up = 200000000;

   int *values = malloc(sizeof(int)*up);

   for (i = 0; i < up ; ++i)
   {
        values[i]=i % 2;
   }
   for (i = 0; i < up  ; ++i)
   {
      x  =  (a & i);
#ifdef FAST
      x = 0;
#endif
      a += values[x];
   }
   printf ("a=%ld\n", a);
   return 0;
}/*main*/

In this example the value of a is always 0 (values[0] is 0, so a never changes and a & i stays 0). The line x = 0; is therefore redundant.

But note (with no optimizations at all!):
$gcc -O0 -o short short.c && time ./short
a=0
real 0m2.808s
user 0m2.196s
sys 0m0.596s

$gcc -O0 -DFAST -o short short.c && time ./short
a=0
real 0m1.869s
user 0m1.260s
sys 0m0.608s

This is reproducible with many compilers, optimization options, and variations of the program.

Moreover, the two variants really produce the same assembly code except for the one extra instruction that stores 0 into the stack slot holding x. E.g.:

gcc -S -O0 -DFAST short.c && mv short.s shortFAST.s
gcc -S -O0 short.c && mv short.s shortSLOW.s
diff shortFAST.s shortSLOW.s
55d54
< movq $0, -24(%rbp)

A bit later: the same effect shows up with every other compiler and language I was able to test (including the Java JIT). The only shared factor is the x86-64 architecture; it was tested on both Intel and AMD processors.

Asked Oct 20 '14 by Mikhail Tentyukov



1 Answer

Short answer: Storing a 0 eliminates a read-after-write dependency in one of the loops.

Details:

I thought this was an interesting question, and although you focused on -O0, the same speedup is seen at -O3 as well. Looking at -O0 makes it easier to focus on what the processor, rather than the compiler, is doing to optimize the code, because as you noted the resulting assembly differs by only one instruction.
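
For reference, the -O3 comparison can be reproduced with the same kind of commands as in the question (the binary names here are just placeholders):

$gcc -O3 -o slow_o3 short.c && time ./slow_o3
$gcc -O3 -DFAST -o fast_o3 short.c && time ./fast_o3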

The assembly code for the loop of interest is shown below:

  movq  $0, -32(%rbp)     ; i = 0
  jmp .L4
.L5:
  movq  -32(%rbp), %rax   ; load i
  movq  -24(%rbp), %rdx   ; load a
  andq  %rdx, %rax        ; a & i
  movq  %rax, -16(%rbp)   ; x = a & i
  movq  $0, -16(%rbp)     ;; This instruction in FAST but not SLOW (x = 0)
  movq  -16(%rbp), %rax   ; load x
  leaq  0(,%rax,4), %rdx  ; x * sizeof(int)
  movq  -8(%rbp), %rax    ; load the values pointer
  addq  %rdx, %rax        ; &values[x]
  movl  (%rax), %eax      ; load values[x]
  cltq                    ; sign-extend to 64 bits
  addq  %rax, -24(%rbp)   ; a += values[x]
  addq  $1, -32(%rbp)     ; ++i
.L4:
  movl  -36(%rbp), %eax   ; load up
  cltq
  cmpq  -32(%rbp), %rax   ; i < up ?
  jg  .L5

Running with perf stat on my system I get the following results:
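
(The answer does not show the exact command line; presumably it was something along the lines of

$perf stat ./slow_o0
$perf stat ./fast_o0

for the two binaries.)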

Results for slow code

Performance counter stats for './slow_o0':

   1827.438670 task-clock                #    0.999 CPUs utilized          
           155 context-switches          #    0.085 K/sec                  
             1 CPU-migrations            #    0.001 K/sec                  
       195,448 page-faults               #    0.107 M/sec                  
 6,675,246,466 cycles                    #    3.653 GHz                    
 4,391,690,661 stalled-cycles-frontend   #   65.79% frontend cycles idle   
 1,609,321,845 stalled-cycles-backend    #   24.11% backend  cycles idle   
 7,157,837,211 instructions              #    1.07  insns per cycle        
                                         #    0.61  stalled cycles per insn
   490,110,757 branches                  #  268.195 M/sec                  
       178,287 branch-misses             #    0.04% of all branches        

   1.829712061 seconds time elapsed

Results for fast code

 Performance counter stats for './fast_o0':

   1109.451910 task-clock                #    0.998 CPUs utilized          
            95 context-switches          #    0.086 K/sec                  
             1 CPU-migrations            #    0.001 K/sec                  
       195,448 page-faults               #    0.176 M/sec                  
 4,067,613,078 cycles                    #    3.666 GHz                    
 1,784,131,209 stalled-cycles-frontend   #   43.86% frontend cycles idle   
   438,447,105 stalled-cycles-backend    #   10.78% backend  cycles idle   
 7,356,892,998 instructions              #    1.81  insns per cycle        
                                         #    0.24  stalled cycles per insn
   489,945,197 branches                  #  441.610 M/sec                  
       176,136 branch-misses             #    0.04% of all branches        

   1.111398442 seconds time elapsed

So you can see that even though the "fast" code executes more instructions, it has far fewer stalls. When an out-of-order CPU (which covers most modern x86-64 designs) executes code, it tracks dependencies between instructions, and an instruction whose operands are ready can execute ahead of an earlier instruction that is still waiting for its inputs.

In this example the critical point is likely this instruction sequence:

  andq  %rdx, %rax
  movq  %rax, -16(%rbp)
  movq  $0, -16(%rbp)     ;; This instruction in FAST but not SLOW
  movq  -16(%rbp), %rax  
  leaq  0(,%rax,4), %rdx
  movq  -8(%rbp), %rax    

In the fast code the movq -16(%rbp), %rax instruction (the load of x) gets its value forwarded from movq $0, -16(%rbp), a store of a constant with no input dependencies, so it can execute sooner. In the slower version that load has to wait for movq %rax, -16(%rbp), whose value depends on a from the previous iteration, so the dependency chain carries across loop iterations.
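
Put differently, here is a sketch of the per-iteration critical path implied by the listing above (using the stack-slot-to-variable mapping noted in the comments; the exact behavior is microarchitecture dependent):

SLOW (x = a & i): the load address depends on the previous iteration's a

  a from the previous iteration
    -> movq  -24(%rbp), %rdx    ; load a
    -> andq  %rdx, %rax         ; a & i
    -> movq  %rax, -16(%rbp)    ; store x
    -> movq  -16(%rbp), %rax    ; load x (store-to-load forwarding)
    -> movl  (%rax), %eax       ; load values[x]
    -> addq  %rax, -24(%rbp)    ; a += values[x]; the next iteration waits here

FAST (x = 0): the store of the constant has no inputs

  movq  $0, -16(%rbp)           ; independent of a
    -> movq  -16(%rbp), %rax    ; load x can start immediately
    -> movl  (%rax), %eax       ; load values[0]
    -> addq  %rax, -24(%rbp)    ; only this add still depends on the previous a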

Note that without knowing more about the specific microarchitecture this analysis is probably too simplistic. But I suspect the underlying cause is this dependency and that performing the store of 0 (the movq $0, -16(%rbp) instruction) allows the CPU to perform more aggressive speculation when executing the code sequence.
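
For illustration, here is a minimal standalone C sketch (not part of the original answer) that isolates the same kind of loop-carried dependency: in the first loop the load address depends on the previous iteration's result, in the second it does not. Built with gcc -O0, the first loop should be noticeably slower; exact timings vary by machine.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void)
{
   long i, a = 0;
   int  up = 200000000;

   int *values = malloc(sizeof(int)*up);
   if (values == NULL)
      return 1;

   for (i = 0; i < up; ++i)
      values[i] = i % 2;

   clock_t t0 = clock();
   for (i = 0; i < up; ++i)
      a += values[a & i];      /* load address depends on the previous a */
   clock_t t1 = clock();
   for (i = 0; i < up; ++i)
      a += values[0];          /* load address is a constant */
   clock_t t2 = clock();

   printf("a=%ld dependent=%.2fs independent=%.2fs\n", a,
          (double)(t1 - t0)/CLOCKS_PER_SEC,
          (double)(t2 - t1)/CLOCKS_PER_SEC);
   free(values);
   return 0;
}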

Answered Oct 01 '22 by Gabriel Southern