The problem: an apparently redundant line of code speeds a program up nearly twofold.
The original problem is rather difficult to present in full; it comes from a bounds-check-elimination algorithm. So here is just a simple test which I can't understand: one apparently redundant line of code makes the program run nearly twice as fast.
Here is the source:
#include <stdlib.h>
#include <stdio.h>

int main(void)
{
    long i = 0, a = 0, x = 0;
    int up = 200000000;
    int *values = malloc(sizeof(int) * up);
    for (i = 0; i < up; ++i)
    {
        values[i] = i % 2;
    }
    for (i = 0; i < up; ++i)
    {
        x = (a & i);
#ifdef FAST
        x = 0;
#endif
        a += values[x];
    }
    printf("a=%ld\n", a);
    return 0;
}/*main*/
In this example the value of a is always 0: a starts at 0, so x = (a & i) is always 0, and values[0] is 0, so a never changes. The line x = 0; is therefore redundant.
But (look -- NO optimizations at all!):
$gcc -O0 -o short short.c && time ./short
a=0
real 0m2.808s
user 0m2.196s
sys 0m0.596s
$gcc -O0 -DFAST -o short short.c && time ./short
a=0
real 0m1.869s
user 0m1.260s
sys 0m0.608s
And this is reproducible across many compilers, optimization options, and program variations.
Moreover, the two builds really produce the same assembly code, except for the store of this redundant 0 to a stack slot! E.g.:
gcc -S -O0 -DFAST short.c && mv short.s shortFAST.s
gcc -S -O0 short.c && mv short.s shortSLOW.s
diff shortFAST.s shortSLOW.s
55d54
< movq $0, -24(%rbp)
And, a bit later -- the same effect appeared on every other compiler/language I was able to test (including a Java JIT). The only thing they shared was the x86-64 architecture; it was reproduced on both Intel and AMD processors...
Short answer: Storing a 0 eliminates a read-after-write dependency in one of the loops.
Details:
I thought this was an interesting question, and although you focused on the -O0 optimization level, the same speedup is seen at -O3 as well. But looking at -O0 makes it easier to focus on what the processor is doing to optimize the code rather than the compiler, because, as you noted, the resulting assembly differs by only one instruction.
The assembly code for the loop of interest is shown below:
    movq $0, -32(%rbp)       ;; i = 0
    jmp .L4
.L5:
    movq -32(%rbp), %rax     ;; load i
    movq -24(%rbp), %rdx     ;; load a
    andq %rdx, %rax          ;; a & i
    movq %rax, -16(%rbp)     ;; x = a & i
    movq $0, -16(%rbp)       ;; x = 0 -- this instruction in FAST but not SLOW
    movq -16(%rbp), %rax     ;; load x
    leaq 0(,%rax,4), %rdx    ;; x * sizeof(int)
    movq -8(%rbp), %rax      ;; load values
    addq %rdx, %rax          ;; &values[x]
    movl (%rax), %eax        ;; load values[x]
    cltq                     ;; sign-extend to 64 bits
    addq %rax, -24(%rbp)     ;; a += values[x]
    addq $1, -32(%rbp)       ;; ++i
.L4:
    movl -36(%rbp), %eax     ;; load up
    cltq
    cmpq -32(%rbp), %rax     ;; loop while i < up
    jg .L5
Running with perf stat on my system I get the following results.
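For reference, the counters below came from invocations along these lines (slow_o0 and fast_o0 are just my names for the two builds):
gcc -O0 -o slow_o0 short.c && perf stat ./slow_o0
gcc -O0 -DFAST -o fast_o0 short.c && perf stat ./fast_o0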
Results for slow code
Performance counter stats for './slow_o0':
1827.438670 task-clock # 0.999 CPUs utilized
155 context-switches # 0.085 K/sec
1 CPU-migrations # 0.001 K/sec
195,448 page-faults # 0.107 M/sec
6,675,246,466 cycles # 3.653 GHz
4,391,690,661 stalled-cycles-frontend # 65.79% frontend cycles idle
1,609,321,845 stalled-cycles-backend # 24.11% backend cycles idle
7,157,837,211 instructions # 1.07 insns per cycle
# 0.61 stalled cycles per insn
490,110,757 branches # 268.195 M/sec
178,287 branch-misses # 0.04% of all branches
1.829712061 seconds time elapsed
Results for fast code
Performance counter stats for './fast_o0':
1109.451910 task-clock # 0.998 CPUs utilized
95 context-switches # 0.086 K/sec
1 CPU-migrations # 0.001 K/sec
195,448 page-faults # 0.176 M/sec
4,067,613,078 cycles # 3.666 GHz
1,784,131,209 stalled-cycles-frontend # 43.86% frontend cycles idle
438,447,105 stalled-cycles-backend # 10.78% backend cycles idle
7,356,892,998 instructions # 1.81 insns per cycle
# 0.24 stalled cycles per insn
489,945,197 branches # 441.610 M/sec
176,136 branch-misses # 0.04% of all branches
1.111398442 seconds time elapsed
So you can see that even though the "fast" code executes more instructions, it has far fewer stalls. When an out-of-order CPU (which most x86-64 implementations are) executes code, it keeps track of the dependencies between instructions; an instruction whose operands are ready can execute ahead of an earlier instruction that is still waiting for its inputs.
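To make that concrete, here is a small sketch of my own (hypothetical, not taken from the question): both loops perform the same additions, but the second splits them across two independent chains that an out-of-order core can overlap. Compiled with gcc -O0 (so that, as in the code above, every addition goes through memory), I would expect the second loop to finish in roughly half the time:
#include <stdio.h>
#include <time.h>

#define N 1000000000L

int main(void)
{
    long a = 0, b = 0, i;
    clock_t t0;

    /* One chain: each a += i must wait for the previous iteration's a. */
    t0 = clock();
    for (i = 0; i < N; ++i)
        a += i;
    printf("one chain:  %.2fs\n", (double)(clock() - t0) / CLOCKS_PER_SEC);

    /* Two chains covering the same values: the updates of a and b are
       independent, so the CPU can execute them in parallel. */
    t0 = clock();
    for (i = 0; i < N; i += 2)
    {
        a += i;
        b += i + 1;
    }
    printf("two chains: %.2fs\n", (double)(clock() - t0) / CLOCKS_PER_SEC);

    printf("a+b=%ld\n", a + b);  /* keep the sums live */
    return 0;
}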
In this example the critical point is likely this instruction sequence:
andq %rdx, %rax
movq %rax, -16(%rbp)
movq $0, -16(%rbp) ;; This instruction in FAST but not SLOW
movq -16(%rbp), %rax
leaq 0(,%rax,4), %rdx
movq -8(%rbp), %rax
In the fast code the movq -16(%rbp), %rax load will get the result of movq $0, -16(%rbp) forwarded to it and will be able to execute sooner. Whereas the slower version has to wait for movq %rax, -16(%rbp), whose value comes from the a & i computation and so carries a dependency from one loop iteration to the next.
Note that without knowing more about the specific microarchitecture this analysis is probably too simplistic. But I suspect the underlying cause is this dependency, and that performing the store of 0 (the movq $0, -16(%rbp) instruction) allows the CPU to perform more aggressive speculation when executing this code sequence.
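If you want to see the dependency in isolation, here is a hypothetical reduction of my own with the array removed, so that only the store to and load from x remains. The second loop has the same shape as the -DFAST build: the redundant constant store means the load of x no longer waits on the a & i chain. Built with gcc -O0 on x86-64, I would expect the second loop to be noticeably faster:
#include <stdio.h>
#include <time.h>

#define N 200000000L

int main(void)
{
    long a, i, x;
    clock_t t0;

    /* SLOW shape: the load of x waits for a store whose value depends on
       a, which in turn depends on the previous iteration's load. */
    a = 0;
    t0 = clock();
    for (i = 0; i < N; ++i)
    {
        x = a & i;
        a += x;
    }
    printf("dependent store: a=%ld, %.2fs\n", a,
           (double)(clock() - t0) / CLOCKS_PER_SEC);

    /* FAST shape: the redundant constant store lets the load of x be
       satisfied immediately by store-to-load forwarding. */
    a = 0;
    t0 = clock();
    for (i = 0; i < N; ++i)
    {
        x = a & i;
        x = 0;
        a += x;
    }
    printf("constant store:  a=%ld, %.2fs\n", a,
           (double)(clock() - t0) / CLOCKS_PER_SEC);

    return 0;
}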