I am disassembling this code on llvm clang Apple LLVM version 8.0.0 (clang-800.0.42.1):
#include <stdio.h>

int main() {
    float a=0.151234;
    float b=0.2;
    float c=a+b;
    printf("%f", c);
}
I compiled with no -O option, but I also tried with -O0 (gives the same result) and -O2 (which actually computes the value at compile time and stores it precomputed).
The resulting disassembly is the following (I removed the parts that are not relevant):
-> 0x100000f30 <+0>: pushq %rbp
0x100000f31 <+1>: movq %rsp, %rbp
0x100000f34 <+4>: subq $0x10, %rsp
0x100000f38 <+8>: leaq 0x6d(%rip), %rdi
0x100000f3f <+15>: movss 0x5d(%rip), %xmm0
0x100000f47 <+23>: movss 0x59(%rip), %xmm1
0x100000f4f <+31>: movss %xmm1, -0x4(%rbp)
0x100000f54 <+36>: movss %xmm0, -0x8(%rbp)
0x100000f59 <+41>: movss -0x4(%rbp), %xmm0
0x100000f5e <+46>: addss -0x8(%rbp), %xmm0
0x100000f63 <+51>: movss %xmm0, -0xc(%rbp)
...
Apparently it's doing the following:
I find it inefficient because:
Given that the compiler is always right, why did it choose this strategy?
-O0 (unoptimized) is the default. It tells the compiler you want it to compile fast (short compile times), not to take extra time compiling to make efficient code.
(-O0 isn't literally no optimization; e.g. gcc will still eliminate code inside if(1 == 2){ } blocks. gcc in particular, more than most other compilers, still does things like use multiplicative inverses for division at -O0, because it still transforms your C source through multiple internal representations of the logic before eventually emitting asm.)
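A minimal sketch of source that exercises both cases, assuming gcc targeting x86-64. The function names dead_branch and div_by_10 are made up for illustration, and the comments describe what gcc typically emits rather than a guarantee; compile with something like gcc -O0 -S and read the output to check your own compiler version.

```c
/* Hypothetical functions for inspecting -O0 output.
   Look at the generated asm; there is no need to run them. */

int dead_branch(int x) {
    if (1 == 2) {
        /* Constant-false condition: gcc typically emits no
           instructions for this block, even at -O0. */
        x += 1000;
    }
    return x;
}

unsigned div_by_10(unsigned x) {
    /* Division by a compile-time constant: gcc typically emits a
       multiply and shift (a fixed-point multiplicative inverse)
       instead of a div instruction, even at -O0. */
    return x / 10;
}
```

Either transformation leaves the C-level behaviour unchanged, which is why it is safe even in a debug build.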
Plus, "the compiler is always right" is an exaggeration even at -O3. Compilers are very good at a large scale, but minor missed-optimizations are still common within single loops. The impact is often very low, but wasted instructions (or uops) in a loop can eat up space in the out-of-order execution reordering window, and be less hyper-threading friendly when sharing a core with another thread. See "C++ code for testing the Collatz conjecture faster than hand-written assembly - why?" for more about beating the compiler in a simple specific case.
More importantly, -O0 also implies treating all variables similar to volatile for consistent debugging, i.e. so you can set a breakpoint or single step and modify the value of a C variable, and then continue execution and have the program work the way you'd expect from your C source running on the C abstract machine. So the compiler can't do any constant-propagation or value-range simplification. (e.g. an integer that's known to be non-negative can simplify things using it, or make some if conditions always true or always false.)
(It's not quite as bad as volatile: multiple references to the same variable within one statement don't always result in multiple loads; at -O0 compilers will still optimize somewhat within a single expression.)
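To make the volatile-like treatment concrete, here is a small sketch. The functions scale and square_plus are hypothetical, and the comments describe typical -O0 behaviour, not something every compiler version is guaranteed to do.

```c
/* Hypothetical examples; inspect the -O0 asm rather than running them. */

int scale(int x) {
    int factor = 4;           /* at -O0: stored to its stack slot */
    /* A debugger stopped here could overwrite `factor` in memory, so the
       next statement has to reload it instead of assuming it is still 4. */
    int result = x * factor;  /* at -O0: reloads x and factor, stores result */
    return result;            /* at -O0: reloads result */
}

int square_plus(int x) {
    /* Three uses of x in one expression. Unlike a real volatile, the
       compiler is allowed to reuse an already-loaded value here. */
    return x * x + x;
}
```

With optimization enabled, scale would normally reduce to a single shift or lea, because constant propagation is allowed again.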
Compilers have to specifically anti-optimize for -O0 by storing/reloading all variables to their memory address between statements. (In C and C++, every variable has an address unless it was declared with the (now obsolete) register keyword and has never had its address taken. Optimizing away the address is possible according to the as-if rule for other variables, but isn't done at -O0.)
Unfortunately, debug-info formats can't track the location of a variable through registers, so fully consistent debugging isn't possible without this slow-and-stupid code-gen.
If you don't need this, you can compile with -Og for light optimization, and without the anti-optimizations required for consistent debugging. The GCC manual recommends it for the usual edit/compile/run cycle, but you will get "optimized out" for many local variables with automatic storage when debugging. Globals and function args still usually have their actual values, at least at function boundaries.
Even worse, -O0 makes code that still works even if you use GDB's jump command to continue execution at a different source line. So each C statement has to be compiled into a fully independent block of instructions. (Is it possible to "jump"/"skip" in GDB debugger?)
for() loops can't be transformed into idiomatic (for asm) do{}while() loops, and there are other restrictions.
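The loop transformation mentioned above can be sketched at the C level. This is an illustration written by hand, not compiler output: sum_for is the loop as a programmer would write it, and sum_dowhile is roughly the shape an optimizer produces (one up-front guard, then a bottom-tested loop with a single conditional branch per iteration). Both names are hypothetical.

```c
#include <stddef.h>

/* As written in the source: the condition is tested at the top. */
int sum_for(const int *a, size_t n) {
    int sum = 0;
    for (size_t i = 0; i < n; i++) {
        sum += a[i];
    }
    return sum;
}

/* Roughly the optimized shape: check once whether the loop runs at all,
   then test the condition at the bottom of each iteration. */
int sum_dowhile(const int *a, size_t n) {
    int sum = 0;
    size_t i = 0;
    if (n == 0) {
        return sum;        /* guard: the loop body must not run for n == 0 */
    }
    do {
        sum += a[i];
        i++;
    } while (i < n);
    return sum;
}
```

At -O0 the compiler keeps the top-tested form, since jumping into the middle of a rearranged loop from a debugger would not behave like the source.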
For all the above reasons, (micro-)benchmarking un-optimized code is a huge waste of time; the results depend on silly details of how you wrote the source that don't matter when you compile with normal optimization. -O0 vs. -O3 performance is not linearly related; some code will speed up much more than others.
The bottlenecks in -O0 code will often be different from -O3, often on a loop counter that's kept in memory, creating a ~6-cycle loop-carried dependency chain. This can create interesting effects in the compiler-generated asm like "Adding a redundant assignment speeds up code when compiled without optimization" (which are interesting from an asm perspective, but not for C).
"My benchmark optimized away otherwise" is not a valid justification for looking at the performance of -O0 code.
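One common way to keep a benchmark alive with optimization enabled is to make the compiler believe each result is used. The sketch below is not from the answer itself; it assumes gcc or clang (it relies on GNU C inline asm), and keep_int and sum_to are hypothetical names.

```c
/* GNU C extension: an empty asm statement that takes `v` as a register
   input. No instruction is emitted, but the compiler has to actually
   compute `v`, so the work feeding it cannot be deleted. */
static inline void keep_int(int v) {
    __asm__ volatile("" : : "r"(v));
}

int sum_to(int n) {
    int sum = 0;
    for (int i = 0; i < n; i++) {
        sum += i;
        keep_int(sum);   /* each partial sum must exist in a register */
    }
    return sum;
}
```

Compiled at -O2 or -O3, the loop still runs n iterations instead of collapsing into a closed-form expression, while the loop counter and sum stay in registers as they would in real optimized code.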
See "C loop optimization help for final assignment" for an example and more details about the rabbit hole that tuning for -O0 is.
If you want to see how the compiler adds 2 variables, write a function that takes args and returns a value. Remember you only want to look at the asm, not run it, so you don't need a main or any numeric literal values for anything that should be a runtime variable.
See also How to remove "noise" from GCC/clang assembly output? for more about this.
float foo(float a, float b) {
    float c=a+b;
    return c;
}
compiles with clang -O3 (on the Godbolt compiler explorer) to the expected
    addss   xmm0, xmm1
    ret
But with -O0 it spills the args to stack memory. (Godbolt uses debug info emitted by the compiler to colour-code asm instructions according to which C statement they came from. I've added line breaks to show blocks for each statement, but you can see this with colour highlighting on the Godbolt link above. Often very handy for finding the interesting part of an inner loop in optimized compiler output.)
gcc -fverbose-asm will put comments on every line showing the operand names as C vars. In optimized code that's often an internal tmp name, but in un-optimized code it's usually an actual variable from the C source. I've manually commented the clang output because it doesn't do that.
# clang7.0 -O0 also on Godbolt
foo:
    push    rbp
    mov     rbp, rsp                   # make a traditional stack frame
    movss   DWORD PTR [rbp-20], xmm0   # spill the register args
    movss   DWORD PTR [rbp-24], xmm1   # into the red zone (below RSP)

    movss   xmm0, DWORD PTR [rbp-20]   # a
    addss   xmm0, DWORD PTR [rbp-24]   # +b
    movss   DWORD PTR [rbp-4], xmm0    # store c

    movss   xmm0, DWORD PTR [rbp-4]    # return c
    pop     rbp                        # epilogue
    ret
Fun fact: using register float c = a+b;, the return value can stay in XMM0 between statements, instead of being spilled/reloaded. The variable has no address. (I included that version of the function in the Godbolt link.)
The register keyword has no effect in optimized code (except making it an error to take a variable's address, like how const on a local stops you from accidentally modifying something). I don't recommend using it, but it's interesting to see that it does actually affect un-optimized code.
__attribute__((always_inline)) can force inlining, but doesn't optimize away the copying to create the function args, let alone optimize the function into the caller.
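A minimal sketch of that, assuming gcc or clang (the attribute is a GNU extension); add_inline and caller are hypothetical names.

```c
/* always_inline makes the compiler inline this even at -O0. */
static inline __attribute__((always_inline))
float add_inline(float a, float b) {
    return a + b;
}

float caller(float x, float y) {
    /* At -O0 there should be no call instruction here, but x and y are
       typically still copied into stack slots for a and b before the add. */
    return add_inline(x, y);
}
```

So the call overhead goes away, while the per-statement store/reload pattern stays.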