GCC and Clang both compile
bool pred();
void f();
void g();
void h() {
    if (pred()) {
        f();
    } else {
        g();
    }
}
to some variation of
# Clang -Os output. -O3 is the same
h():
        push rax
        call pred()@PLT
        test al, al
        je   .LBB0_2
        pop  rax
        jmp  f()@PLT
.LBB0_2:
        pop  rax
        jmp  g()@PLT
when optimizing for code size. You can see this on Compiler Explorer.
Is there a reason for emitting a pop instruction on both branches instead of emitting it only once before the je? For example, popping into a different call-clobbered register to avoid destroying the pred() return value, or using add rsp, 8 (which is maybe not actually faster on modern CPUs in this case, since it needs a stack-sync uop).
# hand-written example of what compilers could be doing
h():
        push rax        # align the stack
        call pred()@PLT
        pop  rcx        # clean up the stack, restoring the stack pointer to its initial state
        test al, al
        je   .LBB0_2
        jmp  f()@PLT    # tailcall f
.LBB0_2:
        jmp  g()@PLT    # tailcall g
You're right, it could do call / pop rcx / test al,al / ... In this case both paths just tailcall without any other work. It would be smaller static code size, and not obviously better or worse in any other way. GCC/clang should be doing what you suggest at -O3 as well.
It's not always possible to destroy the stack frame before branching and setting up the args (if any) for a tailcall, so there are a bunch of things a compiler would have to check to do this safely in the general case. That kind of reason is common for missed optimizations in tiny simple functions: code that looks for them has to be correct for non-tiny functions, and the benefit has to be worth the compile time and the maintenance cost of the code in the GCC/LLVM source.
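For example, here's a minimal counter-example sketch (hypothetical code, not from the question) where early teardown isn't safe: one path makes a non-tail call, so the alignment padding has to stay live across that call and the pop can't simply be hoisted above the branch.

// Hypothetical variant of h() where early teardown is unsafe: the
// then-branch makes a non-tail call, so RSP must stay 16-byte aligned
// across the call to f(). Popping the padding before the branch would
// misalign the stack for that call.
bool pred();
void f();
void g();

void h2() {
    if (pred()) {
        f();   // non-tail call: the padded frame must survive this
        g();   // only this call can be a tailcall
    } else {
        g();   // tailcall; tearing down before the jmp is fine here
    }
}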
It does put more branch instructions closer to each other, which can be problematic for branch prediction in some CPUs. (Even unconditional branches need some prediction to avoid stalling the front-end, since code-fetch is pipelined with decode.) Prediction accuracy can be lower when a lot of branches are nearby. pop is only a 1-byte instruction so probably doesn't make much difference.
This is a missed optimization you can report on https://gcc.gnu.org/bugzilla/ and https://github.com/llvm/llvm-project/issues/ if there aren't already existing duplicates. Relevant search terms might include "tail duplication", "epilogue", "tear down stack frame".
Finding this optimization is a bit like the opposite of shrink-wrap optimization. (Which is when you do the prologue after an early-out branch that might mean it's not needed.) This would be tearing down the stack frame early, while there's still some work left to do, only involving values in call-clobbered registers.
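For reference, a minimal sketch (with an assumed helper name) of the kind of function shrink-wrapping targets:

// Hypothetical example: with shrink-wrapping, the x == 0 fast path
// returns before any prologue runs; only the slow path pays for the
// stack frame its local buffer needs.
void consume(const int*);   // assumed external helper

void maybe_work(int x) {
    if (x == 0)
        return;             // early-out: ideally no push/pop executed
    int buf[16] = {0};      // slow path actually needs stack space
    buf[0] = x;
    consume(buf);
}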
BTW, there's another missed optimization here: x86-64 direct jmp rel32 has the same range as jcc rel32, so it could be using a JCC as a tailcall. (GCC bug 82516 has a short reply from Richard Biener which sheds some light on why GCC misses it.)
# Hand-written with both optimizations applied
h:
        push rax        # Align the stack
        call pred()@PLT
        pop  rcx        # Clean up the stack, restoring the stack
                        # pointer to its initial state
        test al, al
        je   g()@PLT
        jmp  f()@PLT    # Tailcall f
Unless you compile with -fno-plt, in which case you get jmp qword ptr [rip + f()@GOTPCREL] # TAILCALL, which has no JCC equivalent, instead of jmp f()@PLT.
(If f and g are present at link time, the linker can relax the indirect jump to addr32 jmp f, with the 67h addr32 prefix byte just filling space to make it the same length as the indirect jump it's replacing. Just like how it can relax jumps/calls to f@PLT into just f if it's statically linked, not in a different shared object.)
Stuff like __attribute__((visibility("hidden"))) could tell the compiler that f and g are internal to the thing it's compiling (either the main executable or a shared library), so it can use direct calls in the first place.
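For example (a sketch; the attribute is real GCC/Clang syntax, applied here to the declarations from the question):

// With hidden visibility the compiler knows f and g resolve within the
// executable/DSO being built, so it can emit direct jmp f / jmp g
// instead of jumping through the PLT.
__attribute__((visibility("hidden"))) void f();
__attribute__((visibility("hidden"))) void g();
bool pred();

void h() {
    if (pred()) {
        f();
    } else {
        g();
    }
}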
I cannot speak for GCC, as I don't have the personal experience or the tools to easily inspect GCC's optimization passes.
As for Clang, it's important to understand that the overwhelming majority of optimizations occur in the middle-end of the compiler, i.e. within LLVM IR, irrespective of platform specifics and calling conventions. The LLVM IR at the end of all optimization passes looks as follows:
define dso_local void @h()() local_unnamed_addr {
entry:
  %call = tail call noundef zeroext i1 @pred()()
  br i1 %call, label %if.then, label %if.else

if.then:                                          ; preds = %entry
  tail call void @f()()
  ret void

if.else:                                          ; preds = %entry
  tail call void @g()()
  ret void
}
This IR is as optimal as it gets at the end of the middle-end.
The POP instructions are generated by the Prologue/Epilogue Insertion & Frame Finalization (prologepilog) pass (see this Compiler Explorer example) for both calls. Note that the push and pop instructions are merely used to align the stack; the use of rax is arbitrary.
While it's possible in principle to eliminate common instructions between multiple branches, I guess LLVM developers don't consider it worth the effort. This makes sense; this late into the optimization pipeline, you're scraping the bottom of the barrel optimizations-wise, and only eliminating a few unnecessary instructions related to stack alignment.