
Why do GCC and Clang pop on both branches instead of only once? (Factoring parts of the epilogue out of tail-duplication)

GCC and Clang both compile

bool pred();
void f();
void g();

void h() {
    if (pred()) {
        f();
    } else {
        g();
    }
}

to some variation of

# Clang -Os output.  -O3 is the same
h():
    push    rax
    call    pred()@PLT
    test    al, al
    je      .LBB0_2
    pop     rax
    jmp     f()@PLT
.LBB0_2:
    pop     rax
    jmp     g()@PLT

when optimizing for code size. You can see this on Compiler Explorer.

Is there a reason for emitting a pop instruction on both branches instead of emitting it only once before je?

For example, it could pop into a different call-clobbered register to avoid destroying the pred() return value, or use add rsp, 8 (which may not actually be faster on modern CPUs in this case, since it needs a stack-sync uop).

# hand-written example of what compilers could be doing
h():
    push    rax             # align the stack
    call    pred()@PLT
    pop     rcx             # clean up the stack, restoring the stack pointer to its initial state
    test    al, al
    je      .LBB0_2
    jmp     f()@PLT        # tailcall f
.LBB0_2:
    jmp     g()@PLT        # tailcall g
asked Sep 13 '25 16:09 by Bernardo Sulzbach

2 Answers

You're right, it could call / pop rcx / test al,al / ...

In this case both paths just tail-call without doing any other work. Your version would have smaller static code size, and is not obviously better or worse in any other way. GCC/Clang should be doing what you suggest at -O3 as well.

It's not always possible to destroy the stack frame before branching and setting up the args (if any) for a tailcall, so there's a bunch of things a compiler would have to check on to do this safely in the general case. That kind of reason is common for missed optimizations in tiny simple functions: code that looks for them has to be correct for non-tiny functions, and the benefit has to be worth the compile time, and maintenance cost of the code in the GCC / LLVM source.

It does put more branch instructions closer to each other, which can be problematic for branch prediction in some CPUs. (Even unconditional branches need some prediction to avoid stalling the front-end, since code-fetch is pipelined with decode.) Prediction accuracy can be lower when a lot of branches are nearby. pop is only a 1-byte instruction so probably doesn't make much difference.

This is a missed optimization you can report on https://gcc.gnu.org/bugzilla/ and https://github.com/llvm/llvm-project/issues/ if there aren't already existing duplicates. Relevant search terms might include "tail duplication", "epilogue", "tear down stack frame".

Finding this optimization is a bit like the opposite of shrink-wrap optimization (where the prologue is delayed until after an early-out branch that might make it unnecessary). This would be tearing down the stack frame early, while there's still some work left to do, involving only values in call-clobbered registers.
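As a rough illustration of the shrink-wrap shape (a hypothetical function, not from the question): the prologue can be sunk past an early-out branch whose path needs no frame at all.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical shape where shrink-wrapping pays off: the n == 0 early-out
// can return without any prologue; only the loop path might spill or use
// callee-saved registers, so compilers can sink the frame setup past the
// first branch.
long sum_squares(const int* data, std::size_t n) {
    if (n == 0) return 0;          // fast path: no stack frame needed
    long total = 0;                // slow path: the real work
    for (std::size_t i = 0; i < n; ++i)
        total += static_cast<long>(data[i]) * data[i];
    return total;
}
```

The missed optimization discussed here is the mirror image: instead of delaying frame *setup* past a branch, it would hoist frame *teardown* above one.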


BTW, there's another missed optimization here: x86-64 direct jmp rel32 has the same range as jcc rel32, so it could be using JCC as a tailcall (GCC bug 82516 has a short reply from Richard Biener which sheds some light on why GCC misses it.)

# Hand-written with both optimizations applied
h:
    push    rax         # Align the stack
    call    pred()@PLT
    pop     rcx         # Clean up the stack, restoring the stack
                        # pointer to its initial state
    test    al, al
    je      g()@PLT
    jmp     f()@PLT     # Tailcall f

If you compile with -fno-plt, you instead get jmp qword ptr [rip + f()@GOTPCREL] # TAILCALL rather than jmp f()@PLT, and that indirect jump has no JCC equivalent.

(If f and g are present at link time, the linker can relax the indirect jump to addr32 jmp f, with the 67h addr32 prefix byte just filling space to make it the same length as the indirect jump it's replacing. That's just like how it can relax jumps/calls to f@PLT into plain f when the target is statically linked, not in a different shared object.)

Stuff like __attribute__((visibility("hidden"))) could tell the compiler f and g are internal to the thing it's compiling, either the main executable or a shared library, so it can use direct calls in the first place.
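A minimal sketch of that (hypothetical names, assuming GCC/Clang on an ELF target):

```cpp
#include <cassert>

// With hidden visibility, the compiler knows these symbols can't be
// interposed from another shared object, so calls to them can be direct
// rather than going through a PLT stub.
__attribute__((visibility("hidden"))) int f_impl() { return 1; }
__attribute__((visibility("hidden"))) int g_impl() { return 2; }

int dispatch(bool p) { return p ? f_impl() : g_impl(); }
```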

answered Sep 15 '25 06:09 by Peter Cordes

I cannot speak for GCC as I don't have the personal experience or the tools to easily inspect GCC's optimization passes.

As for Clang: it's important to understand that the overwhelming majority of optimizations occur in the middle-end of the compiler, i.e. within LLVM IR, irrespective of platform specifics and calling conventions. The LLVM IR at the end of all optimization passes looks as follows:

define dso_local void @h() local_unnamed_addr {
entry:
  %call = tail call noundef zeroext i1 @pred()
  br i1 %call, label %if.then, label %if.else

if.then:                                          ; preds = %entry
  tail call void @f()
  ret void

if.else:                                          ; preds = %entry
  tail call void @g()
  ret void
}

This IR is as optimal as it gets at the end of the middle-end.

The POP instructions are generated by the Prologue/Epilogue Insertion & Frame Finalization (prologepilog) pass (see this Compiler Explorer example) for both calls. Note that the push and pop instructions are merely used to align the stack; the use of rax is arbitrary.
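That alignment-only role is easy to confirm with a leaf variant (hypothetical, not from the question): the SysV x86-64 ABI only requires 16-byte stack alignment at call instructions, so a function that makes no calls gets no push/pop at all.

```cpp
// Hypothetical leaf variant of h: with no call inside, there is nothing
// to re-align the stack for, and compilers emit no frame setup at all.
int h_leaf(int x) { return 2 * x + 1; }
```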

While it's possible in principle to eliminate common instructions between multiple branches, I guess LLVM developers don't consider it worth the effort. This makes sense; this late into the optimization pipeline, you're scraping the bottom of the barrel optimizations-wise, and only eliminating a few unnecessary instructions related to stack alignment.

answered Sep 15 '25 06:09 by Jan Schultke