Why do compilers insist on using a callee-saved register here?

Consider this C code:

void foo(void);

long bar(long x) {
    foo();
    return x;
}

When I compile it on GCC 9.3 with either -O3 or -Os, I get this:

bar:
        push    r12
        mov     r12, rdi
        call    foo
        mov     rax, r12
        pop     r12
        ret

The output from clang is identical except for choosing rbx instead of r12 as the callee-saved register.

However, I want/expect to see assembly that looks more like this:

bar:
        push    rdi
        call    foo
        pop     rax
        ret

Since you have to push something to the stack anyway, it seems shorter, simpler, and probably faster to just push your value there, instead of pushing some arbitrary callee-saved register's value there and then storing your value in that register. Ditto for the inverse after call foo when you're putting things back.
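Counting encoded bytes backs that up:

        push    rdi             # 1 byte
        call    foo             # 5 bytes
        pop     rax             # 1 byte
        ret                     # 1 byte

versus 2 bytes each for push r12 and pop r12 (REX prefix) plus 3 bytes each for the two mov instructions: 10 bytes of overhead instead of 2.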

Is my assembly wrong? Is it somehow less efficient than messing with an extra register? If the answer to both of those is "no", then why doesn't either GCC or clang do it this way?

Godbolt link.


Edit: Here's a less trivial example, to show it happens even if the variable is meaningfully used:

long foo(long);

long bar(long x) {
    return foo(x * x) - x;
}

I get this:

bar:
        push    rbx
        mov     rbx, rdi
        imul    rdi, rdi
        call    foo
        sub     rax, rbx
        pop     rbx
        ret

I'd rather have this:

bar:
        push    rdi
        imul    rdi, rdi
        call    foo
        pop     rdi
        sub     rax, rdi
        ret

This time, it's only one instruction off vs. two, but the core concept is the same.

Godbolt link.

Asked Apr 22 '20 by Joseph Sible-Reinstate Monica



1 Answer

TL:DR:

  • Compiler internals are probably not set up to look for this optimization easily, and it's probably only useful around small functions, not inside large functions between calls.
  • Inlining to create large functions is a better solution most of the time
  • There can be a latency vs. throughput tradeoff if foo happens not to save/restore RBX.

Compilers are complex pieces of machinery. They're not "smart" like a human, and expensive algorithms to find every possible optimization are often not worth the cost in extra compile time.

I reported this as GCC bug 69986 - smaller code possible with -Os by using push/pop to spill/reload back in 2016; there's been no activity or replies from GCC devs. :/

Slightly related: GCC bug 70408 - reusing the same call-preserved register would give smaller code in some cases - compiler devs told me it would take a huge amount of work for GCC to be able to do that optimization because it requires picking order of evaluation of two foo(int) calls based on what would make the target asm simpler.
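The shape of code that bug covers looks roughly like this (my reconstruction, not quoted from the report):

int foo(int);

int bar(int a, int b) {
    // A single call-preserved register could carry b across the first
    // call and then foo(a)'s result across the second; per the report,
    // GCC would need to choose the evaluation order of the two calls
    // based on which order makes the resulting asm simpler.
    return foo(a) + foo(b);
}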


If foo doesn't save/restore rbx itself, there's a tradeoff between throughput (instruction count) vs. an extra store/reload latency on the x -> retval dependency chain.
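Concretely, the push/pop version puts a store-forwarding round trip on that chain (cycle counts are ballpark figures for a Skylake-class core, my assumption):

        push    rdi             # store x to the stack
        call    foo
        pop     rax             # reload: ~4-5 cycles of store-forwarding latency
        ret

versus mov r12, rdi / mov rax, r12 in the compiler's version: register-register moves with ~0-1 cycle latency each (often eliminated at register rename), at the cost of two extra instructions plus the push/pop of r12.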

Compilers usually favour latency over throughput, e.g. using 2x LEA instead of imul reg, reg, 10 (3-cycle latency, 1/clock throughput), because most code averages significantly less than 4 uops / clock on typical 4-wide pipelines like Skylake. (More instructions/uops do take more space in the ROB, reducing how far ahead the same out-of-order window can see, though, and execution is actually bursty with stalls probably accounting for some of the less-than-4 uops/clock average.)
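For example, here's the multiply-by-10 tradeoff in Intel syntax (a minimal sketch):

        imul    rax, rdi, 10            # 1 uop, 3-cycle latency
versus:
        lea     rax, [rdi+rdi*4]        # rax = x*5, 1-cycle latency
        lea     rax, [rax+rax]          # rax = x*10, 1 more cycle

Two uops instead of one, but the result is ready a cycle sooner.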

If foo does push/pop RBX, then there's not much to gain for latency. Having the restore happen just before the ret instead of just after the call is probably not relevant, unless there's a ret mispredict or I-cache miss that delays fetching code at the return address.

Most non-trivial functions will save/restore RBX, so it's often not a good assumption that leaving a variable in RBX will actually mean it truly stayed in a register across the call. (Although randomizing which call-preserved registers functions choose might be a good idea to mitigate this sometimes.)


So yes push rdi / pop rax would be more efficient in this case, and this is probably a missed optimization for tiny non-leaf functions, depending on what foo does and the balance between extra store/reload latency for x vs. more instructions to save/restore the caller's rbx.

It is possible for stack-unwind metadata to represent the changes to RSP here, just like if it had used sub rsp, 8 to spill/reload x into a stack slot. (But compilers don't know this optimization either: using push to reserve space and initialize a variable at the same time. See What C/C++ compiler can use push pop instructions for creating local variables, instead of just increasing esp once? And doing that for more than one local var would lead to larger .eh_frame stack-unwind metadata, because each push moves the stack pointer separately. That doesn't stop compilers from using push/pop to save/restore call-preserved regs, though.)
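A minimal sketch of that missed trick, spilling one 8-byte value:

        sub     rsp, 8
        mov     [rsp], rdi      # reserve, then initialize: two instructions
versus:
        push    rdi             # reserve and initialize in one 1-byte instruction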


IDK if it would be worth teaching compilers to look for this optimization

It's maybe a good idea around a whole function, not across one call inside a function. And as I said, it rests on the pessimistic assumption that foo will save/restore RBX anyway. (Or on optimizing for throughput, if you know that latency from x to the return value isn't important. But compilers don't know that, and usually optimize for latency.)

If you start making that pessimistic assumption in lots of code (like around single function calls inside functions), you'll start getting more cases where RBX isn't saved/restored and you could have taken advantage.

You also don't want this extra save/restore push/pop in a loop; just save/restore RBX outside the loop and use call-preserved registers in loops that make function calls. Even without loops, most functions make multiple function calls in the general case. This optimization idea could apply if you really don't use x between any of the calls, only before the first and after the last. Otherwise, popping after one call and pushing again before the next creates a problem of maintaining 16-byte stack alignment for each call, as the sketch below shows.
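Here's the shape of that churn with two calls (foo2 is hypothetical, and the first call's result is ignored for brevity):

bar:
        push    rdi             # at entry RSP%16 == 8, so now RSP%16 == 0
        call    foo             # ok: stack is 16-byte aligned at this call
        pop     rdi             # back to RSP%16 == 8...
        push    rdi             # ...so you must re-push before the next call
        call    foo2
        pop     rdi
        sub     rax, rdi
        ret

If x isn't needed between the calls, one push before the first call and one pop after the last avoids the churn entirely.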

Compilers are not great at tiny functions in general, and calling tiny functions isn't great for CPUs either. Non-inline function calls have an impact on optimization at the best of times, unless compilers can see the internals of the callee and make more assumptions than usual. A non-inline function call is an implicit memory barrier: the caller has to assume that the function might read or write any globally-accessible data, so all such vars have to be in sync with the C abstract machine. (Escape analysis allows keeping locals in registers across calls if their address hasn't escaped the function.) The compiler also has to assume that the call-clobbered registers are all clobbered. This sucks for floating point in x86-64 System V, which has no call-preserved XMM registers.
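A small C illustration of that barrier effect (opaque and global_state are hypothetical names):

void opaque(void);
long global_state;

long demo(long x) {
    long local = x;       // address never escapes: may live in a register
    global_state = x;     // must actually be stored before the call
    opaque();             // may read or write any global, incl. global_state
    return local + global_state;  // global_state must be reloaded;
                                  // local doesn't have to be
}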

Tiny functions like bar() are better off inlining into their callers. Compile with -flto so this can happen even across file boundaries in most cases. (Function pointers and shared-library boundaries can defeat this.)
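For example, if foo's body is visible to the optimizer (same file, or cross-file with -flto), even the second example collapses with no save/restore at all. A sketch, with a hypothetical one-line foo:

static long foo(long v) { return v + 1; }    // hypothetical tiny callee

long bar(long x) { return foo(x * x) - x; }  // inlines to x*x + 1 - x:
                                             // pure arithmetic, no call,
                                             // no push/pop of anything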


I think one reason compilers haven't bothered to try to do these optimizations is that it would require a whole bunch of different code in the compiler internals, different from the normal stack vs. register-allocation code that knows how to save call-preserved registers and use them.

i.e. it would be a lot of work to implement, and a lot of code to maintain, and if it gets over-enthusiastic about doing this it could make worse code.

And also that it's (hopefully) not significant; if it matters, you should be inlining bar into its caller, or inlining foo into bar. This is fine unless there are a lot of different bar-like functions and foo is large, and for some reason they can't inline into their callers.

Answered by Peter Cordes