In the assembly of the C++ source below. Why is RAX pushed to the stack?
RAX, as I understand it from the ABI could contain anything from the calling function. But we save it here, and then later move the stack back by 8 bytes. So the RAX on the stack is, I think only relevant for the std::__throw_bad_function_call()
operation ... ?
The code:-
#include <functional> void f(std::function<void()> a) { a(); }
Output, from gcc.godbolt.org
, using Clang 3.7.1 -O3:
f(std::function<void ()>): # @f(std::function<void ()>) push rax cmp qword ptr [rdi + 16], 0 je .LBB0_1 add rsp, 8 jmp qword ptr [rdi + 24] # TAILCALL .LBB0_1: call std::__throw_bad_function_call()
I'm sure the reason is obvious, but I'm struggling to figure it out.
Here's a tailcall without the std::function<void()>
wrapper for comparison:
void g(void(*a)()) { a(); }
The trivial:
g(void (*)()): # @g(void (*)()) jmp rdi # TAILCALL
"pop" retrieves the last value pushed from the stack. Everything you push, you MUST pop again at some point afterwards, or your code will crash almost immediately!
The PUSH instruction saves the current PRINT, USING, or ACONTROL status in push-down storage on a last-in, first-out basis. You restore this PRINT, USING, or ACONTROL status later, also on a last-in, first-out basis, by using a POP instruction.
"push" stores a constant or 64-bit register out onto the stack. The 64-bit registers are the ones like "rax" or "r8", not the 32-bit registers like "eax" or "r8d".
The RAX register is used for return values in functions regardless of whether you're working with Objective-C or Swift.
The 64-bit ABI requires that the stack is aligned to 16 bytes before a call
instruction.
call
pushes an 8-byte return address on the stack, which breaks the alignment, so the compiler needs to do something to align the stack again to a multiple of 16 before the next call
.
(The ABI design choice of requiring alignment before a call
instead of after has the minor advantage that if any args were passed on the stack, this choice makes the first arg 16B-aligned.)
Pushing a don't-care value works well, and can be more efficient than sub rsp, 8
on CPUs with a stack engine. (See the comments).
The reason push rax
is there is to align the stack back to a 16-byte boundary to conform to the 64-bit System V ABI in the case where je .LBB0_1
branch is taken. The value placed on the stack isn't relevant. Another way would have been subtracting 8 from RSP with sub rsp, 8
. The ABI states the alignment this way:
The end of the input argument area shall be aligned on a 16 (32, if __m256 is passed on stack) byte boundary. In other words, the value (%rsp + 8) is always a multiple of 16 (32) when control is transferred to the function entry point. The stack pointer, %rsp, always points to the end of the latest allocated stack frame.
Prior to the call to function f
the stack was 16-byte aligned per the calling convention. After control was transferred via a CALL to f
the return address was placed on the stack misaligning the stack by 8. push rax
is a simple way of subtracting 8 from RSP and realigning it again. If the branch is taken to call std::__throw_bad_function_call()
the stack will be properly aligned for that call to work.
In the case where the comparison falls through, the stack will appear just as it did at function entry once the add rsp, 8
instruction is executed. The return address of the CALLER to function f
will now be back at the top of the stack and the stack will be misaligned by 8 again. This is what we want because a TAIL CALL is being made with jmp qword ptr [rdi + 24]
to transfer control to the function a
. This will JMP to the function not CALL it. When function a
does a RET it will return directly back to the function that called f
.
At a higher optimization level I would have expected that the compiler should be smart enough to do the comparison, and let it fall through directly to the JMP. What is at label .LBB0_1
could then align the stack to a 16-byte boundary so that call std::__throw_bad_function_call()
works properly.
As @CodyGray pointed out, if you use GCC (not CLANG) with optimization level of -O2
or higher, the code produced does seem more reasonable. GCC 6.1 output from Godbolt is:
f(std::function<void ()>): cmp QWORD PTR [rdi+16], 0 # MEM[(bool (*<T5fc5>) (union _Any_data &, const union _Any_data &, _Manager_operation) *)a_2(D) + 16B], je .L7 #, jmp [QWORD PTR [rdi+24]] # MEM[(const struct function *)a_2(D)]._M_invoker .L7: sub rsp, 8 #, call std::__throw_bad_function_call() #
This code is more in line with what I would have expected. In this case it would appear that GCC's optimizer may handle this code generation better than CLANG.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With