Why does this function push RAX to the stack as the first operation?

Tags: c++, x86, assembly, x86-64, abi

In the assembly generated from the C++ source below, why is RAX pushed to the stack?

RAX, as I understand it from the ABI, could contain anything from the calling function. But we save it here, and then later move the stack back by 8 bytes. So the RAX on the stack is, I think, only relevant for the std::__throw_bad_function_call() operation?

The code:

    #include <functional>

    void f(std::function<void()> a)
    {
      a();
    }

Output, from gcc.godbolt.org, using Clang 3.7.1 -O3:

    f(std::function<void ()>):                  # @f(std::function<void ()>)
            push    rax
            cmp     qword ptr [rdi + 16], 0
            je      .LBB0_1
            add     rsp, 8
            jmp     qword ptr [rdi + 24]    # TAILCALL
    .LBB0_1:
            call    std::__throw_bad_function_call()

I'm sure the reason is obvious, but I'm struggling to figure it out.

Here's a tailcall without the std::function<void()> wrapper for comparison:

    void g(void(*a)())
    {
      a();
    }

The trivial:

    g(void (*)()):             # @g(void (*)())
            jmp     rdi        # TAILCALL

814

asked Jun 12 '16 11:06

JCx

2 Answers

The 64-bit ABI requires that the stack is aligned to 16 bytes before a call instruction.

call pushes an 8-byte return address on the stack, which breaks the alignment, so the compiler needs to do something to align the stack again to a multiple of 16 before the next call.

(The ABI design choice of requiring alignment before a call instead of after has the minor advantage that if any args were passed on the stack, this choice makes the first arg 16B-aligned.)

Pushing a don't-care value works well, and can be more efficient than sub rsp, 8 on CPUs with a stack engine.

181

answered Sep 24 '22 02:09

BeniBela

The reason push rax is there is to align the stack back to a 16-byte boundary, to conform to the 64-bit System V ABI in the case where the je .LBB0_1 branch is taken. The value placed on the stack isn't relevant. Another way would have been to subtract 8 from RSP with sub rsp, 8. The ABI states the alignment requirement this way:

The end of the input argument area shall be aligned on a 16 (32, if __m256 is passed on stack) byte boundary. In other words, the value (%rsp + 8) is always a multiple of 16 (32) when control is transferred to the function entry point. The stack pointer, %rsp, always points to the end of the latest allocated stack frame.

Prior to the call to function f, the stack was 16-byte aligned per the calling convention. After control was transferred via a CALL to f, the return address was placed on the stack, misaligning the stack by 8. push rax is a simple way of subtracting 8 from RSP and realigning it again. If the branch to call std::__throw_bad_function_call() is taken, the stack will be properly aligned for that call to work.

In the case where the comparison falls through, the stack will appear just as it did at function entry once the add rsp, 8 instruction is executed. The return address of the caller of function f will now be back at the top of the stack, and the stack will be misaligned by 8 again. This is what we want, because a tail call is being made with jmp qword ptr [rdi + 24] to transfer control to the function a. This will JMP to the function rather than CALL it. When function a does a RET, it will return directly to the function that called f.
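The bookkeeping above can be checked with a small compile-time model. This is only a sketch: the helper names (after_call, after_push, after_add8, aligned16) and the sample value of caller_rsp are assumptions made for illustration, and the model tracks nothing but the numeric value of RSP.

```cpp
#include <cstdint>

// Hypothetical helpers: they model only the value of RSP, not stack memory.
constexpr std::uint64_t after_call(std::uint64_t rsp) { return rsp - 8; }  // CALL pushes an 8-byte return address
constexpr std::uint64_t after_push(std::uint64_t rsp) { return rsp - 8; }  // push rax
constexpr std::uint64_t after_add8(std::uint64_t rsp) { return rsp + 8; }  // add rsp, 8
constexpr bool aligned16(std::uint64_t rsp) { return rsp % 16 == 0; }

// Assumed sample value; any multiple of 16 behaves the same way.
constexpr std::uint64_t caller_rsp = 0x7ffc00001000ULL;         // in the caller, just before CALL f
constexpr std::uint64_t entry_rsp  = after_call(caller_rsp);    // on entry to f
constexpr std::uint64_t body_rsp   = after_push(entry_rsp);     // after push rax
constexpr std::uint64_t tail_rsp   = after_add8(body_rsp);      // just before the tail-call JMP

static_assert(aligned16(caller_rsp), "caller aligns the stack before CALL");
static_assert(!aligned16(entry_rsp), "the return address misaligns the stack by 8");
static_assert(aligned16(body_rsp), "push rax restores 16-byte alignment for the throw path");
static_assert(tail_rsp == entry_rsp, "add rsp, 8 restores the entry state for the tail call");
```

If the file compiles, each step of the walk-through holds for the assumed starting value: the callee reached by the JMP sees exactly the stack state it would have seen had it been called directly by f's caller.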

At a higher optimization level I would have expected the compiler to be smart enough to do the comparison and let it fall through directly to the JMP. The code at label .LBB0_1 could then align the stack to a 16-byte boundary so that call std::__throw_bad_function_call() works properly.

As @CodyGray pointed out, if you use GCC (not Clang) with an optimization level of -O2 or higher, the code produced does seem more reasonable. GCC 6.1 output from Godbolt is:

    f(std::function<void ()>):
            cmp     QWORD PTR [rdi+16], 0     # MEM[(bool (*<T5fc5>) (union _Any_data &, const union _Any_data &, _Manager_operation) *)a_2(D) + 16B],
            je      .L7 #,
            jmp     [QWORD PTR [rdi+24]]      # MEM[(const struct function *)a_2(D)]._M_invoker
    .L7:
            sub     rsp, 8    #,
            call    std::__throw_bad_function_call()        #

This code is more in line with what I would have expected. In this case it would appear that GCC's optimizer may handle this code generation better than Clang's.
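For context on what the cmp/je pair in both listings is guarding, here is a minimal C++ sketch of the two paths at the source level. It relies only on standard behavior: invoking an empty std::function throws std::bad_function_call. The helper throws_bad_function_call is a hypothetical name introduced for illustration and is not part of the original question.

```cpp
#include <functional>
#include <utility>

// Same function as in the question.
void f(std::function<void()> a) { a(); }

// Hypothetical helper: returns true if calling f with the given argument
// ends in std::bad_function_call (the path that needs an aligned stack for
// the CALL), and false if the stored callable simply runs (the tail-call path).
bool throws_bad_function_call(std::function<void()> a) {
    try {
        f(std::move(a));
    } catch (const std::bad_function_call&) {
        return true;
    }
    return false;
}
```

Passing a default-constructed std::function<void()> should take the throwing path, while passing any callable such as an empty lambda should take the tail-call path.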

answered Sep 22 '22 02:09

Michael Petch
