The x86-64 ABI specifies two return registers: rax
and rdx
, both 64-bits (8 bytes) in size.
Assuming that x86-64 is the only targeted platform, which of these two functions:
uint64_t f(uint64_t * const secondReturnValue) {
/* Calculate a and b. */
*secondReturnValue = b;
return a;
}
std::pair<uint64_t, uint64_t> g() {
/* Calculate a and b, same as in f() above. */
return { a, b };
}
would yield better performance, given the current state of C/C++ compilers targeting x86-64? Are there any pitfalls performance-wise using one or the other version? Are compilers (GCC, Clang) always able to optimize the std::pair
to be returned in rax
and rdx
?
UPDATE: Generally, returning a pair is faster if the compiler optimizes out the std::pair
methods (examples of binary output with GCC 5.3.0 and Clang 3.8.0). If f()
is not inlined, the compiler must generate code to write a value to memory, e.g:
movq b, (%rdi)
movq a, %rax
retq
But in case of g()
it suffices for the compiler to do:
movq a, %rax
movq b, %rdx
retq
Because instructions for writing values to memory are generally slower than instructions for writing values to registers, the second version should be faster.
GCC has a range of optimization levels, plus individual options to enable or disable particular optimizations. The overall compiler optimization level is controlled by the command line option -On, where n is the required optimization level, as follows: -O0 . (default).
The compiler optimizes to reduce the size of the binary instead of execution speed. If you do not specify an optimization option, gcc attempts to reduce the compilation time and to make debugging always yield the result expected from reading the source code.
-O3 option turns on more expensive optimizations, such as function inlining, in addition to all the optimizations of the lower levels '-O2' and '-O1'. The '-O3' optimization level may increase the speed of the resulting executable, but can also increase its size.
-O2 GCC performs nearly all supported optimizations that do not involve a space-speed tradeoff. The compiler does not perform loop unrolling or function inlining when you specify. -O3 Full optimization as in -O2; also uses more aggressive automatic inlining of subprograms within a unit and attempts to vectorize loops.
Since the ABI specifies that in some particular cases two registers have to be used for the 2-word result any conforming compiler has to obey that rule.
However, for such tiny functions I guess that most of the performance will come from inlining.
You may want to compile and link with g++ -flto -O2
using link-time optimizations.
I guess that the second function (returning a pair thru 2 registers) might be slightly faster, and that perhaps in some situations the GCC compiler could inline and optimize the first into the second.
But you really should benchmark if you care that much.
Note that the ABI specifies packing any small struct into registers for passing/returning (if it contains only integer types). This means that returning a std::pair<uint32_t, uint32_t>
means the values have to be shift+ORed into rax
.
This is probably still better than a round trip through memory, because setting up space for a pointer, and passing that pointer as an extra arg, has some overhead. (Other than that, though, a round-trip through L1 cache is pretty cheap, like ~5c latency. The store/load are almost certainly going to hit in L1 cache, because stack memory is used all the time. Even if it misses, store-forwarding can still happen, so execution doesn't stall until the ROB fills because the store can't retire. See Agner Fog's microarch guide and other stuff at the x86 tag wiki.)
Anyway, here's the kind of code you get from gcc 5.3 -O2
, using functions that take args instead of returning compile-time constant values (which would lead to movabs rax, 0x...
):
#include <cstdint>
#include <utility>
#define type_t uint32_t
type_t f(type_t * const secondReturnValue, type_t x) {
*secondReturnValue = x+4;
return x+2;
}
lea eax, [rsi+4] # LEA is an add-and-shift instruction that uses memory-operand syntax and encoding
mov DWORD PTR [rdi], eax
lea eax, [rsi+2]
ret
std::pair<type_t, type_t> g(type_t x) { return {x+2, x+4}; }
lea eax, [rdi+4]
lea edx, [rdi+2]
sal rax, 32
or rax, rdx
ret
type_t use_pair(std::pair<type_t, type_t> pair) {
return pair.second + pair.first;
}
mov rax, rdi
shr rax, 32
add eax, edi
ret
So it's really not bad at all. Two or three insns in the caller and callee to pack and unpack a pair of uint32_t
values. Nowhere near as good as returning a pair of uint64_t
values, though.
If you're specifically optimizing for x86-64, and care what happens for non-inlined functions with multiple return values, then prefer returning std::pair<uint64_t, uint64_t>
(or int64_t
, obviously), even if you assign those pairs to narrower integers in the caller. Note that in the x32 ABI (-mx32
), pointers are only 32bits. Don't assume pointers are 64bit when optimizing for x86-64, if you care about that ABI.
If either member of the pair is 64bit, they use separate registers. It doesn't do anything stupid like splitting one value between the high half of one reg and the low half of another.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With