
How to optimize function return values in C and C++ on x86-64?

The x86-64 ABI specifies two return registers: rax and rdx, each 64 bits (8 bytes) in size.

Assuming that x86-64 is the only targeted platform, which of these two functions:

uint64_t f(uint64_t * const secondReturnValue) {
    /* Calculate a and b. */
    *secondReturnValue = b;
    return a;
}

std::pair<uint64_t, uint64_t> g() {
    /* Calculate a and b, same as in f() above. */
    return { a, b };
}

would yield better performance, given the current state of C/C++ compilers targeting x86-64? Are there any pitfalls performance-wise using one or the other version? Are compilers (GCC, Clang) always able to optimize the std::pair to be returned in rax and rdx?

UPDATE: Generally, returning a pair is faster if the compiler optimizes out the std::pair methods (examples of binary output with GCC 5.3.0 and Clang 3.8.0). If f() is not inlined, the compiler must generate code to write one of the values to memory, e.g.:

movq b, (%rdi)
movq a, %rax
retq

But in case of g() it suffices for the compiler to do:

movq a, %rax
movq b, %rdx
retq

Because instructions for writing values to memory are generally slower than instructions for writing values to registers, the second version should be faster.
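To make the call-site difference concrete, here is a hedged caller-side sketch (the consume_f/consume_g names are made up for illustration, assuming the declarations of f() and g() above): the out-parameter version forces the caller to reserve a memory location, while the pair version can stay entirely in registers.

#include <cstdint>
#include <utility>

// Declarations matching the two interfaces above.
uint64_t f(uint64_t * const secondReturnValue);
std::pair<uint64_t, uint64_t> g();

// Hypothetical callers, for illustration only.
uint64_t consume_f() {
    uint64_t second;              // caller must reserve memory for the out-parameter
    uint64_t first = f(&second);  // first arrives in %rax, second via the pointer
    return first + second;
}

uint64_t consume_g() {
    std::pair<uint64_t, uint64_t> p = g();  // both values can arrive in %rax and %rdx
    return p.first + p.second;
}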

asked Aug 19 '14 by jotik



2 Answers

Since the ABI specifies that in some particular cases two registers have to be used for the two-word result, any conforming compiler has to obey that rule.

However, for such tiny functions I guess that most of the performance benefit will come from inlining.

You may want to compile and link with g++ -flto -O2 to enable link-time optimization.

I guess that the second function (returning a pair through two registers) might be slightly faster, and that perhaps in some situations the GCC compiler could inline and optimize the first into the second.

But you really should benchmark if you care that much.
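As a starting point, here is a minimal benchmark sketch (an assumption-laden toy, not part of the answer: the bodies below merely stand in for the question's f()/g(), and GCC/Clang's __attribute__((noinline)) keeps the calls out of line so the return convention actually matters):

#include <chrono>
#include <cstdint>
#include <cstdio>
#include <utility>

// Toy stand-ins for the question's f()/g(); noinline keeps the
// return-value convention visible instead of being inlined away.
__attribute__((noinline)) uint64_t f(uint64_t *second, uint64_t x) {
    *second = x * 3;
    return x * 2;
}

__attribute__((noinline)) std::pair<uint64_t, uint64_t> g(uint64_t x) {
    return { x * 2, x * 3 };
}

int main() {
    const uint64_t iterations = 100000000;
    uint64_t sink = 0;  // accumulate results so the calls aren't optimized out

    auto t0 = std::chrono::steady_clock::now();
    for (uint64_t i = 0; i < iterations; ++i) {
        uint64_t b;
        sink += f(&b, i) + b;
    }
    auto t1 = std::chrono::steady_clock::now();
    for (uint64_t i = 0; i < iterations; ++i) {
        std::pair<uint64_t, uint64_t> p = g(i);
        sink += p.first + p.second;
    }
    auto t2 = std::chrono::steady_clock::now();

    using ms = std::chrono::milliseconds;
    std::printf("f: %lld ms, g: %lld ms, sink=%llu\n",
                (long long)std::chrono::duration_cast<ms>(t1 - t0).count(),
                (long long)std::chrono::duration_cast<ms>(t2 - t1).count(),
                (unsigned long long)sink);
}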

answered by Basile Starynkevitch


Note that the ABI specifies packing any small struct into registers for passing/returning (if it contains only integer types). As a result, returning a std::pair<uint32_t, uint32_t> requires the two values to be shifted and ORed together into rax.

This is probably still better than a round trip through memory, because setting up space for a pointer and passing that pointer as an extra arg has some overhead. (Other than that, though, a round trip through L1 cache is pretty cheap, around 5 cycles of latency. The store/load pair is almost certainly going to hit in L1 cache, because stack memory is used all the time. Even if it misses, store-forwarding can still happen, so execution doesn't stall until the re-order buffer (ROB) fills because the store can't retire. See Agner Fog's microarch guide and other material at the x86 tag wiki.)

Anyway, here's the kind of code you get from gcc 5.3 -O2, using functions that take args instead of returning compile-time constant values (which would lead to movabs rax, 0x...):

#include <cstdint>
#include <utility>
#define type_t uint32_t

type_t f(type_t * const secondReturnValue, type_t x) {
    *secondReturnValue = x+4;
    return x+2;
}
    lea     eax, [rsi+4]           # LEA is an add-and-shift instruction that uses memory-operand syntax and encoding
    mov     DWORD PTR [rdi], eax
    lea     eax, [rsi+2]
    ret

std::pair<type_t, type_t> g(type_t x) { return {x+2, x+4}; }
    lea     eax, [rdi+4]
    lea     edx, [rdi+2]
    sal     rax, 32
    or      rax, rdx
    ret

type_t use_pair(std::pair<type_t, type_t> pair) {
    return pair.second + pair.first;
}
    mov     rax, rdi
    shr     rax, 32
    add     eax, edi
    ret

So it's really not bad at all. Two or three insns in the caller and callee to pack and unpack a pair of uint32_t values. Nowhere near as good as returning a pair of uint64_t values, though.

If you're specifically optimizing for x86-64, and care what happens for non-inlined functions with multiple return values, then prefer returning std::pair<uint64_t, uint64_t> (or int64_t, obviously), even if you assign those pairs to narrower integers in the caller. Note that in the x32 ABI (-mx32), pointers are only 32 bits. Don't assume pointers are 64-bit when optimizing for x86-64, if you care about that ABI.

If either member of the pair is 64 bits, the two members use separate registers. The compiler doesn't do anything stupid like splitting one value between the high half of one register and the low half of another.
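As a hedged illustration of that point (the h/use_h names are made up, and the register comments follow the ABI behavior described above rather than reproduced compiler output):

#include <cstdint>
#include <utility>

// A pair whose first member is 64-bit: the two members come back in
// separate registers (rax and rdx/edx) instead of being packed into one.
std::pair<uint64_t, uint32_t> h(uint64_t x) {
    return { x + 2, static_cast<uint32_t>(x) + 4 };
}

// The caller can still assign the halves to narrower integers;
// the truncation happens in registers after the call.
uint32_t use_h(uint64_t x) {
    std::pair<uint64_t, uint32_t> p = h(x);
    return static_cast<uint32_t>(p.first) + p.second;
}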

answered by Peter Cordes