Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why does GCC fail to optimize unless the return value has a name?

Consider this code:

#include <array>

class C
{
    std::array<char, 7> a{};
    int b{};
};

C slow()
{
    return {};
}

C fast()
{
    C c;
    return c;
}

GCC 6 through 9 produce very bloated code for slow():

slow():
        xor     eax, eax
        mov     DWORD PTR [rsp-25], 0
        mov     BYTE PTR [rsp-21], 0
        mov     edx, DWORD PTR [rsp-24]
        mov     DWORD PTR [rsp-32], 0
        mov     WORD PTR [rsp-28], ax
        mov     BYTE PTR [rsp-26], 0
        mov     rax, QWORD PTR [rsp-32]
        ret
fast():
        xor     eax, eax
        xor     edx, edx
        ret

Is there a difference in meaning between the two functions? Clang emits code like fast() for both, while GCC 4-5 do a better job than 6-9, but not quite optimal either.

Build flags: -std=c++11 -O3

Demo: https://godbolt.org/z/rPNG9o


Submitted as a GCC bug based on the feedback here: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90883

like image 703
John Zwinck Avatar asked Jun 13 '19 10:06

John Zwinck


People also ask

How do I know if GCC is not optimized?

Compiler specific pragma gcc provides pragma GCC as a way to control temporarily the compiler behavior. By using pragma GCC optimize("O0") , the optimization level can be set to zero, which means absolutely no optimize for gcc.

What is default GCC optimization level?

GCC has a range of optimization levels, plus individual options to enable or disable particular optimizations. The overall compiler optimization level is controlled by the command line option -On, where n is the required optimization level, as follows: -O0 . (default).

Why should code optimization be controlled by a compile time flag?

Turning on optimization flags makes the compiler attempt to improve the performance and/or code size at the expense of compilation time and possibly the ability to debug the program. The compiler performs optimization based on the knowledge it has of the program.


3 Answers

The two functions are equivalents: the returned object (more precisely, the result object of an hypothetical call to these functions) is initialized by initializing each member using its default member initializer.

For slow:

  • The pr-value result of the call to slow is copy-inizialized with {} as the initializer (stmt.return)
  • so the resulting object is list-initialized ([dcl.init]/17.1);
  • which lead us to aggregate-initializion ([dcl.init.list]/3.4)

=> so all members of the result object of a call to slow are initialized with their default member initializer dcl.init.aggr]/5.4.

For fast:

  • First we assume that copy elision is performed [class.copy.elision]/1.1
  • so the result object of the function call is default initialized [dcl.init]/12
  • so the implicitly declared default constructor is called [dcl.init]/7.1.
  • this default constructor is equivalent to C(){} ([class.default.ctor]/4)

=> so all members of the result object of a call to slow are initialized with their default member initializer [class.base.init]/9.1

The resulting assemblies of the two function are functionally equivalent. So the assembly produced by Gcc is standard compliant.

In the case of slow, the assembly is just sub-optimal. The object is returned accordingly to the SystemV x86 abi on two registers: rax and rdx (edx). First it zeroes an object conceptualy of class C on the stack at addresses [rsp-32]. It zeroes a the padding byte between a and b and b. Then it copies this initialized part of the stack on the registers. The way it zeroes the stack is just suboptimal, and whatsoever all these operations are equivalent to the 2 xor operations of fast assembly. So that is just an obvious bug.

like image 60
Oliv Avatar answered Oct 12 '22 00:10

Oliv


This is not really the complete answer but it might give a clue. As I suspected there is a subtle difference in meaning to fast and slow which probably sends the compiler down different paths. You can see this if you make the copy constructor private.

https://godbolt.org/z/FMIRe3

#include <array>

class C
{
    std::array<char, 7> a{};

    public:
    C(){}

    private:
    C(const C & c){}
};

// Compiles
C slow()
{
    return {};
}

// Does not compile
C fast()
{
    C c;
    return c;
}

Even with copy ellision fast still requires the copy constructor to be there where as slow is returning an initialization list which explicitly constructs the return value by the caller. These may or may not end up doing the same thing but I believe the compiler has to do some crunching to determine if this is the case.

There is a detailed blog post that gives some interesting background on this topic

https://akrzemi1.wordpress.com/2018/05/16/rvalues-redefined/

However the behaviour has changed in C++17

Whereas

#include <array>

class C
{
    std::array<char, 7> a{};

    public:
    C(){}

    private:
    C(const C & c){}
};

C slow()
{
    return {};
}

C fast()
{
    return C();
}

fast would fail to compile under C++11 it now compiles under C++17

https://godbolt.org/z/JG2PkD

The reason is that the meaning of return C() changes from returning a temporary to explicitly constructing the object in the frame of the caller.

So now in C++17 there is a big difference between

C fast(){
    C c;
    return c;
}

and

C fast(){
    return C();
}

because in the second one you don't even need a copy or move constructor to be available.

https://godbolt.org/z/i2eZnf

Definitely not C++ 101

like image 1
bradgonesurfing Avatar answered Oct 12 '22 00:10

bradgonesurfing


The GCC maintainers agreed this was a bug (missed optimization), and it's fixed in trunk for x86_64 (ARM may be fixed later): https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90883

like image 1
John Zwinck Avatar answered Oct 11 '22 22:10

John Zwinck