Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

GCC fails to optimize aligned std::array like C array

Here's some code which GCC 6 and 7 fail to optimize when using std::array:

#include <array>

static constexpr size_t my_elements = 8;

class Foo
{
public:
#ifdef C_ARRAY
    typedef double Vec[my_elements] alignas(32);
#else
    typedef std::array<double, my_elements> Vec alignas(32);
#endif
    void fun1(const Vec&);
    Vec v1{{}};
};

void Foo::fun1(const Vec& __restrict__ v2)
{
    for (unsigned i = 0; i < my_elements; ++i)
    {
        v1[i] += v2[i];
    }
}

Compiling the above with g++ -std=c++14 -O3 -march=haswell -S -DC_ARRAY produces nice code:

    vmovapd ymm0, YMMWORD PTR [rdi]
    vaddpd  ymm0, ymm0, YMMWORD PTR [rsi]
    vmovapd YMMWORD PTR [rdi], ymm0
    vmovapd ymm0, YMMWORD PTR [rdi+32]
    vaddpd  ymm0, ymm0, YMMWORD PTR [rsi+32]
    vmovapd YMMWORD PTR [rdi+32], ymm0
    vzeroupper

That's basically two unrolled iterations of adding four doubles at a time via 256-bit registers. But if you compile without -DC_ARRAY, you get a huge mess starting with this:

    mov     rax, rdi
    shr     rax, 3
    neg     rax
    and     eax, 3
    je      .L7

The code generated in this case (using std::array instead of a plain C array) seems to check for alignment of the input array--even though it is specified in the typedef as aligned to 32 bytes.

It seems that GCC doesn't understand that the contents of an std::array are aligned the same as the std::array itself. This breaks the assumption that using std::array instead of C arrays does not incur a runtime cost.

Is there something simple I'm missing that would fix this? So far I came up with an ugly hack:

void Foo::fun2(const Vec& __restrict__ v2)
{
    typedef double V2 alignas(Foo::Vec);
    const V2* v2a = static_cast<const V2*>(&v2[0]);

    for (unsigned i = 0; i < my_elements; ++i)
    {
        v1[i] += v2a[i];
    }
}

Also note: if my_elements is 4 instead of 8, the problem does not occur. If you use Clang, the problem does not occur.

You can see it live here: https://godbolt.org/g/IXIOst

like image 927
John Zwinck Avatar asked Apr 27 '17 08:04

John Zwinck


People also ask

How do I enable optimization in gcc?

GCC has a range of optimization levels, plus individual options to enable or disable particular optimizations. The overall compiler optimization level is controlled by the command line option -On, where n is the required optimization level, as follows: -O0 . (default).

How do I disable gcc optimization?

Use the command-line option -O0 (-[capital o][zero]) to disable optimization, and -S to get assembly file. Look here to see more gcc command-line options.

What is gcc O2?

-O2 GCC performs nearly all supported optimizations that do not involve a space-speed tradeoff. The compiler does not perform loop unrolling or function inlining when you specify. -O3 Full optimization as in -O2; also uses more aggressive automatic inlining of subprograms within a unit and attempts to vectorize loops.

What optimization does gcc do?

The compiler optimizes to reduce the size of the binary instead of execution speed. If you do not specify an optimization option, gcc attempts to reduce the compilation time and to make debugging always yield the result expected from reading the source code.


1 Answers

Interestingly, if you replace v1[i] += v2a[i]; with v1._M_elems[i] += v2._M_elems[i]; (which is obviously not portable), gcc manages to optimize the std::array case as well as the case of the C array.

Possible interpretation: in the gcc dumps (-fdump-tree-all-all), one can see MEM[(struct FooD.25826 *)this_7(D) clique 1 base 0].v1D.25832[i_15] in the C array case, and MEM[(const value_typeD.25834 &)v2_7(D) clique 1 base 1][_1] for std::array. That is, in the second case, gcc may have forgotten that this is part of type Foo and only remembers that it is accessing a double.

This is an abstraction penalty that comes from all the inline functions one has to go through to finally see the array access. Clang still manages to vectorize nicely (even after removing alignas!). This likely means that clang vectorizes without caring about alignment, and indeed it uses instructions like vmovupd that do not require an aligned address.

The hack you found, casting to Vec, is another way to let the compiler see, when it handles the memory access, that the type being handled is aligned. For a regular std::array::operator[], the memory access happens inside a member function of std::array, which doesn't have any clue that *this happens to be aligned.

Gcc also has a builtin to let the compiler know about alignment:

const double*v2a=static_cast<const double*>(__builtin_assume_aligned(v2.data(),32));
like image 110
Marc Glisse Avatar answered Oct 08 '22 13:10

Marc Glisse