Here's some code which GCC 6 and 7 fail to optimize when using std::array:
#include <array>
#include <cstddef>

static constexpr std::size_t my_elements = 8;

class Foo
{
public:
#ifdef C_ARRAY
    typedef double Vec[my_elements] alignas(32);
#else
    typedef std::array<double, my_elements> Vec alignas(32);
#endif
    void fun1(const Vec&);
    Vec v1{{}};
};

void Foo::fun1(const Vec& __restrict__ v2)
{
    for (unsigned i = 0; i < my_elements; ++i)
    {
        v1[i] += v2[i];
    }
}
Compiling the above with g++ -std=c++14 -O3 -march=haswell -S -DC_ARRAY produces nice code:
vmovapd ymm0, YMMWORD PTR [rdi]
vaddpd ymm0, ymm0, YMMWORD PTR [rsi]
vmovapd YMMWORD PTR [rdi], ymm0
vmovapd ymm0, YMMWORD PTR [rdi+32]
vaddpd ymm0, ymm0, YMMWORD PTR [rsi+32]
vmovapd YMMWORD PTR [rdi+32], ymm0
vzeroupper
That's basically two unrolled iterations of adding four doubles at a time via 256-bit registers. But if you compile without -DC_ARRAY, you get a huge mess starting with this:
mov rax, rdi
shr rax, 3
neg rax
and eax, 3
je .L7
The code generated in this case (using std::array instead of a plain C array) seems to check the alignment of the input array at run time, even though the typedef specifies 32-byte alignment.
It seems that GCC doesn't understand that the contents of a std::array are aligned the same as the std::array itself. This breaks the assumption that using std::array instead of C arrays does not incur a runtime cost.
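For readers unfamiliar with that prologue, here is a small sketch of what the shr/neg/and sequence appears to compute. This is one plausible reading of the assembly, not something confirmed from GCC's sources: the number of leading doubles to process one by one before the pointer reaches a 32-byte boundary. The function name peel_count is hypothetical.

```cpp
#include <cstdint>

// Mirrors the prologue: shr rax, 3 ; neg rax ; and eax, 3
// addr is the address held in rdi. It is assumed to be a multiple of 8,
// as any properly aligned double* would be.
constexpr std::uint64_t peel_count(std::uint64_t addr)
{
    // (addr >> 3) is the address measured in doubles; negating it and
    // keeping the low two bits gives the distance to the next multiple
    // of four doubles, i.e. the next 32-byte boundary.
    return (0 - (addr >> 3)) & 3;
}

static_assert(peel_count(0) == 0, "already aligned: the je branch is taken");
static_assert(peel_count(8) == 3, "three doubles until the next boundary");
static_assert(peel_count(16) == 2, "two doubles until the next boundary");
static_assert(peel_count(24) == 1, "one double until the next boundary");
static_assert(peel_count(32) == 0, "already aligned: the je branch is taken");
```

If the alignment from the typedef were propagated, this value would always be zero and the whole prologue could be dropped, which is what happens in the C array case.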
Is there something simple I'm missing that would fix this? So far I came up with an ugly hack:
void Foo::fun2(const Vec& __restrict__ v2)
{
    typedef double V2 alignas(Foo::Vec);
    const V2* v2a = static_cast<const V2*>(&v2[0]);
    for (unsigned i = 0; i < my_elements; ++i)
    {
        v1[i] += v2a[i];
    }
}
Also note: if my_elements is 4 instead of 8, the problem does not occur. If you use Clang, the problem does not occur.
You can see it live here: https://godbolt.org/g/IXIOst
GCC has a range of optimization levels, plus individual options to enable or disable particular optimizations. The overall optimization level is controlled by the command-line option -On, where n is the required level. -O0 is the default.
Use -O0 (capital letter O followed by zero) to disable optimization, and -S to get an assembly file.
At -O2, GCC performs nearly all supported optimizations that do not involve a space-speed tradeoff. -O3 enables everything in -O2, uses more aggressive automatic inlining, and attempts to vectorize loops.
With -Os, the compiler optimizes to reduce the size of the binary instead of execution speed. If you do not specify an optimization option, GCC attempts to reduce compilation time and to make debugging yield the results expected from reading the source code.
Interestingly, if you replace v1[i] += v2a[i]; with v1._M_elems[i] += v2._M_elems[i]; (which is obviously not portable), gcc manages to optimize the std::array case as well as the case of the C array.
Possible interpretation: in the gcc dumps (-fdump-tree-all-all), one can see MEM[(struct FooD.25826 *)this_7(D) clique 1 base 0].v1D.25832[i_15] in the C array case, and MEM[(const value_typeD.25834 &)v2_7(D) clique 1 base 1][_1] for std::array. That is, in the second case, gcc may have forgotten that this is part of type Foo and only remembers that it is accessing a double.
This is an abstraction penalty that comes from all the inline functions one has to go through to finally see the array access. Clang still manages to vectorize nicely (even after removing alignas!). This likely means that clang vectorizes without caring about alignment, and indeed it uses instructions like vmovupd that do not require an aligned address.
The hack you found, casting to a type that carries the alignment of Vec, is another way to let the compiler see, when it handles the memory access, that the type being handled is aligned. For a regular std::array::operator[], the memory access happens inside a member function of std::array, which doesn't have any clue that *this happens to be aligned.
GCC also has a builtin to let the compiler know about alignment:
const double* v2a = static_cast<const double*>(__builtin_assume_aligned(v2.data(), 32));
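To show how that builtin could fit into the whole function, here is a minimal sketch. It is not the original Foo: the names AlignedFoo, kElements and add_assume_aligned are made up for illustration, and the alignment is requested on the member and on the caller's variable rather than on a typedef. It assumes GCC or Clang, since __builtin_assume_aligned and __restrict__ are compiler extensions. Whether it yields the same assembly as the C array version is something to check on your own compiler.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kElements = 8;  // same size as my_elements in the question
using Vec = std::array<double, kElements>;

struct AlignedFoo  // hypothetical stand-in for Foo
{
    alignas(32) Vec v1{};  // alignment requested on the member itself

    // Precondition: v2 really is 32-byte aligned and does not alias v1.
    // The builtin is a promise to the compiler, not a check; if the
    // promise is false, the behavior is undefined.
    void add_assume_aligned(const Vec& v2)
    {
        // Debug-build check of the promise made below.
        assert(reinterpret_cast<std::uintptr_t>(v2.data()) % 32 == 0);

        double* __restrict__ a =
            static_cast<double*>(__builtin_assume_aligned(v1.data(), 32));
        const double* __restrict__ b =
            static_cast<const double*>(__builtin_assume_aligned(v2.data(), 32));

        for (std::size_t i = 0; i < kElements; ++i)
        {
            a[i] += b[i];
        }
    }
};
```

The caller has to hold up its end of the bargain, for example by declaring the argument as alignas(32) Vec other{}; before passing it in.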