I'm trying to write some computationally intensive code for Windows x64 target, with SSE or the new AVX instructions, compiling in GCC 4.5.2 and 4.6.1, MinGW64 (TDM GCC build, and some custom build). My compiler options are -O3 -mavx
. (-m64
is implied)
In short, I want to perform some lengthy computation on 4 3D vectors of packed floats. That requires 4x3=12 xmm or ymm registers for storage, and 2 or 3 registers for temporary results. This should IMHO fit snugly in the 16 available SSE (or AVX) registers available for 64bit targets. However, GCC produces a very suboptimal code with register spilling, using only registers xmm0-xmm10
and shuffling data from and onto the stack. My question is:
Is there a way to convince GCC to use all the registers xmm0-xmm15
?
To fix ideas, consider the following SSE code (for illustration only):
void example(vect<__m128> q1, vect<__m128> q2, vect<__m128>& a1, vect<__m128>& a2) {
for (int i=0; i < 10; i++) {
vect<__m128> v = q2 - q1;
a1 += v;
// a2 -= v;
q2 *= _mm_set1_ps(2.);
}
}
Here vect<__m128>
is simply a struct
of 3 __m128
, with natural addition and multiplication by scalar. When the line a2 -= v
is commented out, i.e. we need only 3x3 registers for storage since we are ignoring a2
, the produced code is indeed straightforward with no moves, everything is performed in registers xmm0-xmm10
. When I remove the comment a2 -= v
, the code is pretty awful with a lot of shuffling between registers and stack. Even though the compiler could just use registers xmm11-xmm13
or something.
I actually haven't seen GCC use any of the registers xmm11-xmm15
anywhere in all my code yet. What am I doing wrong? I understand that they are callee-saved registers, but this overhead is completely justified by simplifying the loop code.
The GNU Compiler Collection, gcc, offers multiple ways to perform SIMD calculations.
SSE (streaming SIMD extensions) and AVX (advanced vector extensions) are SIMD (single instruction multiple data streams) instruction sets supported by recent CPUs manufactured in Intel and AMD.
Two points:
So, if you want better register allocation, you basically have two options:
Actually, what you see aren't spills, it is gcc operating on a1 and a2 in memory because it can't know if they are aliased. If you declare the last two parameters as vect<__m128>& __restrict__
GCC can and will register allocate a1 and a2.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With