
How to force gcc to use all SSE (or AVX) registers?

I'm trying to write some computationally intensive code for a Windows x64 target, using SSE or the new AVX instructions, compiled with GCC 4.5.2 and 4.6.1 under MinGW64 (the TDM GCC build and a custom build). My compiler options are -O3 -mavx (-m64 is implied).

In short, I want to perform some lengthy computation on four 3D vectors of packed floats. That requires 4x3=12 xmm or ymm registers for storage, plus 2 or 3 registers for temporary results. This should IMHO fit snugly in the 16 SSE (or AVX) registers available on 64-bit targets. However, GCC produces very suboptimal code with register spilling, using only registers xmm0-xmm10 and shuffling data to and from the stack. My question is:

Is there a way to convince GCC to use all the registers xmm0-xmm15?

To fix ideas, consider the following SSE code (for illustration only):

void example(vect<__m128> q1, vect<__m128> q2, vect<__m128>& a1, vect<__m128>& a2) {
    for (int i=0; i < 10; i++) {
        vect<__m128> v = q2 - q1;
        a1 += v;
//      a2 -= v;

        q2 *= _mm_set1_ps(2.);
    }
}

Here vect<__m128> is simply a struct of 3 __m128, with natural addition and multiplication by a scalar. When the line a2 -= v is commented out, i.e. we need only 3x3 registers for storage because a2 is ignored, the generated code is straightforward with no moves, and everything is performed in registers xmm0-xmm10. When I uncomment a2 -= v, the code is pretty awful, with a lot of shuffling between registers and the stack, even though the compiler could just use registers xmm11-xmm13 or something.
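For readers who want to reproduce this, here is a minimal sketch of what such a struct might look like. The original definition is not shown in the question, so everything below is an assumption: it relies on the GCC/Clang vector extensions that let +, - and * act element-wise on __m128, and newer GCC versions may warn about ignored attributes on the template argument.

```cpp
#include <xmmintrin.h>  // __m128, _mm_set1_ps

// Hypothetical reconstruction of vect<T>: three packed components with
// element-wise arithmetic. Assumes GCC/Clang vector-extension operators.
template <typename T>
struct vect {
    T x, y, z;

    vect& operator+=(const vect& o) {
        x = x + o.x; y = y + o.y; z = z + o.z;
        return *this;
    }
    vect& operator-=(const vect& o) {
        x = x - o.x; y = y - o.y; z = z - o.z;
        return *this;
    }
    // Multiply every component by the same packed scalar s.
    vect& operator*=(T s) {
        x = x * s; y = y * s; z = z * s;
        return *this;
    }
};

// Binary forms built on the compound assignments above.
template <typename T>
vect<T> operator+(vect<T> a, const vect<T>& b) { a += b; return a; }

template <typename T>
vect<T> operator-(vect<T> a, const vect<T>& b) { a -= b; return a; }
```

With a definition along these lines, the example function above compiles as written, and the same template should also accept __m256 when building with -mavx.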

I haven't actually seen GCC use any of the registers xmm11-xmm15 anywhere in my code yet. What am I doing wrong? I understand that they are callee-saved registers, but that overhead is completely justified by the simpler loop code.


Two points:

First, you're making a lot of assumptions. Register spilling is pretty cheap on x86 CPUs (due to fast L1 caches, register shadowing and other tricks), and the 64-bit-only registers are more costly to access (they need larger instruction encodings), so GCC's version may well be as fast as, or faster than, the one you want.
Second, GCC, like any compiler, does the best register allocation it can. There's no "please do better register allocation" option, because if there were, it would always be enabled. The compiler isn't trying to spite you. (Register allocation is an NP-complete problem, as I recall, so the compiler will never be able to generate a perfect solution. The best it can do is approximate.)

So, if you want better register allocation, you basically have two options:

write a better register allocator, and patch it into GCC, or
bypass GCC and rewrite the function in assembly, so you can control exactly which registers are used when.
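As a rough illustration of the second option, the sketch below uses GCC extended inline assembly rather than a full assembly rewrite, which is a lighter-weight substitute for what the answer describes. The helper name add_via_xmm11 is hypothetical, and the code assumes an x86-64 target with AT&T syntax. It pins one addition to xmm11 and lists that register as clobbered, so the compiler takes care of saving and restoring it where the calling convention requires.

```cpp
#include <xmmintrin.h>  // __m128, _mm_set1_ps

// Hypothetical helper: adds two packed-float vectors, forcing the work
// through xmm11 by naming the register explicitly in the asm template.
static inline __m128 add_via_xmm11(__m128 a, __m128 b) {
    __m128 out;
    __asm__ (
        "movaps %1, %%xmm11\n\t"   // xmm11 = a
        "addps  %2, %%xmm11\n\t"   // xmm11 += b
        "movaps %%xmm11, %0"       // out = xmm11
        : "=x"(out)                // output: any SSE register
        : "x"(a), "x"(b)           // inputs: any SSE registers
        : "xmm11"                  // tell GCC that xmm11 is overwritten
    );
    return out;
}
```

Hard-coding registers like this takes the decision away from the allocator entirely, so it is only worth doing after measuring that the compiler's own choice is actually slower.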

Actually, what you see aren't spills. GCC is operating on a1 and a2 in memory because it can't know whether they alias each other. If you declare the last two parameters as vect<__m128>& __restrict__, GCC can and will register-allocate a1 and a2.
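A minimal sketch of that change is shown here. The vec3 struct and the helper functions are hypothetical stand-ins for the question's vect<__m128> and its operators, written with plain intrinsics so the block stands alone. Only the __restrict__ qualifiers on the two reference parameters come from the answer.

```cpp
#include <xmmintrin.h>  // __m128, _mm_add_ps, _mm_sub_ps, _mm_mul_ps

// Hypothetical stand-in for vect<__m128>: three packed-float components.
struct vec3 {
    __m128 x, y, z;
};

static inline vec3 sub(const vec3& a, const vec3& b) {
    return { _mm_sub_ps(a.x, b.x), _mm_sub_ps(a.y, b.y), _mm_sub_ps(a.z, b.z) };
}

static inline void add_to(vec3& acc, const vec3& v) {
    acc.x = _mm_add_ps(acc.x, v.x);
    acc.y = _mm_add_ps(acc.y, v.y);
    acc.z = _mm_add_ps(acc.z, v.z);
}

static inline void sub_from(vec3& acc, const vec3& v) {
    acc.x = _mm_sub_ps(acc.x, v.x);
    acc.y = _mm_sub_ps(acc.y, v.y);
    acc.z = _mm_sub_ps(acc.z, v.z);
}

static inline void scale(vec3& a, __m128 s) {
    a.x = _mm_mul_ps(a.x, s);
    a.y = _mm_mul_ps(a.y, s);
    a.z = _mm_mul_ps(a.z, s);
}

// __restrict__ promises that a1 and a2 refer to distinct objects, so a
// store through one cannot change the other and both can stay in registers.
void example_restrict(vec3 q1, vec3 q2,
                      vec3& __restrict__ a1, vec3& __restrict__ a2) {
    const __m128 two = _mm_set1_ps(2.0f);
    for (int i = 0; i < 10; i++) {
        vec3 v = sub(q2, q1);
        add_to(a1, v);
        sub_from(a2, v);
        scale(q2, two);
    }
}
```

The promise is the caller's responsibility: passing the same object for a1 and a2 would be undefined behavior. Compiling with -S and comparing the loop body with and without the qualifier is a quick way to check whether the memory traffic actually disappears on a given GCC version.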

Tags:

gcc

avx

64-bit

sse

register-allocation

Norbert P.

2 Answers

jalf

user511824
