Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Helping GCC with auto-vectorisation

I have a shader I need to optimise (with lots of vector operations) and I am experimenting with SSE instructions in order to better understand the problem.

I have some very simple sample code. With the USE_SSE define it uses explicit SSE intrinsics; without it I'm hoping GCC will do the work for me. Auto-vectorisation feels a bit finicky but I'm hoping it will save me some hair.

Compiler and platform is: gcc 4.7.1 (tdm64), target x86_64-w64-mingw32 and Windows 7 on Ivy Bridge.

Here's the test code:

/*
    Include all the SIMD intrinsics.
*/
#ifdef USE_SSE
#include <x86intrin.h>
#endif
#include <cstdio>

#if   defined(__GNUG__) || defined(__clang__) 
    /* GCC & CLANG */

    #define SSVEC_FINLINE __attribute__((always_inline))

#elif defined(_WIN32) && defined(MSC_VER) 
    /* MSVC. */

    #define SSVEC_FINLINE __forceinline

#else
#error Unsupported platform.
#endif


#ifdef USE_SSE

    typedef __m128 vec4f;

    inline void addvec4f(vec4f &a, vec4f const &b)
    {
        a = _mm_add_ps(a, b);
    }

#else

    typedef float vec4f[4];

    inline void addvec4f(vec4f &a, vec4f const &b)
    {
        a[0] = a[0] + b[0];
        a[1] = a[1] + b[1];
        a[2] = a[2] + b[2];
        a[3] = a[3] + b[3];
    }

#endif

int main(int argc, char *argv[])
{
    int const count = 1e7;

    #ifdef USE_SSE
    printf("Using SSE.\n");
    #else
    printf("Not using SSE.\n");
    #endif

    vec4f data = {1.0f, 1.0f, 1.0f, 1.0f};

    for (int i = 0; i < count; ++i)
    {
        vec4f val = {0.1f, 0.1f, 0.1f, 0.1f};
        addvec4f(data, val);
    }

    float result[4] = {0};
    #ifdef USE_SSE
    _mm_store_ps(result, data);
    #else
    result[0] = data[0];
    result[1] = data[1];
    result[2] = data[2];
    result[3] = data[3];
    #endif

    printf("Result: %f %f %f %f\n", result[0], result[1], result[2], result[3]);

    return 0;
}

This is compiled with:

g++ -O3 ssetest.cpp -o nossetest.exe
g++ -O3 -DUSE_SSE ssetest.cpp -o ssetest.exe

Apart from the explicit SSE-version being a bit quicker there is no difference in output.

Here's the assembly for the loop, first explicit SSE:

.L3:
subl    $1, %eax
addps   %xmm1, %xmm0
jne .L3

It inlined the call. Nice, more or less just a straight up _mm_add_ps.

Array version:

.L3:
subl    $1, %eax
addss   %xmm0, %xmm1
addss   %xmm0, %xmm2
addss   %xmm0, %xmm3
addss   %xmm0, %xmm4
jne .L3

It is using SSE math alright, but on each array member. Not really desirable.

My question is, how can I help GCC so that it can better optimise the array version of vec4f?

Any Linux specific tips is helpful too, that's where the real code will run.

like image 512
Skurmedel Avatar asked Dec 15 '22 12:12

Skurmedel


1 Answers

This LockLess article on Auto-vectorization with gcc 4.7 is hands down the best article I have ever seen and I have spent a while looking for good articles on similar topics. They also have a lot of other articles that you may find very useful on similar subjects dealing all manners of low level software development.

like image 189
Shafik Yaghmour Avatar answered Dec 28 '22 23:12

Shafik Yaghmour