I have a shader I need to optimise (with lots of vector operations) and I am experimenting with SSE instructions in order to better understand the problem.
I have some very simple sample code. With the USE_SSE
define it uses explicit SSE intrinsics; without it I'm hoping GCC will do the work for me. Auto-vectorisation feels a bit finicky but I'm hoping it will save me some hair.
Compiler and platform is: gcc 4.7.1 (tdm64), target x86_64-w64-mingw32 and Windows 7 on Ivy Bridge.
Here's the test code:
/*
Include all the SIMD intrinsics.
*/
#ifdef USE_SSE
#include <x86intrin.h>
#endif
#include <cstdio>
#if defined(__GNUG__) || defined(__clang__)
/* GCC & CLANG */
#define SSVEC_FINLINE __attribute__((always_inline))
#elif defined(_WIN32) && defined(MSC_VER)
/* MSVC. */
#define SSVEC_FINLINE __forceinline
#else
#error Unsupported platform.
#endif
#ifdef USE_SSE
typedef __m128 vec4f;
inline void addvec4f(vec4f &a, vec4f const &b)
{
a = _mm_add_ps(a, b);
}
#else
typedef float vec4f[4];
inline void addvec4f(vec4f &a, vec4f const &b)
{
a[0] = a[0] + b[0];
a[1] = a[1] + b[1];
a[2] = a[2] + b[2];
a[3] = a[3] + b[3];
}
#endif
int main(int argc, char *argv[])
{
int const count = 1e7;
#ifdef USE_SSE
printf("Using SSE.\n");
#else
printf("Not using SSE.\n");
#endif
vec4f data = {1.0f, 1.0f, 1.0f, 1.0f};
for (int i = 0; i < count; ++i)
{
vec4f val = {0.1f, 0.1f, 0.1f, 0.1f};
addvec4f(data, val);
}
float result[4] = {0};
#ifdef USE_SSE
_mm_store_ps(result, data);
#else
result[0] = data[0];
result[1] = data[1];
result[2] = data[2];
result[3] = data[3];
#endif
printf("Result: %f %f %f %f\n", result[0], result[1], result[2], result[3]);
return 0;
}
This is compiled with:
g++ -O3 ssetest.cpp -o nossetest.exe
g++ -O3 -DUSE_SSE ssetest.cpp -o ssetest.exe
Apart from the explicit SSE-version being a bit quicker there is no difference in output.
Here's the assembly for the loop, first explicit SSE:
.L3:
subl $1, %eax
addps %xmm1, %xmm0
jne .L3
It inlined the call. Nice, more or less just a straight up _mm_add_ps
.
Array version:
.L3:
subl $1, %eax
addss %xmm0, %xmm1
addss %xmm0, %xmm2
addss %xmm0, %xmm3
addss %xmm0, %xmm4
jne .L3
It is using SSE math alright, but on each array member. Not really desirable.
My question is, how can I help GCC so that it can better optimise the array version of vec4f
?
Any Linux specific tips is helpful too, that's where the real code will run.
This LockLess
article on Auto-vectorization with gcc 4.7 is hands down the best article I have ever seen and I have spent a while looking for good articles on similar topics. They also have a lot of other articles that you may find very useful on similar subjects dealing all manners of low level software development.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With