I've been running into a very subtle issue with SSE. Here is the case: I want to optimise my ray tracer with SSE so that I can get a basic feel for how much SSE can improve performance.
I'd like to start with this very function.
Vector3f Add( const Vector3f& v0 , Vector3f& v1 );
(Actually I tried to optimise CrossProduct first; Add is shown here for simplicity, and I know it is not the bottleneck of my ray tracer.)
Here is a part of the definition of the struct:
struct Vector3f
{
    union
    {
        struct { float x; float y; float z; float reserved; };
        __m128 data;
    };
    // ... rest of the definition omitted
};
The issue is that with this declaration the SSE registers get flushed to memory: the compiler is not smart enough to keep the values in SSE registers for further use. The following declaration avoids the flushing.
__m128 Add( __m128 v0_data, __m128 v1_data );
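A minimal sketch of what that register-friendly declaration could look like when implemented. The body and the ToArray helper are assumptions for illustration (they are not from the original code); only the signature of Add comes from the post.

```cpp
#include <xmmintrin.h>  // SSE intrinsics: __m128, _mm_add_ps, _mm_storeu_ps

// Operands and result are passed by value as __m128, so the compiler is
// free to keep them in XMM registers instead of going through memory.
inline __m128 Add(__m128 v0_data, __m128 v1_data)
{
    return _mm_add_ps(v0_data, v1_data);  // lane-wise addition of 4 floats
}

// Hypothetical helper, only for inspecting results: copies the four
// lanes into a plain float array (unaligned store, so any pointer works).
inline void ToArray(__m128 v, float out[4])
{
    _mm_storeu_ps(out, v);
}
```

Calling Add(_mm_setr_ps(1, 2, 3, 0), _mm_setr_ps(10, 20, 30, 0)) adds the lanes pairwise; the fourth lane plays the role of the reserved member.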
I can go this way in this case, but it would be an ugly design for Matrix, which holds four __m128 values. And you can't have operators that work on Vector3f itself, only on its data :(.
The most disturbing thing is that you have to change your higher-level code everywhere to adapt to this. Optimising through SSE this way is definitely not an option for something as large as a big game engine: you would have to change a huge amount of code before it works.
If the SSE register flushing is not avoided, I guess the benefit of SSE will be eaten up by those useless flush instructions, which renders SSE pointless.
It seems that a union is a bad thing to use here. As soon as a compiler sees __m128 placed in a union with something else, it has trouble understanding when the values need to be updated, which leads to excessive memory operations.
MSVC is not the worst performing compiler in this situation. Just check the code generated by GCC 5.1.0: on my machine it runs 12 times slower than the code generated by MSVC2013 (which does spill registers), and 20+ times slower than the optimal code.
It is interesting that most compilers start doing silly things only when you actually use the x, y, z members to access your data. For instance, MSVC2013 spills registers only when you read them via the scalar members after a computation (I guess to make sure these members are up to date). The terrible behavior of GCC seen above disappears if you set the initial values with _mm_setr_ps instead of writing them directly into the members.
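To make the two initialisation styles concrete, here is a small sketch. Vec3Union, MakeViaMembers and MakeViaSetr are hypothetical names for illustration. The anonymous struct inside the union is a compiler extension (accepted by GCC, Clang and MSVC), and whether the two styles actually produce different machine code depends on the compiler and version.

```cpp
#include <xmmintrin.h>  // __m128, _mm_setr_ps

// Hypothetical stand-in for the union-based vector from the question.
struct Vec3Union
{
    union
    {
        struct { float x, y, z, reserved; };  // anonymous struct: compiler extension
        __m128 data;
    };
};

// Style 1: write each scalar member. This is the pattern that
// reportedly triggers poor code generation on some compilers.
inline Vec3Union MakeViaMembers(float x, float y, float z)
{
    Vec3Union v;
    v.x = x; v.y = y; v.z = z; v.reserved = 0.0f;
    return v;
}

// Style 2: build the whole register at once with _mm_setr_ps, which
// takes its arguments in memory order (lane 0 first).
inline Vec3Union MakeViaSetr(float x, float y, float z)
{
    Vec3Union v;
    v.data = _mm_setr_ps(x, y, z, 0.0f);
    return v;
}
```

Both functions produce the same four lanes; only the way the value is constructed differs.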
It is better to avoid unions in this case. It seems that the OP has come to the same decision (see the current Vector3fv code). Making it harder to access a single coordinate has a good "psychological" performance effect: a person would think twice before writing scalar code. You can easily write setters/getters either with extract/insert intrinsics (which makes the compiler generate those instructions), or with simple pointer arithmetic (which lets the compiler choose how to do it):
float getX() const { return ((float*)&data)[0]; }
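For comparison with the pointer-arithmetic getter above, here is a sketch of the intrinsic-based variant, restricted to SSE1 instructions (shuffle plus _mm_cvtss_f32) rather than the SSE4.1 extract/insert intrinsics, so it builds without extra compiler flags. The struct name Vec3NoUnion and the choice of which accessors to show are assumptions for illustration.

```cpp
#include <xmmintrin.h>  // __m128, _mm_shuffle_ps, _mm_cvtss_f32, _mm_move_ss, _mm_set_ss

// Hypothetical union-free vector: the only data member is the __m128.
struct Vec3NoUnion
{
    __m128 data;

    // Lane 0 can be read directly.
    float getX() const { return _mm_cvtss_f32(data); }

    // For other lanes, broadcast the wanted lane into lane 0 first.
    float getY() const
    {
        return _mm_cvtss_f32(_mm_shuffle_ps(data, data, _MM_SHUFFLE(1, 1, 1, 1)));
    }
    float getZ() const
    {
        return _mm_cvtss_f32(_mm_shuffle_ps(data, data, _MM_SHUFFLE(2, 2, 2, 2)));
    }

    // Replace lane 0 and keep lanes 1..3 unchanged.
    void setX(float v) { data = _mm_move_ss(data, _mm_set_ss(v)); }
};
```

A value can be created with aggregate initialisation, for example Vec3NoUnion v{ _mm_setr_ps(1, 2, 3, 0) }. Setters for the other lanes need more shuffling with SSE1 alone, which fits the point above: scalar access becomes deliberately inconvenient.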
When I remove the union and simply use __m128, the generated code becomes better on all compilers. However, MSVC2013 still emits unnecessary moves: one useless register move per arithmetic operation. I suppose this is an inefficiency in the compiler's inlining algorithm. You can remove these moves in MSVC2013 by declaring all your functions as __vectorcall. Note that using this new calling convention also lets you avoid register spilling when your SIMD functions have not been inlined at all.
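A sketch of how the calling convention could be applied while keeping operators on the vector type itself. The VEC_CALL macro and the Vec3 struct are hypothetical names; __vectorcall is only applied when building with an MSVC-compatible compiler and expands to nothing elsewhere.

```cpp
#include <xmmintrin.h>  // __m128, _mm_add_ps

// Hypothetical portability macro: use __vectorcall under MSVC-compatible
// compilers, fall back to the default calling convention elsewhere.
#if defined(_MSC_VER)
#define VEC_CALL __vectorcall
#else
#define VEC_CALL
#endif

// Hypothetical union-free vector wrapping a single __m128.
struct Vec3
{
    __m128 data;
};

// Arguments are taken by value so that, with __vectorcall, they can be
// passed in XMM registers even if the call is not inlined.
inline Vec3 VEC_CALL operator+(Vec3 a, Vec3 b)
{
    Vec3 r;
    r.data = _mm_add_ps(a.data, b.data);
    return r;
}
```

This keeps the higher-level code in the form a + b on the vector type rather than on raw __m128 values, which was the design concern raised in the question.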