I'm looking for the most efficient method of flipping the sign on all four floats packed in an SSE register.
I have not found an intrinsic for this in the Intel architecture software developer's manual. Below are the things I've already tried.
For each case I ran the code in a loop 10 billion times and got the wall-clock time indicated. I'm trying to at least match the 4 seconds taken by my non-SIMD approach, which just uses the unary minus operator.
[48 sec] _mm_sub_ps( _mm_setzero_ps(), vec );
[32 sec] _mm_mul_ps( _mm_set1_ps( -1.0f ), vec );
[9 sec] union NegativeMask { int intRep; float fltRep; } negMask; negMask.intRep = 0x80000000; _mm_xor_ps( _mm_set1_ps( negMask.fltRep ), vec );
The compiler is gcc 4.2 with -O3. The CPU is an Intel Core 2 Duo.
The union is not really needed: -0.f already has only the sign bit set, so it can serve as the XOR mask directly. This gives the best of all worlds (readability, speed and portability):
_mm_xor_ps(vec, _mm_set1_ps(-0.f))
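For anyone who wants to try this outside the benchmark loop, here is a minimal sketch of the XOR approach wrapped in small functions. The names negate_ps and negate4 are made up for illustration, and the unaligned load/store is only there so plain float arrays can be passed in; it is not a statement about what is fastest on a given CPU.

```c
#include <xmmintrin.h>  /* SSE intrinsics: __m128, _mm_xor_ps, _mm_set1_ps, ... */

/* Hypothetical helper: flips the sign bit of all four packed floats. */
static inline __m128 negate_ps(__m128 vec)
{
    /* -0.0f has the bit pattern 0x80000000, i.e. only the sign bit is set,
       so XOR-ing with it toggles the sign and leaves the other bits alone. */
    const __m128 sign_mask = _mm_set1_ps(-0.0f);
    return _mm_xor_ps(vec, sign_mask);
}

/* Hypothetical wrapper for plain arrays; uses unaligned loads and stores
   so the caller does not need 16-byte aligned storage. */
static void negate4(const float in[4], float out[4])
{
    _mm_storeu_ps(out, negate_ps(_mm_loadu_ps(in)));
}
```

One difference from the subtraction variant: XOR turns +0.0f into -0.0f, whereas 0 - (+0.0f) stays +0.0f under the default rounding mode. The two compare equal, so this only matters if the sign of zero is inspected.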