I need to write a dot product using SSE2 (no _mm_dp_ps nor _mm_hadd_ps):
#include <xmmintrin.h>

inline __m128 sse_dot4(__m128 a, __m128 b)
{
    const __m128 mult  = _mm_mul_ps(a, b);
    const __m128 shuf1 = _mm_shuffle_ps(mult, mult, _MM_SHUFFLE(0, 3, 2, 1));
    const __m128 shuf2 = _mm_shuffle_ps(mult, mult, _MM_SHUFFLE(1, 0, 3, 2));
    const __m128 shuf3 = _mm_shuffle_ps(mult, mult, _MM_SHUFFLE(2, 1, 0, 3));
    return _mm_add_ss(_mm_add_ss(_mm_add_ss(mult, shuf1), shuf2), shuf3);
}
but when I looked at the assembler generated by gcc 4.9 (experimental) with -O3, I get:
mulps %xmm1, %xmm0
movaps %xmm0, %xmm3 // these three copies
movaps %xmm0, %xmm2 // seem to have no use,
movaps %xmm0, %xmm1 // don't they?
shufps $57, %xmm0, %xmm3
shufps $78, %xmm0, %xmm2
shufps $147, %xmm0, %xmm1
addss %xmm3, %xmm0
addss %xmm2, %xmm0
addss %xmm1, %xmm0
ret
I am wondering why gcc copies xmm0 into xmm1, xmm2, and xmm3... Here is the code I get using the flag -march=native (it looks better):
vmulps %xmm1, %xmm0, %xmm1
vshufps $78, %xmm1, %xmm1, %xmm2
vshufps $57, %xmm1, %xmm1, %xmm3
vshufps $147, %xmm1, %xmm1, %xmm0
vaddss %xmm3, %xmm1, %xmm1
vaddss %xmm2, %xmm1, %xmm1
vaddss %xmm0, %xmm1, %xmm0
ret
Here's a dot product using only original SSE instructions, that also swizzles the result across each element:
inline __m128 sse_dot4(__m128 v0, __m128 v1)
{
    v0 = _mm_mul_ps(v0, v1);
    v1 = _mm_shuffle_ps(v0, v0, _MM_SHUFFLE(2, 3, 0, 1));
    v0 = _mm_add_ps(v0, v1);
    v1 = _mm_shuffle_ps(v0, v0, _MM_SHUFFLE(0, 1, 2, 3));
    v0 = _mm_add_ps(v0, v1);
    return v0;
}
It's 5 SIMD instructions (as opposed to 7), though with no real opportunity to hide latencies. Any element will hold the result, e.g., float f = _mm_cvtss_f32(sse_dot4(a, b));
The haddps instruction has pretty awful latency. With SSE3:
inline __m128 sse_dot4(__m128 v0, __m128 v1)
{
    v0 = _mm_mul_ps(v0, v1);
    v0 = _mm_hadd_ps(v0, v0);
    v0 = _mm_hadd_ps(v0, v0);
    return v0;
}
This is possibly slower, though it's only 3 SIMD instructions. If you can do more than one dot product at a time, you could interleave instructions in the first case. Shuffle is very fast on more recent micro-architectures.
The first listing you pasted is for SSE architectures only. Most SSE instructions support only the two-operand syntax: instructions are of the form a = a OP b.

In your code, a is mult (xmm0 in your example). So if no copy were made and mult were passed directly, its value would be overwritten and lost for the remaining _mm_shuffle_ps instructions.

By passing -march=native in the second listing, you enabled AVX instructions. AVX enables SSE instructions to use the three-operand syntax: c = a OP b. In this case, none of the source operands has to be overwritten, so you do not need the additional copies.