I am trying to improve this code with the SSE4 dot product but I am having a hard time finding a solution. This function gets the parameters qi and tj which contain float arrays with 80 cell each and then calculate the dot product. The return value is a vector with four dot products. So I what I'm trying to do is calculating four dot products of twenty values in parallel.
Have you any idea how to improve this code?
inline __m128 ScalarProd20Vec(__m128* qi, __m128* tj)
{
__m128 res=_mm_add_ps(_mm_mul_ps(tj[0],qi[0]),_mm_mul_ps(tj[1],qi[1]));
res=_mm_add_ps(res,_mm_add_ps(_mm_mul_ps(tj[2],qi[2]),_mm_mul_ps(tj[3],qi[3])));
res=_mm_add_ps(res,_mm_add_ps(_mm_mul_ps(tj[4],qi[4]),_mm_mul_ps(tj[5],qi[5])));
res=_mm_add_ps(res,_mm_add_ps(_mm_mul_ps(tj[6],qi[6]),_mm_mul_ps(tj[7],qi[7])));
res=_mm_add_ps(res,_mm_add_ps(_mm_mul_ps(tj[8],qi[8]),_mm_mul_ps(tj[9],qi[9])));
res=_mm_add_ps(res,_mm_add_ps(_mm_mul_ps(tj[10],qi[10]),_mm_mul_ps(tj[11],qi[11])));
res=_mm_add_ps(res,_mm_add_ps(_mm_mul_ps(tj[12],qi[12]),_mm_mul_ps(tj[13],qi[13])));
res=_mm_add_ps(res,_mm_add_ps(_mm_mul_ps(tj[14],qi[14]),_mm_mul_ps(tj[15],qi[15])));
res=_mm_add_ps(res,_mm_add_ps(_mm_mul_ps(tj[16],qi[16]),_mm_mul_ps(tj[17],qi[17])));
res=_mm_add_ps(res,_mm_add_ps(_mm_mul_ps(tj[18],qi[18]),_mm_mul_ps(tj[19],qi[19])));
return res;
}
Of the hundreds of SSE examples I've seen on SO, your code is one of the few that's already in pretty good shape from the start. You don't need the SSE4 dot-product instruction. (You can do better!)
However, there is one thing you can try: (I say try because I haven't timed it yet.)
Currently you have a data-dependency chain on res
. Vector addition is 3-4 cycles on most machines today. So your code will take a minimum of 30 cycles to run since you have:
(10 additions on critical path) * (3 cycles addps latency) = 30 cycles
What you can do is to node-split the res
variable as follows:
__m128 res0 = _mm_add_ps(_mm_mul_ps(tj[ 0],qi[ 0]),_mm_mul_ps(tj[ 1],qi[ 1]));
__m128 res1 = _mm_add_ps(_mm_mul_ps(tj[ 2],qi[ 2]),_mm_mul_ps(tj[ 3],qi[ 3]));
res0 = _mm_add_ps(res0,_mm_add_ps(_mm_mul_ps(tj[ 4],qi[ 4]),_mm_mul_ps(tj[ 5],qi[ 5])));
res1 = _mm_add_ps(res1,_mm_add_ps(_mm_mul_ps(tj[ 6],qi[ 6]),_mm_mul_ps(tj[ 7],qi[ 7])));
res0 = _mm_add_ps(res0,_mm_add_ps(_mm_mul_ps(tj[ 8],qi[ 8]),_mm_mul_ps(tj[ 9],qi[ 9])));
res1 = _mm_add_ps(res1,_mm_add_ps(_mm_mul_ps(tj[10],qi[10]),_mm_mul_ps(tj[11],qi[11])));
res0 = _mm_add_ps(res0,_mm_add_ps(_mm_mul_ps(tj[12],qi[12]),_mm_mul_ps(tj[13],qi[13])));
res1 = _mm_add_ps(res1,_mm_add_ps(_mm_mul_ps(tj[14],qi[14]),_mm_mul_ps(tj[15],qi[15])));
res0 = _mm_add_ps(res0,_mm_add_ps(_mm_mul_ps(tj[16],qi[16]),_mm_mul_ps(tj[17],qi[17])));
res1 = _mm_add_ps(res1,_mm_add_ps(_mm_mul_ps(tj[18],qi[18]),_mm_mul_ps(tj[19],qi[19])));
return _mm_add_ps(res0,res1);
This almost cuts your critical path in half. Note that because of floating-point non-associativity, this optimization is illegal for compilers to do.
Here's an alternative version using 4-way node-splitting and AMD FMA4 instructions. If you can't use the fused-multiply adds, feel free to split them up. It might still be better than the first version above.
__m128 res0 = _mm_mul_ps(tj[ 0],qi[ 0]);
__m128 res1 = _mm_mul_ps(tj[ 1],qi[ 1]);
__m128 res2 = _mm_mul_ps(tj[ 2],qi[ 2]);
__m128 res3 = _mm_mul_ps(tj[ 3],qi[ 3]);
res0 = _mm_macc_ps(tj[ 4],qi[ 4],res0);
res1 = _mm_macc_ps(tj[ 5],qi[ 5],res1);
res2 = _mm_macc_ps(tj[ 6],qi[ 6],res2);
res3 = _mm_macc_ps(tj[ 7],qi[ 7],res3);
res0 = _mm_macc_ps(tj[ 8],qi[ 8],res0);
res1 = _mm_macc_ps(tj[ 9],qi[ 9],res1);
res2 = _mm_macc_ps(tj[10],qi[10],res2);
res3 = _mm_macc_ps(tj[11],qi[11],res3);
res0 = _mm_macc_ps(tj[12],qi[12],res0);
res1 = _mm_macc_ps(tj[13],qi[13],res1);
res2 = _mm_macc_ps(tj[14],qi[14],res2);
res3 = _mm_macc_ps(tj[15],qi[15],res3);
res0 = _mm_macc_ps(tj[16],qi[16],res0);
res1 = _mm_macc_ps(tj[17],qi[17],res1);
res2 = _mm_macc_ps(tj[18],qi[18],res2);
res3 = _mm_macc_ps(tj[19],qi[19],res3);
res0 = _mm_add_ps(res0,res1);
res2 = _mm_add_ps(res2,res3);
return _mm_add_ps(res0,res2);
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With