I am trying to multiply two vectors together where each element of one vector is multiplied by the element in the same index at the other vector. I then want to sum all the elements of the resulting vector to obtain one number. For instance, the calculation would look like this for the vectors {1,2,3,4} and {5,6,7,8}:
1*5 + 2*6 + 3*7 + 4*8
Essentially, I am taking the dot product of the two vectors. I know there is an SSE command to do this, but the command doesn't have an intrinsic function associated with it. At this point, I don't want to write inline assembly in my C code, so I want to use only intrinsic functions. This seems like a common calculation so I am surprised by myself that I couldn't find the answer on Google.
Note: I am optimizing for a specific micro architecture which supports up to SSE 4.2.
Given these properties and the fact that the dot product is commutative, we can expand the dot product aβ b in terms of components, aβ b=(a1i+a2j+a3k)β (b1i+b2j+b3k)=a1b1iβ i+a2b2jβ j+a3b3kβ k+(a1b2+a2b1)iβ j+(a1b3+a3b1)iβ k+(a2b3+a3b2)jβ k.
The dot product of two vectors β π’ and β π£ equals the product of their magnitudes with the cosine of the angle between them: β π’ β β π£ = β β β π’ β β β β β β π£ β β β π , c o s where π is the angle between β π’ and β π£ .
When this is the case, we use the formula below to compute the dot product. If we know the components of each vector, for example βa=[a1,a2,a3] a β = [ a 1 , a 2 , a 3 ] and βb=[b1,b2,b3] b β = [ b 1 , b 2 , b 3 ] , we can use a different method to find the dot product.
If you're doing a dot-product of longer vectors, use multiply and regular _mm_add_ps
(or FMA) inside the inner loop. Save the horizontal sum until the end.
But if you are doing a dot product of just a single pair of SIMD vectors:
GCC (at least version 4.3) includes <smmintrin.h>
with SSE4.1 level intrinsics, including the single and double-precision dot products:
_mm_dp_ps (__m128 __X, __m128 __Y, const int __M);
_mm_dp_pd (__m128d __X, __m128d __Y, const int __M);
On Intel mainstream CPUs (not Atom/Silvermont) these are somewhat faster than doing it manually with multiple instructions.
But on AMD (including Ryzen), dpps
is significantly slower. (See Agner Fog's instruction tables)
As a fallback for older processors, you can use this algorithm to create the dot product of the vectors a
and b
:
__m128 r1 = _mm_mul_ps(a, b);
and then horizontal sum r1
using Fastest way to do horizontal float vector sum on x86 (see there for a commented version of this, and why it's faster.)
__m128 shuf = _mm_shuffle_ps(r1, r1, _MM_SHUFFLE(2, 3, 0, 1));
__m128 sums = _mm_add_ps(r1, shuf);
shuf = _mm_movehl_ps(shuf, sums);
sums = _mm_add_ss(sums, shuf);
float result = _mm_cvtss_f32(sums);
A slow alternative costs 2 shuffles per hadd
, which will easily bottleneck on shuffle throughput, especially on Intel CPUs.
r2 = _mm_hadd_ps(r1, r1);
r3 = _mm_hadd_ps(r2, r2);
_mm_store_ss(&result, r3);
I'd say the fastest SSE method would be:
static inline float CalcDotProductSse(__m128 x, __m128 y) {
__m128 mulRes, shufReg, sumsReg;
mulRes = _mm_mul_ps(x, y);
// Calculates the sum of SSE Register - https://stackoverflow.com/a/35270026/195787
shufReg = _mm_movehdup_ps(mulRes); // Broadcast elements 3,1 to 2,0
sumsReg = _mm_add_ps(mulRes, shufReg);
shufReg = _mm_movehl_ps(shufReg, sumsReg); // High Half -> Low Half
sumsReg = _mm_add_ss(sumsReg, shufReg);
return _mm_cvtss_f32(sumsReg); // Result in the lower part of the SSE Register
}
I followed - Fastest Way to Do Horizontal Float Vector Sum On x86.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With