 

How to Calculate single-vector Dot Product using SSE intrinsic functions in C

I am trying to multiply two vectors together such that each element of one vector is multiplied by the element at the same index in the other vector. I then want to sum all the elements of the resulting vector to obtain a single number. For instance, the calculation for the vectors {1,2,3,4} and {5,6,7,8} would look like this:

1*5 + 2*6 + 3*7 + 4*8

Essentially, I am taking the dot product of the two vectors. I know there is an SSE instruction to do this, but the instruction doesn't have an intrinsic function associated with it. At this point, I don't want to write inline assembly in my C code, so I want to use only intrinsic functions. This seems like a common calculation, so I am surprised I couldn't find the answer on Google.

Note: I am optimizing for a specific microarchitecture that supports up to SSE 4.2.

Sam asked Nov 08 '10


2 Answers

If you're doing a dot-product of longer vectors, use multiply and regular _mm_add_ps (or FMA) inside the inner loop. Save the horizontal sum until the end.


But if you are doing a dot product of just a single pair of SIMD vectors:

GCC (at least version 4.3) includes <smmintrin.h> with SSE4.1 level intrinsics, including the single and double-precision dot products:

__m128  _mm_dp_ps (__m128 __X, __m128 __Y, const int __M);
__m128d _mm_dp_pd (__m128d __X, __m128d __Y, const int __M);

On Intel mainstream CPUs (not Atom/Silvermont) these are somewhat faster than doing it manually with multiple instructions.

But on AMD (including Ryzen), dpps is significantly slower (see Agner Fog's instruction tables).


As a fallback for older processors, you can use this algorithm to create the dot product of the vectors a and b:

__m128 r1 = _mm_mul_ps(a, b);

and then do a horizontal sum of r1 using the method from Fastest way to do horizontal float vector sum on x86 (see there for a commented version of this code, and why it's faster):

__m128 shuf   = _mm_shuffle_ps(r1, r1, _MM_SHUFFLE(2, 3, 0, 1)); // swap pairs: [1,0,3,2]
__m128 sums   = _mm_add_ps(r1, shuf);                            // pairwise partial sums
shuf          = _mm_movehl_ps(shuf, sums);                       // high half -> low half
sums          = _mm_add_ss(sums, shuf);                          // total in lane 0
float result  = _mm_cvtss_f32(sums);

A slower alternative uses haddps, which costs 2 shuffle uops per hadd and will easily bottleneck on shuffle throughput, especially on Intel CPUs:

__m128 r2 = _mm_hadd_ps(r1, r1);   // SSE3
__m128 r3 = _mm_hadd_ps(r2, r2);
float result;
_mm_store_ss(&result, r3);
caf answered Sep 19 '22

I'd say the fastest SSE method would be:

static inline float CalcDotProductSse(__m128 x, __m128 y) {
    __m128 mulRes, shufReg, sumsReg;
    mulRes = _mm_mul_ps(x, y);

    // Calculates the sum of SSE Register - https://stackoverflow.com/a/35270026/195787
    shufReg = _mm_movehdup_ps(mulRes);        // Broadcast elements 3,1 to 2,0
    sumsReg = _mm_add_ps(mulRes, shufReg);
    shufReg = _mm_movehl_ps(shufReg, sumsReg); // High Half -> Low Half
    sumsReg = _mm_add_ss(sumsReg, shufReg);
    return  _mm_cvtss_f32(sumsReg); // Result in the lower part of the SSE Register
}

I followed Fastest Way to Do Horizontal Float Vector Sum On x86.

Royi answered Sep 20 '22