How to Calculate single-vector Dot Product using SSE intrinsic functions in C

Tags:

I am trying to multiply two vectors together where each element of one vector is multiplied by the element in the same index at the other vector. I then want to sum all the elements of the resulting vector to obtain one number. For instance, the calculation would look like this for the vectors {1,2,3,4} and {5,6,7,8}:

1*5 + 2*6 + 3*7 + 4*8

Essentially, I am taking the dot product of the two vectors. I know there is an SSE command to do this, but the command doesn't have an intrinsic function associated with it. At this point, I don't want to write inline assembly in my C code, so I want to use only intrinsic functions. This seems like a common calculation so I am surprised by myself that I couldn't find the answer on Google.

Note: I am optimizing for a specific micro architecture which supports up to SSE 4.2.

990

asked Nov 08 '10 01:11

Sam

2 Answers

If you're doing a dot-product of longer vectors, use multiply and regular _mm_add_ps (or FMA) inside the inner loop. Save the horizontal sum until the end.

But if you are doing a dot product of just a single pair of SIMD vectors:

GCC (at least version 4.3) includes <smmintrin.h> with SSE4.1 level intrinsics, including the single and double-precision dot products:

_mm_dp_ps (__m128 __X, __m128 __Y, const int __M);
_mm_dp_pd (__m128d __X, __m128d __Y, const int __M);

On Intel mainstream CPUs (not Atom/Silvermont) these are somewhat faster than doing it manually with multiple instructions.

But on AMD (including Ryzen), dpps is significantly slower. (See Agner Fog's instruction tables)

As a fallback for older processors, you can use this algorithm to create the dot product of the vectors a and b:

__m128 r1 = _mm_mul_ps(a, b);

and then horizontal sum r1 using Fastest way to do horizontal float vector sum on x86 (see there for a commented version of this, and why it's faster.)

__m128 shuf   = _mm_shuffle_ps(r1, r1, _MM_SHUFFLE(2, 3, 0, 1));
__m128 sums   = _mm_add_ps(r1, shuf);
shuf          = _mm_movehl_ps(shuf, sums);
sums          = _mm_add_ss(sums, shuf);
float result =  _mm_cvtss_f32(sums);

A slow alternative costs 2 shuffles per hadd, which will easily bottleneck on shuffle throughput, especially on Intel CPUs.

r2 = _mm_hadd_ps(r1, r1);
r3 = _mm_hadd_ps(r2, r2);
_mm_store_ss(&result, r3);

166

answered Sep 19 '22 19:09

caf

I'd say the fastest SSE method would be:

static inline float CalcDotProductSse(__m128 x, __m128 y) {
    __m128 mulRes, shufReg, sumsReg;
    mulRes = _mm_mul_ps(x, y);

    // Calculates the sum of SSE Register - https://stackoverflow.com/a/35270026/195787
    shufReg = _mm_movehdup_ps(mulRes);        // Broadcast elements 3,1 to 2,0
    sumsReg = _mm_add_ps(mulRes, shufReg);
    shufReg = _mm_movehl_ps(shufReg, sumsReg); // High Half -> Low Half
    sumsReg = _mm_add_ss(sumsReg, shufReg);
    return  _mm_cvtss_f32(sumsReg); // Result in the lower part of the SSE Register
}

I followed - Fastest Way to Do Horizontal Float Vector Sum On x86.

answered Sep 20 '22 19:09

Royi

Related questions
                            
                                What does "#define STR(a) #a" do?
                            
                                Why does this makefile execute a target on 'make clean'
                            
                                How to achieve lock-free, but blocking behavior?
                            
                                How to use a C++ member function as the callback function for a C framework
                            
                                defining and iterating through array of strings in c
                            
                                TCP Connect error 115 Operation in Progress What is the Cause?
                            
                                Warning: case not evaluated in enumerated type?
                            
                                Array format for #define (C preprocessor)
                            
                                sublime text 2 build system for C programming language [closed]
                            
                                Why is discarding the volatile qualifier in a function call a warning?
                            
                                Does "int (*)[]" decay into "int **" in a function parameter?
                            
                                Fast CRC algorithm?
                            
                                Standard function to replace character or substring in a char array? [duplicate]
                            
                                What is the point of saying "#define FOO FOO" in C?
                            
                                Why does an 8-bit field have endianness?
                            
                                How to write my own printf() in C?
                            
                                How do I output my host’s IP addresses from a C program?
                            
                                How does one go about understanding GNU source code?
                            
                                multiple word string input through scanf( )
                            
                                Best debugging tool for C and C++ [closed]

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

How to Calculate single-vector Dot Product using SSE intrinsic functions in C

Tags:

c

optimization

vectorization

simd

sse

Sam

People also ask

2 Answers

caf

Royi

Recent Activity

Donate For Us