How can I add together two SSE registers

I have two SSE registers (each 128 bits wide) and I want to add them. I know how to add the corresponding words in them, for example with _mm_add_epi16 if the registers hold 16-bit words, but what I want is something like _mm_add_epi128 (which does not exist) that treats the whole register as one big word. Is there any way to perform this operation, even if it takes multiple instructions?
I was thinking about using _mm_add_epi64, detecting overflow in the right word and then adding 1 to the left word if needed, but I would also like the approach to work for 256-bit registers (AVX2), and it seems too complicated for that.

552

asked Jun 11 '14 11:06

Martinsos

1 Answer

To add two 128-bit numbers x and y to give z with SSE, you can do it like this:

z = _mm_add_epi64(x,y);
c = _mm_unpacklo_epi64(_mm_setzero_si128(), unsigned_lessthan(z,x));
z = _mm_sub_epi64(z,c);

This is based on this link how-can-i-add-and-subtract-128-bit-integers-in-c-or-c.
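The carry test in the snippet above relies on unsigned wraparound: after z = x + y, the sum overflowed exactly when z < x. Here is a minimal scalar sketch of the same idea for a single 128-bit addition. The type u128_sketch and the function add_u128_sketch are hypothetical names made up for this illustration; they are not part of any library.

```c
#include <stdint.h>

/* Hypothetical helper type: a 128-bit unsigned value as two 64-bit halves. */
typedef struct { uint64_t lo, hi; } u128_sketch;

/* Add two 128-bit values. The carry out of the low half is detected by
   checking whether the wrapped low sum is smaller than one operand. */
static u128_sketch add_u128_sketch(u128_sketch x, u128_sketch y) {
    u128_sketch z;
    z.lo = x.lo + y.lo;              /* may wrap around modulo 2^64 */
    uint64_t carry = z.lo < x.lo;    /* 1 if the low half overflowed, else 0 */
    z.hi = x.hi + y.hi + carry;      /* carry out of the high half is dropped */
    return z;
}
```

The SSE version does the same thing, except that the comparison yields an all-ones mask (that is, -1) instead of 1, which is why it subtracts the mask from the high half rather than adding it.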

The function unsigned_lessthan is defined below. It's complicated without AMD XOP (actually, I found a simpler version for SSE4.2 if XOP is not available; see the end of my answer). Probably some of the other people here can suggest a better method. Here is some code showing that this works.

#include <stdint.h>
#include <x86intrin.h>
#include <stdio.h>

// Per 64-bit lane: all ones if a < b (unsigned), otherwise all zeros.
static inline __m128i unsigned_lessthan(__m128i a, __m128i b) {
#ifdef __XOP__  // AMD XOP instruction set
    return _mm_comgt_epu64(b,a);
#else  // SSE2 instruction set
    __m128i sign32  = _mm_set1_epi32(0x80000000);          // sign bit of each dword
    __m128i bflip   = _mm_xor_si128(b,sign32);             // b with sign bits flipped
    __m128i aflip   = _mm_xor_si128(a,sign32);             // a with sign bits flipped
    __m128i equal   = _mm_cmpeq_epi32(b,a);                // b == a, dwords
    __m128i bigger  = _mm_cmpgt_epi32(bflip,aflip);        // b > a (unsigned), dwords
    __m128i biggerl = _mm_shuffle_epi32(bigger,0xA0);      // b > a, low dwords copied to high dwords
    __m128i eqbig   = _mm_and_si128(equal,biggerl);        // high part equal and low part bigger
    __m128i hibig   = _mm_or_si128(bigger,eqbig);          // high part bigger, or high part equal and low part bigger
    __m128i big     = _mm_shuffle_epi32(hibig,0xF5);       // result copied to low part
    return big;
#endif
}

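int main(void) {
    __m128i x,y,z,c;
    x = _mm_set_epi64x(3,0xffffffffffffffffll);  // x = 3*2^64 + (2^64 - 1)
    y = _mm_set_epi64x(1,0x2ll);                 // y = 1*2^64 + 2
    z = _mm_add_epi64(x,y);                      // add both halves independently
    c = _mm_unpacklo_epi64(_mm_setzero_si128(), unsigned_lessthan(z,x));  // carry mask of the low half moved to the high half
    z = _mm_sub_epi64(z,c);                      // subtracting the mask (-1) adds the carry

    uint64_t out[2];
    _mm_storeu_si128((__m128i*)out, z);
    // high half first, then low half: prints "5 1"
    printf("%llu %llu\n", (unsigned long long)out[1], (unsigned long long)out[0]);
}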

Edit:

The only potentially efficient way to add 128-bit or 256-bit numbers with SSE is with XOP. The only option with AVX would be XOP2, which does not exist yet. And even if you have XOP, it may only be efficient to add two 128-bit or 256-bit numbers in parallel (you could do four with AVX if XOP2 existed) to avoid horizontal instructions such as _mm_unpacklo_epi64.

The best solution in general is to store the registers to memory (for example, on the stack) and use scalar arithmetic. Assuming you have two 256-bit registers x4 and y4, you can add them like this:

__m256i x4, y4, z4;

uint64_t x[4], y[4], z[4];
_mm256_storeu_si256((__m256i*)x, x4);
_mm256_storeu_si256((__m256i*)y, y4);
add_u256(x,y,z);
z4 = _mm256_loadu_si256((__m256i*)z);

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

void add_u256(uint64_t x[4], uint64_t y[4], uint64_t z[4]) {
    uint64_t c1 = 0, c2 = 0, tmp;
    //add low 128-bits
    z[0] = x[0] + y[0];
    z[1] = x[1] + y[1];
    c1 += z[1]<x[1];       //carry out of x[1] + y[1]
    tmp = z[1];
    z[1] += z[0]<x[0];     //carry from limb 0 into limb 1
    c1 += z[1]<tmp;        //carry out of limb 1 caused by that increment
    //add high 128-bits + carry from low 128-bits
    z[2] = x[2] + y[2];
    c2 += z[2]<x[2];       //carry out of x[2] + y[2]
    tmp = z[2];
    z[2] += c1;
    c2 += z[2]<tmp;        //carry out of limb 2 caused by adding c1
    z[3] = x[3] + y[3] + c2;
}

int main(void) {
    uint64_t x[4], y[4], z[4];
    x[0] = -1; x[1] = -1; x[2] = 1; x[3] = 1;
    y[0] = 1; y[1] = 1; y[2] = 1; y[3] = 1;
    //z = x + y  (z3,z2,z1,z0) = (2,3,1,0)
    //x[0] = -1; x[1] = -1; x[2] = 1; x[3] = 1;
    //y[0] = 1; y[1] = 0; y[2] = 1; y[3] = 1;
    //z = x + y  (z3,z2,z1,z0) = (2,3,0,0)
    add_u256(x,y,z);
    for(int i=3; i>=0; i--) printf("%" PRIu64 " ", z[i]);
    printf("\n");
}
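The same carry chain can be written as a loop over any number of 64-bit limbs, which covers both the 128-bit and the 256-bit case. This is only a sketch: the name add_limbs and its signature are assumptions made for this illustration, and the limbs are assumed to be stored least significant first, as in add_u256 above.

```c
#include <stddef.h>
#include <stdint.h>

/* Adds two n-limb numbers stored least significant limb first.
   Writes the n-limb sum to z and returns the final carry (0 or 1).
   Hypothetical helper, not taken from any library. */
static uint64_t add_limbs(const uint64_t *x, const uint64_t *y,
                          uint64_t *z, size_t n) {
    uint64_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t sum = x[i] + y[i];   /* may wrap around */
        uint64_t c1  = sum < x[i];    /* carry from x[i] + y[i] */
        z[i] = sum + carry;           /* may wrap around */
        uint64_t c2  = z[i] < sum;    /* carry from adding the incoming carry */
        carry = c1 | c2;              /* both cannot be set at the same time */
    }
    return carry;
}
```

Calling add_limbs(x, y, z, 4) on the arrays stored from two 256-bit registers should give the same z as add_u256, and the return value additionally reports whether the full 256-bit sum overflowed.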

Edit: based on a comment by Stephen Canon at saturated-substraction-avx-or-sse4-2, I discovered that there is a more efficient way to compare unsigned 64-bit numbers with SSE4.2 if XOP is not available.

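__m128i a,b;
__m128i sign64 = _mm_set1_epi64x((long long)0x8000000000000000ULL);  // sign bit of each qword
__m128i aflip = _mm_xor_si128(a, sign64);     // a with sign bits flipped
__m128i bflip = _mm_xor_si128(b, sign64);     // b with sign bits flipped
__m128i cmp = _mm_cmpgt_epi64(aflip,bflip);   // a > b (unsigned), qwords; _mm_cmpgt_epi64 needs SSE4.2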
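To see why flipping the sign bit works, here is a scalar sketch of what happens in each 64-bit lane. XOR with the sign bit maps the unsigned range 0..2^64-1 onto the signed range in the same order, so a signed comparison of the flipped values gives the unsigned result. The function name unsigned_gt_via_signed is made up for this illustration, and the cast to int64_t assumes the usual two's-complement conversion.

```c
#include <stdint.h>

/* Unsigned a > b computed with a signed comparison only, mirroring what
   _mm_cmpgt_epi64 does per lane after the XOR with the sign bit.
   Assumes two's-complement conversion from uint64_t to int64_t. */
static int unsigned_gt_via_signed(uint64_t a, uint64_t b) {
    const uint64_t sign64 = 0x8000000000000000ULL;
    int64_t aflip = (int64_t)(a ^ sign64);   /* a with the sign bit flipped */
    int64_t bflip = (int64_t)(b ^ sign64);   /* b with the sign bit flipped */
    return aflip > bflip;                    /* 1 if a > b as unsigned, else 0 */
}
```

The intrinsic version returns an all-ones mask per lane instead of 1, but the ordering argument is the same.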

181

answered Sep 29 '22 13:09

Z boson

Tags: c++, c, intel, sse, avx2
