I'm looking to understand SSE2's capabilities a little more, and would like to know if one could make a 128-bit wide integer that supports addition, subtraction, XOR and multiplication?


Is it possible to use SSE and SSE2 to make a 128-bit wide integer?

1 Answer

SIMD is designed to operate on several small values at once, so no carry propagates from one lane into the next; you have to propagate it yourself. SSE2 has no carry flag, but the carry is easy to compute as carry = sum < a (or, equivalently, carry = sum < b), using an unsigned comparison. Worse yet, SSE2 has no 64-bit integer comparison either, and the 32-bit comparisons it does have are signed, so you need a workaround built from those.
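To make the carry rule concrete before moving to SSE2, here is a minimal scalar sketch. The `u128` struct and the `u128_add` / `u128_sub` names are hypothetical helpers introduced for illustration only. Subtraction uses the mirrored borrow rule (borrow = a < b), which is an addition to what the answer describes.

```c
#include <stdint.h>

/* Hypothetical helper type: a 128-bit unsigned integer as two 64-bit halves. */
typedef struct { uint64_t lo, hi; } u128;

/* Addition: the low half wrapped (carried) iff the result is smaller
   than one of the operands, compared as unsigned. */
static inline u128 u128_add(u128 a, u128 b)
{
    u128 r;
    r.lo = a.lo + b.lo;               /* wraps modulo 2^64 */
    uint64_t carry = r.lo < a.lo;     /* 1 if the low addition wrapped */
    r.hi = a.hi + b.hi + carry;
    return r;
}

/* Subtraction: the low half borrows iff a.lo < b.lo (unsigned). */
static inline u128 u128_sub(u128 a, u128 b)
{
    u128 r;
    uint64_t borrow = a.lo < b.lo;
    r.lo = a.lo - b.lo;               /* wraps modulo 2^64 */
    r.hi = a.hi - b.hi - borrow;
    return r;
}
```

Compilers typically turn this pattern into an add/adc (or sub/sbb) pair, which is the baseline the SSE2 version has to beat.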

Here is some untested, unoptimized C code based on the idea above:

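#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdbool.h>

/* Unsigned 64-bit "a < b" on the low 64-bit lane, built from 32-bit signed compares. */
static inline bool lessthan(__m128i a, __m128i b){
    /* Flip each 32-bit sign bit so the signed compares order the values as unsigned. */
    a = _mm_xor_si128(a, _mm_set1_epi32((int)0x80000000));
    b = _mm_xor_si128(b, _mm_set1_epi32((int)0x80000000));
    __m128i t = _mm_cmplt_epi32(a, b);
    __m128i u = _mm_cmpgt_epi32(a, b);
    /* z = (low dword less OR high dword less) AND NOT (high dword greater) */
    __m128i z = _mm_or_si128(t, _mm_shuffle_epi32(t, 177));
    z = _mm_andnot_si128(_mm_shuffle_epi32(u, 245),z);
    return _mm_cvtsi128_si32(z) & 1;
}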

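/* Adds two 128-bit integers, each held in a single SSE2 register. */
static inline __m128i addi128(__m128i a, __m128i b)
{
    __m128i sum = _mm_add_epi64(a, b);   /* two independent 64-bit adds */
    /* The low half carried iff it wrapped, i.e. sum.lo < a.lo as unsigned.
       lessthan() already flips the sign bits, so no extra mask is needed here. */
    if (lessthan(sum, a))
    {
        __m128i ONE = _mm_set_epi64x(1, 0);   /* 1 in the high 64-bit lane */
        sum = _mm_add_epi64(sum, ONE);
    }

    return sum;
}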

As you can see, the code requires many more instructions, and even after optimization it may still be much longer than a simple ADD/ADC pair on x86-64 (or four instructions on 32-bit x86).
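As one possible direction for such optimization, the branch can be removed by keeping the comparison result as a lane mask and shifting it into the high lane. The sketch below is an illustration rather than the answer's own code; `cmplt_epu64` and `add128_branchless` are hypothetical names, and it assumes the same layout as above (low 64 bits in lane 0, high 64 bits in lane 1).

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: per-lane unsigned 64-bit a < b.
   Each 64-bit lane of the result is all ones if true, all zeros if false. */
static inline __m128i cmplt_epu64(__m128i a, __m128i b)
{
    /* Flip the sign bit of every dword so signed compares act as unsigned. */
    const __m128i bias = _mm_set1_epi32(INT32_MIN);
    a = _mm_xor_si128(a, bias);
    b = _mm_xor_si128(b, bias);
    __m128i lt = _mm_cmplt_epi32(a, b);
    __m128i gt = _mm_cmpgt_epi32(a, b);
    /* Broadcast the high-dword and low-dword results across each 64-bit lane. */
    __m128i lt_hi = _mm_shuffle_epi32(lt, _MM_SHUFFLE(3, 3, 1, 1));
    __m128i lt_lo = _mm_shuffle_epi32(lt, _MM_SHUFFLE(2, 2, 0, 0));
    __m128i gt_hi = _mm_shuffle_epi32(gt, _MM_SHUFFLE(3, 3, 1, 1));
    /* a < b iff high dword is less, or high dwords are equal and low dword is less. */
    return _mm_or_si128(lt_hi, _mm_andnot_si128(gt_hi, lt_lo));
}

/* Hypothetical helper: 128-bit add without a branch. */
static inline __m128i add128_branchless(__m128i a, __m128i b)
{
    __m128i sum = _mm_add_epi64(a, b);        /* two independent 64-bit adds */
    __m128i carry = cmplt_epu64(sum, a);      /* all ones in each lane that wrapped */
    carry = _mm_slli_si128(carry, 8);         /* move lane 0's mask into lane 1 */
    return _mm_sub_epi64(sum, carry);         /* subtracting -1 adds the carry */
}
```

Even in this form the sequence is around ten instructions, so the scalar ADD/ADC pair remains the better choice for a single 128-bit addition.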

SSE2 does help, though, if you have many 128-bit integers to add in parallel. You then need to arrange the values so that the low halves sit together in one register and the high halves in another, which lets you add all the low parts at once and all the high parts at once.
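A minimal sketch of that split layout, assuming two 128-bit integers per operand: lane i of `lo` holds the low half of integer i, and lane i of `hi` holds its high half. The `u128x2` type and the function names are hypothetical, chosen for illustration.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical layout: two 128-bit integers stored as separate low/high vectors. */
typedef struct { __m128i lo; __m128i hi; } u128x2;

/* Hypothetical helper: per-lane unsigned 64-bit a < b, all ones where true. */
static inline __m128i cmplt_epu64(__m128i a, __m128i b)
{
    const __m128i bias = _mm_set1_epi32(INT32_MIN);  /* flip dword sign bits */
    a = _mm_xor_si128(a, bias);
    b = _mm_xor_si128(b, bias);
    __m128i lt = _mm_cmplt_epi32(a, b);
    __m128i gt = _mm_cmpgt_epi32(a, b);
    __m128i lt_hi = _mm_shuffle_epi32(lt, _MM_SHUFFLE(3, 3, 1, 1));
    __m128i lt_lo = _mm_shuffle_epi32(lt, _MM_SHUFFLE(2, 2, 0, 0));
    __m128i gt_hi = _mm_shuffle_epi32(gt, _MM_SHUFFLE(3, 3, 1, 1));
    return _mm_or_si128(lt_hi, _mm_andnot_si128(gt_hi, lt_lo));
}

/* Adds two pairs of 128-bit integers at once. */
static inline u128x2 u128x2_add(u128x2 a, u128x2 b)
{
    u128x2 r;
    r.lo = _mm_add_epi64(a.lo, b.lo);              /* both low halves at once */
    __m128i carry = cmplt_epu64(r.lo, a.lo);       /* -1 in each lane that wrapped */
    r.hi = _mm_sub_epi64(_mm_add_epi64(a.hi, b.hi), carry);  /* hi + hi + carry */
    return r;
}
```

Because the carry mask already sits in the same lane as the high half it belongs to, no cross-lane shuffle of the carry is needed here. Whether this beats scalar code depends on how expensive it is to get the data into this layout in the first place.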

phuclv


Tags: assembly, sse, sse2

Erkling
