I'm looking to understand SSE2's capabilities a little more, and would like to know if one could make a 128-bit wide integer that supports addition, subtraction, XOR and multiplication?
As an extension, the integer scalar type __int128 is supported for targets which have an integer mode wide enough to hold 128 bits. Simply write __int128 for a signed 128-bit integer, or unsigned __int128 for an unsigned 128-bit integer.
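As a minimal sketch of what that extension looks like in use, assuming GCC or Clang on a 64-bit target: the `u128` typedef and the `make_u128`, `u128_hi` and `u128_lo` helpers below are hypothetical names introduced only for illustration. Once the value is an `unsigned __int128`, the ordinary `+`, `-`, `^` and `*` operators apply to it directly.

```c
#include <stdint.h>

/* Assumption: GCC/Clang on a target where __int128 is available. */
typedef unsigned __int128 u128;

/* Build a 128-bit value from its high and low 64-bit halves. */
static inline u128 make_u128(uint64_t hi, uint64_t lo)
{
    return ((u128)hi << 64) | lo;
}

/* Extract the high and low 64-bit halves again. */
static inline uint64_t u128_hi(u128 x) { return (uint64_t)(x >> 64); }
static inline uint64_t u128_lo(u128 x) { return (uint64_t)x; }
```

With these helpers, `make_u128(0, UINT64_MAX) + 1` carries into the high half without any manual carry handling; the compiler emits the ADD/ADC sequence itself.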
The 128-bit floating-point data type can handle up to 31 significant digits (compared to 17 handled by the 64-bit long double).
In computer architecture, 128-bit integers, memory addresses, or other data units are those that are 128 bits (16 octets) wide. Also, 128-bit central processing unit (CPU) and arithmetic logic unit (ALU) architectures are those that are based on registers, address buses, or data buses of that size.
SIMD is meant to work on multiple small values at the same time, so no carry propagates from one lane into the next higher one and you must handle it manually. SSE2 has no carry flag, but you can easily compute the carry with an unsigned comparison, as carry = sum < a or carry = sum < b, like this. Worse yet, SSE2 doesn't have 64-bit comparisons either, so you must use some workaround like the one here.
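To make the carry rule concrete before moving to SSE2, here is a small scalar sketch in plain C. The `u128_parts` struct and `add_parts` function are hypothetical names used only for this illustration. Unsigned addition wraps modulo 2^64, so the low sum is smaller than either operand exactly when a carry occurred.

```c
#include <stdint.h>

/* Hypothetical two-limb representation of a 128-bit unsigned integer. */
typedef struct {
    uint64_t lo;
    uint64_t hi;
} u128_parts;

static inline u128_parts add_parts(u128_parts a, u128_parts b)
{
    u128_parts r;
    r.lo = a.lo + b.lo;             /* wraps modulo 2^64 on overflow */
    uint64_t carry = r.lo < a.lo;   /* 1 if the low addition wrapped, else 0 */
    r.hi = a.hi + b.hi + carry;     /* propagate the carry into the high limb */
    return r;
}
```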
Here is some untested, unoptimized C code based on the idea above:
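#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdbool.h>

/* Unsigned 64-bit a < b on the low lane, built from 32-bit signed compares. */
static inline bool lessthan(__m128i a, __m128i b)
{
    /* Flip the sign bit of every 32-bit lane so signed compares order them as unsigned. */
    const __m128i bias = _mm_set1_epi32((int)0x80000000u);
    a = _mm_xor_si128(a, bias);
    b = _mm_xor_si128(b, bias);
    __m128i t = _mm_cmplt_epi32(a, b);  /* per 32-bit lane: a < b */
    __m128i u = _mm_cmpgt_epi32(a, b);  /* per 32-bit lane: a > b */
    /* Lane 0 becomes (low_lt | high_lt) & ~high_gt. */
    __m128i z = _mm_or_si128(t, _mm_shuffle_epi32(t, 177));
    z = _mm_andnot_si128(_mm_shuffle_epi32(u, 245), z);
    return _mm_cvtsi128_si32(z) & 1;
}

/* 128-bit add: low 64 bits in lane 0, high 64 bits in lane 1. */
static inline __m128i addi128(__m128i a, __m128i b)
{
    __m128i sum = _mm_add_epi64(a, b);  /* two independent 64-bit adds */
    if (lessthan(sum, a))               /* low half wrapped, so there is a carry */
    {
        const __m128i ONE = _mm_set_epi64x(1, 0);  /* 1 in the high lane only */
        sum = _mm_add_epi64(sum, ONE);
    }
    return sum;
}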
As you can see, the code requires many more instructions, and even after optimizing it may still be much longer than a simple ADD/ADC pair on x86-64 (or four instructions on 32-bit x86).
SSE2 will help, though, if you have multiple 128-bit integers to add in parallel. However, you need to arrange the high and low parts of the values properly so that all the low parts can be added at once, and all the high parts at once.
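A rough sketch of that layout, assuming two 128-bit integers are processed per call: one register holds both low halves and another holds both high halves. The `u128x2` struct and `add_u128x2` function are hypothetical names. Instead of the comparison workaround, this sketch uses a different technique: it derives the carry from the bitwise identity carry-out = top bit of (a & b) | ((a | b) & ~sum), which needs only SSE2 logic and shift instructions and no branch.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical layout: lane i of lo/hi holds the low/high half of integer i. */
typedef struct {
    __m128i lo;
    __m128i hi;
} u128x2;

static inline u128x2 add_u128x2(u128x2 a, u128x2 b)
{
    u128x2 r;
    /* Add both low halves at once; each 64-bit lane wraps independently. */
    r.lo = _mm_add_epi64(a.lo, b.lo);

    /* Carry out of each lane is the top bit of (a & b) | ((a | b) & ~sum). */
    __m128i both   = _mm_and_si128(a.lo, b.lo);
    __m128i either = _mm_or_si128(a.lo, b.lo);
    __m128i carry  = _mm_or_si128(both, _mm_andnot_si128(r.lo, either));
    carry = _mm_srli_epi64(carry, 63);  /* move the top bit down: 0 or 1 per lane */

    /* Add both high halves at once, plus the per-lane carry. */
    r.hi = _mm_add_epi64(_mm_add_epi64(a.hi, b.hi), carry);
    return r;
}
```

The same structure extends to wider registers or more integers per call, as long as the data is stored with all low halves together and all high halves together.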