
Is there a way to subtract packed unsigned doublewords, saturated, on x86, using MMX/SSE?

I've been looking at MMX/SSE and I am wondering. There are instructions for packed, saturated subtraction of unsigned bytes and words, but not doublewords.

Is there a way of doing what I want, or if not, why is there none?

z0rberg's asked Oct 15 '22 14:10


1 Answer

If you have SSE4.1 available, I don't think you can get better than using the pmaxud+psubd approach suggested by @harold: max(a,b) - b equals a - b when a > b and 0 otherwise, which is exactly the saturated difference. With AVX2, you can of course also use the corresponding 256-bit variants.

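#include <smmintrin.h> // SSE4.1 (_mm_max_epu32)

__m128i subs_epu32_sse4(__m128i a, __m128i b){
    __m128i mx = _mm_max_epu32(a,b); // per lane: max(a,b) >= b, so the subtraction cannot wrap
    return _mm_sub_epi32(mx, b);     // a-b if a > b, otherwise b-b = 0
}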
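As a rough sketch of the 256-bit variant mentioned above, the same max-then-subtract idea can be applied to whole arrays. The helper name `subs_epu32_array` and the scalar fallback are assumptions made for illustration, not part of the answer; the AVX2 path is only compiled when the compiler defines `__AVX2__` (for example with `-mavx2`).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

/* Scalar model of one lane: max(a, b) - b is a - b when a > b, else 0. */
static uint32_t subs_u32(uint32_t a, uint32_t b) {
    uint32_t mx = a > b ? a : b;
    return mx - b;
}

/* Hypothetical helper: out[i] = saturating unsigned a[i] - b[i]. */
static void subs_epu32_array(const uint32_t *a, const uint32_t *b,
                             uint32_t *out, size_t n) {
    size_t i = 0;
    assert(a != NULL && b != NULL && out != NULL);
#if defined(__AVX2__)
    /* Eight 32-bit lanes per iteration: vpmaxud followed by vpsubd. */
    for (; i + 8 <= n; i += 8) {
        __m256i va = _mm256_loadu_si256((const __m256i *)(a + i));
        __m256i vb = _mm256_loadu_si256((const __m256i *)(b + i));
        __m256i mx = _mm256_max_epu32(va, vb);
        _mm256_storeu_si256((__m256i *)(out + i), _mm256_sub_epi32(mx, vb));
    }
#endif
    /* Remaining elements (or everything, without AVX2) use the scalar model. */
    for (; i < n; ++i) {
        out[i] = subs_u32(a[i], b[i]);
    }
}
```

The unaligned loads and stores keep the sketch free of alignment requirements; with suitably aligned buffers the aligned variants could be used instead.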

Without SSE4.1, you need to compare both arguments in some way. Unfortunately, there is no unsigned 32-bit (epu32) comparison before AVX-512, but you can simulate one by first adding 0x80000000 to both arguments (which is equivalent to XOR-ing in this case, since only the top bit changes) and then comparing them as signed values:

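#include <emmintrin.h> // SSE2

// Unsigned a > b: flipping the sign bit of both operands lets the signed pcmpgtd order them as unsigned values.
__m128i cmpgt_epu32(__m128i a, __m128i b) {
    const __m128i highest = _mm_set1_epi32((int)0x80000000);
    return _mm_cmpgt_epi32(_mm_xor_si128(a,highest),_mm_xor_si128(b,highest));
}

__m128i subs_epu32(__m128i a, __m128i b){
    __m128i not_saturated = cmpgt_epu32(a,b); // all-ones where a > b, zero elsewhere
    return _mm_and_si128(not_saturated, _mm_sub_epi32(a,b)); // zero the lanes that would wrap
}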
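To see why flipping the top bit works, here is a scalar model of a single lane. The function names and the self-check are hypothetical and only meant as a reference for testing the intrinsics; the conversion from `uint32_t` to `int32_t` assumes the usual two's complement wraparound.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Unsigned a > b using only a signed compare, as pcmpgtd does per lane. */
static int gt_u32_via_signed(uint32_t a, uint32_t b) {
    /* XOR with 0x80000000 maps 0..2^32-1 monotonically onto
       INT32_MIN..INT32_MAX, so the signed order equals the unsigned one. */
    int32_t sa = (int32_t)(a ^ UINT32_C(0x80000000));
    int32_t sb = (int32_t)(b ^ UINT32_C(0x80000000));
    return sa > sb;
}

/* Saturating a - b: keep the wrapped difference only where a > b. */
static uint32_t subs_u32_masked(uint32_t a, uint32_t b) {
    uint32_t not_saturated = gt_u32_via_signed(a, b) ? UINT32_MAX : 0u;
    return not_saturated & (a - b);
}

/* Hypothetical self-check over values around the sign-bit boundary. */
static void check_bias_compare(void) {
    static const uint32_t v[] = {0u, 1u, 0x7FFFFFFFu, 0x80000000u,
                                 0x80000001u, 0xFFFFFFFFu};
    const size_t n = sizeof v / sizeof v[0];
    for (size_t i = 0; i < n; ++i) {
        for (size_t j = 0; j < n; ++j) {
            uint32_t want = v[i] > v[j] ? v[i] - v[j] : 0u;
            assert(gt_u32_via_signed(v[i], v[j]) == (v[i] > v[j]));
            assert(subs_u32_masked(v[i], v[j]) == want);
        }
    }
}
```

Values on either side of 0x80000000 are the interesting cases, because a plain signed compare would order them incorrectly.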

In some cases, it might be better to replace the comparison with some bit-twiddling on the highest bit, which is then broadcast to every bit of the lane using an arithmetic shift. This replaces a pcmpgtd and three bit-logic operations (and having to load 0x80000000 at least once) with a psrad and five bit-logic operations:

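__m128i subs_epu32_(__m128i a, __m128i b) {
    __m128i r = _mm_sub_epi32(a,b);
    // The top bit of c is the borrow of a-b, i.e. it is set exactly when a < b (unsigned).
    __m128i c = (~a & b) | (r & ~(a^b)); // works with gcc/clang. Replace by corresponding intrinsics, if necessary (note that `andnot` is a single instruction)
    // Broadcast the borrow bit across each lane and clear the lanes that underflowed.
    return _mm_andnot_si128(_mm_srai_epi32(c,31), r);
}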
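For compilers that do not support operators on `__m128i`, the bit-twiddling variant can be spelled with intrinsics only. The following is a sketch under that assumption: `subs_epu32_intrin`, `subs_u32_borrow` and the four-lane wrapper `subs_epu32_x4` are hypothetical names, and the scalar model doubles as a fallback when SSE2 is not available. The top bit of `c` is the borrow of `a - b`, so lanes where it is set are cleared.

```c
#include <assert.h>
#include <stdint.h>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h> /* SSE2 */
#define HAVE_SSE2_SKETCH 1
#endif

/* Scalar model of one lane, used as reference and as fallback. */
static uint32_t subs_u32_borrow(uint32_t a, uint32_t b) {
    uint32_t r = a - b;                      /* wraps modulo 2^32 */
    uint32_t c = (~a & b) | (r & ~(a ^ b));  /* top bit set iff a < b */
    uint32_t mask = (uint32_t)0 - (c >> 31); /* broadcast the top bit */
    return r & ~mask;                        /* 0 on underflow, else r */
}

#ifdef HAVE_SSE2_SKETCH
static __m128i subs_epu32_intrin(__m128i a, __m128i b) {
    __m128i r  = _mm_sub_epi32(a, b);
    __m128i t0 = _mm_andnot_si128(a, b);                   /* ~a & b */
    __m128i t1 = _mm_andnot_si128(_mm_xor_si128(a, b), r); /* ~(a^b) & r */
    __m128i c  = _mm_or_si128(t0, t1);
    __m128i borrow = _mm_srai_epi32(c, 31); /* all-ones where a < b */
    return _mm_andnot_si128(borrow, r);     /* clear underflowed lanes */
}
#endif

/* Hypothetical wrapper: saturating subtraction of four lanes. */
static void subs_epu32_x4(const uint32_t a[4], const uint32_t b[4],
                          uint32_t out[4]) {
    assert(a && b && out);
#ifdef HAVE_SSE2_SKETCH
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, subs_epu32_intrin(va, vb));
#else
    for (int i = 0; i < 4; ++i) {
        out[i] = subs_u32_borrow(a[i], b[i]);
    }
#endif
}
```

This uses one psubd, one psrad and five bit-logic operations, matching the count given above.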

Godbolt link (also including adds_epu32 variants): https://godbolt.org/z/n4qaW1

Strangely, clang needs more register copies than gcc for the non-SSE4.1 variants. On the other hand, clang finds the pmaxud optimization for the cmpgt_epu32 variant when compiled with SSE4.1: https://godbolt.org/z/3o5KCm

chtz answered Oct 21 '22 16:10