SSE2 intrinsics - comparing unsigned integers

Question

I'm interested in identifying overflowing values when adding unsigned 8-bit integers, and clamping the result to 0xFF:

__m128i m1 = _mm_loadu_si128(/* 16 8-bit unsigned integers */);
__m128i m2 = _mm_loadu_si128(/* 16 8-bit unsigned integers */);

__m128i m3 = _mm_adds_epu8(m1, m2);

I would be interested in performing comparison for "less than" on these unsigned integers, similar to _mm_cmplt_epi8 for signed:

__m128i mask = _mm_cmplt_epi8 (m3, m1);
m1 = _mm_or_si128(m3, mask);

If an "epu8" equivalent was available, mask would have 0xFF where m3[i] < m1[i] (overflow!), 0x00 otherwise, and we would be able to clamp m1 using the "or", so m1 will hold the addition result where valid, and 0xFF where it overflowed.

Problem is, _mm_cmplt_epi8 performs a signed comparison, so for instance if m1[i] = 0x70 and m2[i] = 0x10, then m3[i] = 0x80 and mask[i] = 0xFF, which is obviously not what I require.

Using VS2012.

I would appreciate another approach for performing this. Thanks!

Paul R · Accepted Answer

One way of implementing compares for unsigned 8 bit vectors is to exploit _mm_max_epu8, which returns the maximum of unsigned 8 bit int elements. You can compare for equality the (unsigned) maximum value of two elements with one of the source elements and then return the appropriate result. This translates to 2 instructions for >= or <=, and 3 instructions for > or <.

Example code:

#include <stdio.h>
#include <emmintrin.h>    // SSE2

#define _mm_cmpge_epu8(a, b) \
        _mm_cmpeq_epi8(_mm_max_epu8(a, b), a)

#define _mm_cmple_epu8(a, b) _mm_cmpge_epu8(b, a)

#define _mm_cmpgt_epu8(a, b) \
        _mm_xor_si128(_mm_cmple_epu8(a, b), _mm_set1_epi8(-1))

#define _mm_cmplt_epu8(a, b) _mm_cmpgt_epu8(b, a)

int main(void)
{
    __m128i va = _mm_setr_epi8(0,   0,   1,   1,   1, 127, 127, 127, 128, 128, 128, 254, 254, 254, 255, 255);
    __m128i vb = _mm_setr_epi8(0, 255,   0,   1, 255,   0, 127, 255,   0, 128, 255,   0, 254, 255,   0, 255);

    __m128i v_ge = _mm_cmpge_epu8(va, vb);
    __m128i v_le = _mm_cmple_epu8(va, vb);
    __m128i v_gt = _mm_cmpgt_epu8(va, vb);
    __m128i v_lt = _mm_cmplt_epu8(va, vb);

    printf("va   = %4vhhu
", va);
    printf("vb   = %4vhhu
", vb);
    printf("v_ge = %4vhhu
", v_ge);
    printf("v_le = %4vhhu
", v_le);
    printf("v_gt = %4vhhu
", v_gt);
    printf("v_lt = %4vhhu
", v_lt);

    return 0;
}

Compile and run:

$ gcc -Wall _mm_cmplt_epu8.c && ./a.out 
va   =    0    0    1    1    1  127  127  127  128  128  128  254  254  254  255  255
vb   =    0  255    0    1  255    0  127  255    0  128  255    0  254  255    0  255
v_ge =  255    0  255  255    0  255  255    0  255  255    0  255  255    0  255  255
v_le =  255  255    0  255  255    0  255  255    0  255  255    0  255  255    0  255
v_gt =    0    0  255    0    0  255    0    0  255    0    0  255    0    0  255    0
v_lt =    0  255    0    0  255    0    0  255    0    0  255    0    0  255    0    0

Peter Cordes · Answer

The other answers got me thinking of a simpler method to answer the specific question more directly:

To simply detect clamping, do saturating and non-saturating additions, and compare the results.

__m128i m1 = _mm_loadu_si128(/* 16 8-bit unsigned integers */);
__m128i m2 = _mm_loadu_si128(/* 16 8-bit unsigned integers */);

__m128i m1m2_sat = _mm_adds_epu8(m1, m2);
__m128i m1m2_wrap = _mm_add_epi8(m1, m2);
__m128i non_clipped = _mm_cmpeq_epi8(m1m2_sat, m1m2_wrap);

So that's just two instructions beyond the adds, and one of them can run in parallel with the adds. So the non_clipped mask is ready one cycle after the addition result. (Potentially 3 instructions (an extra movdqa) without AVX 3-operand non-destructive vector ops.)

If the non-saturating add result is 0xFF, it will match the saturating-add result, and be detected as not clipping. This is why it's different from just checking the output of the saturating add for 0xFF bytes.

stgatilov · Answer

Another way to compare unsigned bytes: add 0x80 and compare them as signed ones.

__m128i _mm_cmplt_epu8(__m128i a, __m128i b) {
    __m128i as = _mm_add_epi8(a, _mm_set1_epi8((char)0x80));
    __m128i bs = _mm_add_epi8(b, _mm_set1_epi8((char)0x80));
    return _mm_cmplt_epi8(as, bs);
}

I don't think it is very efficient, but it works, and it may be useful in some cases. Also, you can use xor instead of addition if you want. In some cases you can even do bidirectional range checking at once, i.e. compare a value with both lower and upper bounds. To do so, align the lower bound with 0x80, similar to what this answer does.

SSE2 intrinsics - comparing unsigned integers

Tags:

c++

x86

simd

sse

intrinsics

uv_

3 Answers

Paul R

Peter Cordes

stgatilov

Recent Activity

Donate For Us

SSE2 intrinsics - comparing unsigned integers

Tags:

c++

x86

simd

sse

intrinsics

uv_

3 Answers

Paul R

Peter Cordes

stgatilov

Related questions

Recent Activity

Donate For Us