
Fastest way to horizontally sum SSE unsigned byte vector

Tags: c++, x86, simd, sse

I need to horizontally sum a __m128i that holds 16 epi8 values. The XOP instructions would make this trivial, but I don't have those available.

My current method is:

hd = _mm_hadd_epi16(_mm_cvtepi8_epi16(sum), _mm_cvtepi8_epi16(_mm_shuffle_epi8(sum, swap)));
hd = _mm_hadd_epi16(hd, hd);
hd = _mm_hadd_epi16(hd, hd);

Is there a better way with up to SSE4.1?

Chase R Lewis asked May 03 '16 07:05


1 Answer

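You can do it with SSE2's _mm_sad_epu8 (psadbw). Taking the sum of absolute differences against a zero vector adds up each group of 8 unsigned bytes and leaves the result in the low 16 bits of the corresponding 64-bit half, e.g.: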

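#include <emmintrin.h>  // SSE2 intrinsics
#include <cstdint>

inline uint32_t _mm_sum_epu8(const __m128i v)
{
    // psadbw against zero: each 64-bit half receives the sum of its 8 unsigned bytes
    __m128i vsum = _mm_sad_epu8(v, _mm_setzero_si128());
    // add the low-half sum (bits 0..31) to the high-half sum (16-bit element 4)
    return _mm_cvtsi128_si32(vsum) + _mm_extract_epi16(vsum, 4);
}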

If you're summing more than one vector of bytes, accumulate the _mm_sad_epu8 results with _mm_add_epi32 (or _mm_add_epi64 if the totals may exceed 32 bits), and do the final horizontal sum of the two 32-bit (or 64-bit) halves to a scalar only once, at the end.
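A minimal sketch of that accumulation pattern, assuming the input is a contiguous buffer of unsigned bytes. The helper name sum_bytes_sse2 and its pointer/length parameters are hypothetical and not part of the answer above; the sketch uses the 64-bit accumulator variant and handles any leftover bytes with a scalar loop.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not from the original answer): sums n unsigned bytes at p.
inline uint64_t sum_bytes_sse2(const uint8_t* p, size_t n)
{
    const __m128i zero = _mm_setzero_si128();
    __m128i acc = zero;  // two independent 64-bit partial sums
    size_t i = 0;

    // Vertical accumulation: one psadbw and one paddq per 16 input bytes.
    for (; i + 16 <= n; i += 16) {
        __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(p + i));
        acc = _mm_add_epi64(acc, _mm_sad_epu8(v, zero));
    }

    // Horizontal step, done once: add the high 64-bit half onto the low half.
    acc = _mm_add_epi64(acc, _mm_unpackhi_epi64(acc, acc));

    // Store the low 64 bits; _mm_storel_epi64 also works in 32-bit builds.
    uint64_t total = 0;
    _mm_storel_epi64(reinterpret_cast<__m128i*>(&total), acc);

    // Scalar tail for the remaining n % 16 bytes.
    for (; i < n; ++i)
        total += p[i];
    return total;
}
```

Keeping the partial sums in a vector register means the loop body stays at a load, a psadbw and an add per 16 bytes; the extraction to a scalar register happens only once per call.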

Paul R answered Nov 17 '22 01:11