
SSE Instructions: Byte+Short

I have very long byte arrays that need to be added to a destination array of type short (or int). Is there an SSE instruction that does this, or a sequence of instructions?

dajuric asked May 17 '12 14:05



2 Answers

You need to unpack each vector of 8 bit values to two vectors of 16 bit values and then add those.

__m128i v = _mm_set_epi8(15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0);
__m128i vl = _mm_unpacklo_epi8(v, _mm_set1_epi8(0)); // vl = { 7, 6, 5, 4, 3, 2, 1, 0 }
__m128i vh = _mm_unpackhi_epi8(v, _mm_set1_epi8(0)); // vh = { 15, 14, 13, 12, 11, 10, 9, 8 }

where v is a vector of 16 x 8 bit values and vl, vh are the two unpacked vectors of 8 x 16 bit values.

Note that I'm assuming that the 8 bit values are unsigned so when unpacking to 16 bits the high byte is set to 0 (i.e. no sign extension).
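To tie this back to the question, here is a minimal sketch of adding a byte array into a 16 bit destination array using the unpack approach above. The function name add_u8_to_u16 and its parameters are made up for illustration; it assumes unsigned bytes, uses unaligned loads and stores so that no alignment is required, and handles any leftover elements with a scalar loop. The 16 bit additions wrap around on overflow.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from the original answer):
 * dst[i] += src[i] for n unsigned bytes, widened to 16 bits. */
static void add_u8_to_u16(uint16_t *dst, const uint8_t *src, size_t n)
{
    const __m128i zero = _mm_setzero_si128();
    size_t i = 0;

    for (; i + 16 <= n; i += 16) {
        /* Load 16 bytes; unaligned load, so src needs no special alignment. */
        __m128i v  = _mm_loadu_si128((const __m128i *)(src + i));
        /* Zero-extend the low and high 8 bytes to 8 x 16 bit each. */
        __m128i lo = _mm_unpacklo_epi8(v, zero);
        __m128i hi = _mm_unpackhi_epi8(v, zero);
        /* Load the matching 16 destination values, add, and store back. */
        __m128i d0 = _mm_loadu_si128((const __m128i *)(dst + i));
        __m128i d1 = _mm_loadu_si128((const __m128i *)(dst + i + 8));
        _mm_storeu_si128((__m128i *)(dst + i),     _mm_add_epi16(d0, lo));
        _mm_storeu_si128((__m128i *)(dst + i + 8), _mm_add_epi16(d1, hi));
    }

    /* Scalar tail for the last n % 16 elements. */
    for (; i < n; ++i)
        dst[i] = (uint16_t)(dst[i] + src[i]);
}
```

If both arrays are known to be 16-byte aligned, the unaligned loads and stores could be swapped for their aligned counterparts (_mm_load_si128 / _mm_store_si128).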

If you want to sum a lot of these vectors and get a 32 bit result then a useful trick is to use _mm_madd_epi16 with a multiplier of 1, e.g.

__m128i vsuml = _mm_set1_epi32(0);
__m128i vsumh = _mm_set1_epi32(0);
__m128i vsum;
int sum;

for (int i = 0; i < N; i += 16)
{
    __m128i v = _mm_load_si128((const __m128i *)&x[i]); // x is the byte array; must be 16-byte aligned, N a multiple of 16
    __m128i vl = _mm_unpacklo_epi8(v, _mm_set1_epi8(0));
    __m128i vh = _mm_unpackhi_epi8(v, _mm_set1_epi8(0));
    vsuml = _mm_add_epi32(vsuml, _mm_madd_epi16(vl, _mm_set1_epi16(1)));
    vsumh = _mm_add_epi32(vsumh, _mm_madd_epi16(vh, _mm_set1_epi16(1)));
}
// do horizontal sum of 4 partial sums and store in scalar int
vsum = _mm_add_epi32(vsuml, vsumh);
vsum = _mm_add_epi32(vsum, _mm_srli_si128(vsum, 8));
vsum = _mm_add_epi32(vsum, _mm_srli_si128(vsum, 4));
sum = _mm_cvtsi128_si32(vsum);
Paul R answered Sep 19 '22 13:09


If you need to sign-extend your byte vectors instead of zero-extending them, use pmovsxbw (_mm_cvtepi8_epi16), which requires SSE4.1. Unlike the unpack hi/lo instructions, pmovsx can only read from the low half/quarter/eighth of the source register.

pmovsx can take its source directly from memory, though, even if the intrinsics make this really clumsy to express. Since shuffle throughput is more limited than load throughput on most CPUs, it's probably preferable to do two load+pmovsx operations rather than one load plus three shuffles.
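As a rough sketch of the two load+pmovsx idea for signed bytes: the function name add_i8_to_i16 is hypothetical, and the code assumes a CPU with SSE4.1 and a GCC- or Clang-style compiler (the target attribute is there only so it builds without -msse4.1). Each 8-byte load goes through _mm_loadl_epi64; whether the compiler folds that load into the memory operand of pmovsxbw depends on the compiler and version.

```c
#include <emmintrin.h>  /* SSE2: loads, stores, 16 bit add */
#include <smmintrin.h>  /* SSE4.1: _mm_cvtepi8_epi16 (pmovsxbw) */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from the original answer):
 * dst[i] += src[i] for n signed bytes, sign-extended to 16 bits. */
#if defined(__GNUC__)
__attribute__((target("sse4.1")))
#endif
static void add_i8_to_i16(int16_t *dst, const int8_t *src, size_t n)
{
    size_t i = 0;

    for (; i + 16 <= n; i += 16) {
        /* Two 8-byte loads, each sign-extended to 8 x 16 bit by pmovsxbw. */
        __m128i lo = _mm_cvtepi8_epi16(
            _mm_loadl_epi64((const __m128i *)(src + i)));
        __m128i hi = _mm_cvtepi8_epi16(
            _mm_loadl_epi64((const __m128i *)(src + i + 8)));
        /* Add into the destination using unaligned loads and stores. */
        __m128i d0 = _mm_loadu_si128((const __m128i *)(dst + i));
        __m128i d1 = _mm_loadu_si128((const __m128i *)(dst + i + 8));
        _mm_storeu_si128((__m128i *)(dst + i),     _mm_add_epi16(d0, lo));
        _mm_storeu_si128((__m128i *)(dst + i + 8), _mm_add_epi16(d1, hi));
    }

    /* Scalar tail for the last n % 16 elements. */
    for (; i < n; ++i)
        dst[i] = (int16_t)(dst[i] + src[i]);
}
```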

Peter Cordes answered Sep 20 '22 13:09