SSE Instructions: Byte+Short

I have very long byte arrays that need to be added to a destination array of type short (or int). Does such an SSE instruction exist, or a set of instructions that achieves this?

You need to unpack each vector of 8 bit values to two vectors of 16 bit values and then add those.

__m128i v = _mm_set_epi8(15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0);
__m128i vl = _mm_unpacklo_epi8(v, _mm_set1_epi8(0)); // vl = { 7, 6, 5, 4, 3, 2, 1, 0 }
__m128i vh = _mm_unpackhi_epi8(v, _mm_set1_epi8(0)); // vh = { 15, 14, 13, 12, 11, 10, 9, 8 }

where v is a vector of 16 x 8 bit values and vl, vh are the two unpacked vectors of 8 x 16 bit values.

Note that I'm assuming that the 8 bit values are unsigned so when unpacking to 16 bits the high byte is set to 0 (i.e. no sign extension).
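
To tie this back to the question, here is a minimal sketch of adding an unsigned byte array into a 16 bit destination array using the unpack approach above. The function name add_u8_to_u16, the unaligned loads/stores and the scalar tail loop are my own assumptions for illustration, not part of the original answer.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: dst[i] += src[i] for n elements, src treated as unsigned. */
void add_u8_to_u16(uint16_t *dst, const uint8_t *src, size_t n)
{
    const __m128i zero = _mm_setzero_si128();
    size_t i = 0;

    for (; i + 16 <= n; i += 16) {
        /* Unaligned loads/stores so the sketch makes no alignment assumption. */
        __m128i v  = _mm_loadu_si128((const __m128i *)(src + i));
        __m128i vl = _mm_unpacklo_epi8(v, zero);   /* bytes 0..7  -> 8 x 16 bit */
        __m128i vh = _mm_unpackhi_epi8(v, zero);   /* bytes 8..15 -> 8 x 16 bit */

        __m128i dl = _mm_loadu_si128((const __m128i *)(dst + i));
        __m128i dh = _mm_loadu_si128((const __m128i *)(dst + i + 8));

        /* Wrapping 16 bit add; _mm_adds_epu16 would saturate instead. */
        _mm_storeu_si128((__m128i *)(dst + i),     _mm_add_epi16(dl, vl));
        _mm_storeu_si128((__m128i *)(dst + i + 8), _mm_add_epi16(dh, vh));
    }

    /* Scalar tail for the last n % 16 elements. */
    for (; i < n; ++i)
        dst[i] = (uint16_t)(dst[i] + src[i]);
}
```

For an int destination the same idea applies with one more unpack step (16 bit to 32 bit) before the add.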

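If you want to sum a lot of these vectors and get a 32 bit result then a useful trick is to use _mm_madd_epi16 with a multiplier of 1: it multiplies each 16 bit element by 1 and adds adjacent pairs, giving four 32 bit partial sums per vector, e.g.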

__m128i vsuml = _mm_set1_epi32(0);   // 4 x 32 bit partial sums from the low 8 bytes
__m128i vsumh = _mm_set1_epi32(0);   // 4 x 32 bit partial sums from the high 8 bytes
__m128i vsum;
int sum;

// x is assumed to be a 16-byte-aligned array of N unsigned 8 bit values, N a multiple of 16
for (int i = 0; i < N; i += 16)
{
    __m128i v = _mm_load_si128((const __m128i *)&x[i]);   // aligned load, needs the pointer cast
    __m128i vl = _mm_unpacklo_epi8(v, _mm_set1_epi8(0));
    __m128i vh = _mm_unpackhi_epi8(v, _mm_set1_epi8(0));
    vsuml = _mm_add_epi32(vsuml, _mm_madd_epi16(vl, _mm_set1_epi16(1)));
    vsumh = _mm_add_epi32(vsumh, _mm_madd_epi16(vh, _mm_set1_epi16(1)));
}
// do horizontal sum of 4 partial sums and store in scalar int
vsum = _mm_add_epi32(vsuml, vsumh);
vsum = _mm_add_epi32(vsum, _mm_srli_si128(vsum, 8));
vsum = _mm_add_epi32(vsum, _mm_srli_si128(vsum, 4));
sum = _mm_cvtsi128_si32(vsum);
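
For anyone who wants to compile and check this, the loop can be wrapped into a self-contained function along the following lines. This is only a sketch: the name sum_u8, the unaligned load and the scalar tail for lengths that are not a multiple of 16 are assumptions I have added, not part of the answer.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical wrapper: returns the sum of n unsigned bytes. */
uint32_t sum_u8(const uint8_t *x, size_t n)
{
    const __m128i zero = _mm_setzero_si128();
    const __m128i ones = _mm_set1_epi16(1);
    __m128i vsuml = _mm_setzero_si128();
    __m128i vsumh = _mm_setzero_si128();
    size_t i = 0;

    for (; i + 16 <= n; i += 16) {
        __m128i v  = _mm_loadu_si128((const __m128i *)(x + i));
        __m128i vl = _mm_unpacklo_epi8(v, zero);
        __m128i vh = _mm_unpackhi_epi8(v, zero);
        /* madd by 1 adds adjacent 16 bit pairs into 32 bit lanes. */
        vsuml = _mm_add_epi32(vsuml, _mm_madd_epi16(vl, ones));
        vsumh = _mm_add_epi32(vsumh, _mm_madd_epi16(vh, ones));
    }

    /* Horizontal sum of the 32 bit partial sums. */
    __m128i vsum = _mm_add_epi32(vsuml, vsumh);
    vsum = _mm_add_epi32(vsum, _mm_srli_si128(vsum, 8));
    vsum = _mm_add_epi32(vsum, _mm_srli_si128(vsum, 4));
    uint32_t sum = (uint32_t)_mm_cvtsi128_si32(vsum);

    /* Scalar tail for the last n % 16 bytes. */
    for (; i < n; ++i)
        sum += x[i];
    return sum;
}
```

The 32 bit lanes wrap on overflow, so the result is only meaningful while the true total fits in 32 bits.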

If you need to sign-extend your byte vectors instead of zero-extend, use pmovsxbw (_mm_cvtepi8_epi16), which requires SSE4.1. Unlike the unpack hi/lo instructions, you can only pmovsx from the low half/quarter/eighth of a src register.

You can pmovsx directly from memory, though intrinsics make this really clumsy. Since shuffle throughput is more limited than load throughput on most CPUs, it's probably preferable to do two load+pmovsx than to do one load + three shuffles.
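
As a rough sketch of the load+pmovsx approach for signed bytes, the following adds an int8_t array into an int16_t destination, 8 elements at a time. The function name add_i8_to_i16 and the scalar tail are assumptions for illustration. The target attribute is GCC/Clang-specific and is only there so the example builds without -msse4.1; with other compilers, drop it and enable SSE4.1 however that compiler expects.

```c
#include <smmintrin.h>  /* SSE4.1: _mm_cvtepi8_epi16 */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: dst[i] += src[i] for n elements, src sign-extended. */
__attribute__((target("sse4.1")))
void add_i8_to_i16(int16_t *dst, const int8_t *src, size_t n)
{
    size_t i = 0;

    for (; i + 8 <= n; i += 8) {
        /* 8-byte load into the low half of the register; a compiler may or
           may not fold this into pmovsxbw's memory operand. */
        __m128i v8  = _mm_loadl_epi64((const __m128i *)(src + i));
        __m128i v16 = _mm_cvtepi8_epi16(v8);          /* 8 x int8 -> 8 x int16 */

        __m128i d = _mm_loadu_si128((const __m128i *)(dst + i));
        _mm_storeu_si128((__m128i *)(dst + i), _mm_add_epi16(d, v16));
    }

    /* Scalar tail for the last n % 8 elements. */
    for (; i < n; ++i)
        dst[i] = (int16_t)(dst[i] + src[i]);
}
```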

Tags: x86, sse, instructions

Asked by dajuric

2 Answers: Paul R, Peter Cordes
