Intel provides a C-style intrinsic named _mm256_madd_epi16, which is documented as:
__m256i _mm256_madd_epi16 (__m256i a, __m256i b)
Multiply packed signed 16-bit integers in a and b, producing intermediate signed 32-bit integers. Horizontally add adjacent pairs of intermediate 32-bit integers, and pack the results in dst.
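In scalar terms, that means (my own paraphrase for clarity, not Intel's wording):

#include <stdint.h>

// Scalar model of what _mm256_madd_epi16 computes: each 32-bit output
// element is the sum of two adjacent signed 16-bit products.
static inline void madd_epi16_scalar(const int16_t a[16], const int16_t b[16], int32_t dst[8])
{
    for (int i = 0; i < 8; ++i)
        dst[i] = (int32_t)a[2*i] * b[2*i] + (int32_t)a[2*i+1] * b[2*i+1];
}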
Now I have two __m256i variables, each holding 32 signed 8-bit integers. I want to implement the same functionality as _mm256_madd_epi16 does, but with each int32_t element in the result __m256i being the sum of four products of signed char instead of two products of signed int16_t.
I can do that in a scalar loop:
alignas(32) uint32_t res[8] = {};   // must start zeroed; we accumulate into it
for (int i = 0; i < 32; ++i)
    res[i / 4] += _mm256_extract_epi8(a, i) * _mm256_extract_epi8(b, i);
return _mm256_load_si256((__m256i*)res);
Note that the multiply result is sign-extended to int before adding, and that the _mm256_extract_epi8 helper function¹ returns signed __int8. Never mind that the total is uint32_t instead of int32_t; it can't overflow anyway with only four 8x8 => 16-bit products to add (the worst-case magnitude is 4 × 128 × 128 = 2^16, far below what 32 bits can hold).
It looks very ugly, and doesn't run efficiently unless the compiler does some magic to do it with SIMD instead of compiling it as written to scalar extraction.
Footnote 1: _mm256_extract_epi8 is not an intrinsic. vpextrb only works for the low lane of a 256-bit vector, and this helper function may allow an index that isn't a compile-time constant.
If one of your inputs is known to always be non-negative, you can use pmaddubsw, the 8->16 bit equivalent of pmaddwd. It does signed saturation to 16 bits if the sum overflows, which is possible, so you might need to avoid it if that's a problem for your case.
But otherwise, you can pmaddubsw and then manually sign-extend the 16-bit elements to 32 and add them. Or use pmaddwd against _mm256_set1_epi16(1) to hsum pairs of elements with correct handling of the signs.
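A minimal sketch of the first option (manually widening and adding), assuming a holds the known non-negative bytes as pmaddubsw requires; the variable names are my own:

__m256i sum16 = _mm256_maddubs_epi16(a, b);    // u8 x s8 products, pairs summed to i16 (saturating)
// sign-extend the low 16 bits of each 32-bit pair: shift left, then arithmetic shift right
__m256i lo32 = _mm256_srai_epi32(_mm256_slli_epi32(sum16, 16), 16);
__m256i hi32 = _mm256_srai_epi32(sum16, 16);   // high 16 bits of each pair, sign-extended
__m256i quad = _mm256_add_epi32(lo32, hi32);   // sum of four byte products per i32 element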
The obvious solution would be to unpack your input bytes to 16-bit elements with zero- or sign-extension. Then you can use pmaddwd twice, and add the results.
If your inputs are coming from memory, loading them with vpmovsxbw might make sense. e.g.
__m256i a = _mm256_cvtepi8_epi16(_mm_loadu_si128((const __m128i*)&arr1[i]));
__m256i b = _mm256_cvtepi8_epi16(_mm_loadu_si128((const __m128i*)&arr2[i]));
But now you have the 4 bytes that you want spread out across two dwords, so you'd have to shuffle the result of one _mm256_madd_epi16(a,b). You could maybe use vphaddd to shuffle and add two 256-bit vectors of products into one 256-bit vector of the results you want, but that's a lot of shuffling.
So instead, I think we want to generate two 256-bit vectors from each 256-bit input vector: one with the high byte in each word sign-extended to 16 bits, and the other with the low byte sign-extended. We can do that with 3 shifts (for each input):
__m256i a = _mm256_loadu_si256((const __m256i*)&arr1[i]);
__m256i b = _mm256_loadu_si256((const __m256i*)&arr2[i]);
__m256i a_high = _mm256_srai_epi16(a, 8); // arithmetic right shift sign extends
// some compilers may only know the less-descriptive _mm256_slli_si256 name for vpslldq
__m256i a_low = _mm256_bslli_epi128(a, 1); // left 1 byte = low to high in each 16-bit element
a_low = _mm256_srai_epi16(a_low, 8); // arithmetic right shift sign extends
// then same for b_low / b_high
__m256i prod_hi = _mm256_madd_epi16(a_high, b_high);
__m256i prod_lo = _mm256_madd_epi16(a_low, b_low);
__m256i quadsum = _mm256_add_epi32(prod_lo, prod_hi);
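Putting those pieces together, a self-contained version might look like the following sketch (the b_high / b_low steps are just the mirror of the a ones; the function name and argument types are my own choices):

#include <immintrin.h>
#include <stdint.h>

// Sketch: for one pair of 32-byte blocks, multiply corresponding signed bytes
// and sum each group of four products into a signed 32-bit element.
static inline __m256i dot4_i8_to_i32(const int8_t *arr1, const int8_t *arr2)
{
    __m256i a = _mm256_loadu_si256((const __m256i*)arr1);
    __m256i b = _mm256_loadu_si256((const __m256i*)arr2);

    __m256i a_high = _mm256_srai_epi16(a, 8);                          // sign-extend high (odd) bytes
    __m256i a_low  = _mm256_srai_epi16(_mm256_bslli_epi128(a, 1), 8);  // sign-extend low (even) bytes
    __m256i b_high = _mm256_srai_epi16(b, 8);
    __m256i b_low  = _mm256_srai_epi16(_mm256_bslli_epi128(b, 1), 8);

    __m256i prod_hi = _mm256_madd_epi16(a_high, b_high);   // pair-sums of odd-byte products
    __m256i prod_lo = _mm256_madd_epi16(a_low, b_low);     // pair-sums of even-byte products
    return _mm256_add_epi32(prod_lo, prod_hi);             // four products per 32-bit element
}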
As an alternative to vpslldq by 1 byte, vpsllw by 8 bits (__m256i a_low = _mm256_slli_epi16(a, 8);) is the more "obvious" way to shift low to high within each word, and may be better if the surrounding code bottlenecks on shuffles. But normally it's worse, because this code heavily bottlenecks on shift + vector-integer multiply.
On KNL, you could use AVX512 vprold z,z,i (Agner Fog doesn't show a timing for AVX512 vpslld z,z,i) because it doesn't matter what you shift or shuffle into the low byte of each word; this is just setup for an arithmetic right shift.
Haswell runs vector shifts and vector-integer multiply only on port 0, so this badly bottlenecks on that. (Skylake is better: p0/p1). http://agner.org/optimize/.
We can use a shuffle (port 5) instead of the left shift as setup for an arithmetic right shift. This improves throughput and even reduces latency by reducing resource conflicts.
But we can avoid the shuffle-control vector by using vpslldq to do a vector byte shift. It's still an in-lane shuffle (shifting in zeros at the end of each lane), so it still has single-cycle latency. (My first idea was vpshufb with a control vector like 14,14, 12,12, 10,10, ..., then vpalignr, then I remembered that simple old pslldq has an AVX2 version. There are two intrinsic names for the same instruction; I like _mm256_bslli_epi128 because the b for byte-shift distinguishes it as a shuffle, unlike the within-element bit-shifts. I didn't check which compilers support which name for the 128-bit or 256-bit versions of the intrinsic.)
This should also help on AMD Ryzen, where vector shifts only run on one execution unit (P2), but shuffles can run on P1 or P2. I haven't looked closely at Ryzen port conflicts, but I'm pretty sure this won't be worse on any CPU (except KNL Xeon Phi, where AVX2 ops on elements smaller than a dword are all super slow): shifts and in-lane shuffles are the same number of uops and the same latency.
Zero-extending (valid when that input is known non-negative) is cheaper than manually sign-extending, and avoids port bottlenecks. a_low and/or b_low can be created with _mm256_and_si256(a, _mm256_set1_epi16(0x00ff)).
a_high and/or b_high can be created with a shuffle instead of a shift. (pshufb zeros an element when the shuffle-control vector has its high bit set.)
const __m256i pshufb_emulate_srl8 = _mm256_set_epi8(
    0x80,15, 0x80,13, 0x80,11, ...,
    0x80,15, 0x80,13, 0x80,11, ...);
__m256i a_high = _mm256_shuffle_epi8(a, pshufb_emulate_srl8); // zero-extend
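For example, if a is the input known to be non-negative, the zero-extension setup could be spelled out like this (a sketch; the full shuffle constant is my expansion of the elided pattern above, so double-check the byte order before relying on it):

// Zero-extension is only equivalent to sign-extension if a's bytes are non-negative.
const __m256i pshufb_emulate_srl8 = _mm256_set_epi8(
    0x80,15, 0x80,13, 0x80,11, 0x80,9, 0x80,7, 0x80,5, 0x80,3, 0x80,1,   // high lane
    0x80,15, 0x80,13, 0x80,11, 0x80,9, 0x80,7, 0x80,5, 0x80,3, 0x80,1);  // low lane (same in-lane pattern)
__m256i a_high = _mm256_shuffle_epi8(a, pshufb_emulate_srl8);            // high byte of each word, zero-extended
__m256i a_low  = _mm256_and_si256(a, _mm256_set1_epi16(0x00ff));         // low byte of each word, zero-extended
// b_high / b_low still need the arithmetic-shift version if b can be negative.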
Shuffle throughput is also limited to 1 per clock on mainstream Intel, so you could bottleneck on shuffles if you go overboard. But at least it's not the same port as the multiply. If only the high bytes are known non-negative, replacing vpsraw/vpsrlw with vpshufb could help. Unaligned loads (offset by one byte, so those high bytes land in low-byte positions) could be even more helpful, setting up for vpand for a_low and/or b_low.
pmaddubsw: I think this is usable if at least one input is non-negative (and thus can be treated as unsigned). It treats one input as signed, the other as unsigned, and does i8 x u8 => i16, then adds horizontal pairs to make 16-bit integers (with signed saturation because the sums can overflow; that might also rule it out for your use-case).
But possibly just use it and then add horizontal pairs with pmaddwd against a constant 1:
__m256i sum16 = _mm256_maddubs_epi16(a, b);
__m256i sum32 = _mm256_madd_epi16(sum16, _mm256_set1_epi16(1));
(pmaddwd for a horizontal 16=>32-bit sum is maybe higher latency than shift / and / add, but it does treat everything as signed. And it's only a single uop, so it's good for throughput.)
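As a usage sketch (my own example, not from the question: it assumes arr1 is known to hold non-negative bytes so pmaddubsw's unsigned operand is valid, and that n is a multiple of 32), a whole dot product could accumulate those quad sums like this:

#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

// Hypothetical dot product: sum of arr1[i] * arr2[i] for i in [0, n).
// Note the i32 accumulator can itself overflow for very large n.
static int32_t dot_product_u8_i8(const uint8_t *arr1, const int8_t *arr2, size_t n)
{
    __m256i acc = _mm256_setzero_si256();
    for (size_t i = 0; i < n; i += 32) {
        __m256i a = _mm256_loadu_si256((const __m256i*)&arr1[i]);
        __m256i b = _mm256_loadu_si256((const __m256i*)&arr2[i]);
        __m256i sum16 = _mm256_maddubs_epi16(a, b);                      // u8 x s8 pairs -> i16 (saturating)
        __m256i sum32 = _mm256_madd_epi16(sum16, _mm256_set1_epi16(1));  // hsum pairs -> i32
        acc = _mm256_add_epi32(acc, sum32);
    }
    // horizontal sum of the 8 accumulator lanes
    __m128i s = _mm_add_epi32(_mm256_castsi256_si128(acc),
                              _mm256_extracti128_si256(acc, 1));
    s = _mm_add_epi32(s, _mm_shuffle_epi32(s, _MM_SHUFFLE(1, 0, 3, 2)));
    s = _mm_add_epi32(s, _mm_shuffle_epi32(s, _MM_SHUFFLE(2, 3, 0, 1)));
    return _mm_cvtsi128_si32(s);
}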