Intel provides a C-style intrinsic named _mm256_madd_epi16, which is documented as:
__m256i _mm256_madd_epi16 (__m256i a, __m256i b)
Multiply packed signed 16-bit integers in a and b, producing intermediate signed 32-bit integers. Horizontally add adjacent pairs of intermediate 32-bit integers, and pack the results in dst.
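In scalar terms, that means (my own paraphrase for clarity, not Intel's wording):

#include <stdint.h>

// Scalar model of what _mm256_madd_epi16 computes: each 32-bit output
// element is the sum of two adjacent signed 16-bit products.
static inline void madd_epi16_scalar(const int16_t a[16], const int16_t b[16], int32_t dst[8])
{
    for (int i = 0; i < 8; ++i)
        dst[i] = (int32_t)a[2*i] * b[2*i] + (int32_t)a[2*i+1] * b[2*i+1];
}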
Now I have two __m256i variables, each holding 32 signed 8-bit integers. I want to implement the same functionality as _mm256_madd_epi16 does, but with each int32_t element in the result __m256i being the sum of four products of signed char instead of two products of signed int16_t.
I can do that in a scalar loop:
alignas(32) uint32_t res[8] = {};   // must start zeroed; we accumulate into it
for (int i = 0; i < 32; ++i)
    res[i / 4] += _mm256_extract_epi8(a, i) * _mm256_extract_epi8(b, i);
return _mm256_load_si256((__m256i*)res);
Note that the multiply result is sign-extended to int before adding, and that the _mm256_extract_epi8 helper function¹ returns signed __int8. Never mind that the total is uint32_t instead of int32_t; it can't overflow anyway with only four 8x8 => 16-bit products to add (the worst-case magnitude is 4 × 128 × 128 = 2^16, far below what 32 bits can hold).
It looks very ugly, and doesn't run efficiently unless the compiler does some magic to do it with SIMD instead of compiling it as written to scalar extraction.
Footnote 1: _mm256_extract_epi8 is not an intrinsic. vpextrb only works for the low lane of a 256-bit vector, and this helper function may allow an index that isn't a compile-time constant.
If one of your inputs is known to always be non-negative, you can use pmaddubsw, the 8->16 bit equivalent of pmaddwd. It does signed saturation to 16 bits if the sum overflows, which is possible, so you might need to avoid it if that's a problem for your case.
But otherwise, you can pmaddubsw and then manually sign-extend the 16-bit elements to 32 and add them. Or use pmaddwd against _mm256_set1_epi16(1) to hsum pairs of elements with correct handling of the signs.
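A minimal sketch of the first option (manually widening and adding), assuming a holds the known non-negative bytes as pmaddubsw requires; the variable names are my own:

__m256i sum16 = _mm256_maddubs_epi16(a, b);    // u8 x s8 products, pairs summed to i16 (saturating)
// sign-extend the low 16 bits of each 32-bit pair: shift left, then arithmetic shift right
__m256i lo32 = _mm256_srai_epi32(_mm256_slli_epi32(sum16, 16), 16);
__m256i hi32 = _mm256_srai_epi32(sum16, 16);   // high 16 bits of each pair, sign-extended
__m256i quad = _mm256_add_epi32(lo32, hi32);   // sum of four byte products per i32 element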
The obvious solution would be to unpack your input bytes to 16-bit elements with zero- or sign-extension. Then you can use pmaddwd twice, and add the results.
If your inputs are coming from memory, loading them with vpmovsxbw might make sense. e.g.
__m256i a = _mm256_cvtepi8_epi16(_mm_loadu_si128((const __m128i*)&arr1[i]));
__m256i b = _mm256_cvtepi8_epi16(_mm_loadu_si128((const __m128i*)&arr2[i]));
But now you have the 4 bytes that you want spread out across two dwords, so you'd have to shuffle the result of one _mm256_madd_epi16(a,b). You could maybe use vphaddd to shuffle and add two 256-bit vectors of products into one 256-bit vector of the results you want, but that's a lot of shuffling.
So instead, I think we want to generate two 256-bit vectors from each 256-bit input vector: one with the high byte in each word sign-extended to 16 bits, and the other with the low byte sign-extended. We can do that with 3 shifts (for each input):
__m256i a = _mm256_loadu_si256((const __m256i*)&arr1[i]);
__m256i b = _mm256_loadu_si256((const __m256i*)&arr2[i]);
__m256i a_high = _mm256_srai_epi16(a, 8); // arithmetic right shift sign extends
// some compilers may only know the less-descriptive _mm256_slli_si256 name for vpslldq
__m256i a_low = _mm256_bslli_epi128(a, 1); // left 1 byte = low to high in each 16-bit element
a_low = _mm256_srai_epi16(a_low, 8); // arithmetic right shift sign extends
// then same for b_low / b_high
__m256i prod_hi = _mm256_madd_epi16(a_high, b_high);
__m256i prod_lo = _mm256_madd_epi16(a_low, b_low);
__m256i quadsum = _mm256_add_epi32(prod_lo, prod_hi);
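Putting those pieces together, a self-contained version might look like the following sketch (the b_high / b_low steps are just the mirror of the a ones; the function name and argument types are my own choices):

#include <immintrin.h>
#include <stdint.h>

// Sketch: for one pair of 32-byte blocks, multiply corresponding signed bytes
// and sum each group of four products into a signed 32-bit element.
static inline __m256i dot4_i8_to_i32(const int8_t *arr1, const int8_t *arr2)
{
    __m256i a = _mm256_loadu_si256((const __m256i*)arr1);
    __m256i b = _mm256_loadu_si256((const __m256i*)arr2);

    __m256i a_high = _mm256_srai_epi16(a, 8);                          // sign-extend high (odd) bytes
    __m256i a_low  = _mm256_srai_epi16(_mm256_bslli_epi128(a, 1), 8);  // sign-extend low (even) bytes
    __m256i b_high = _mm256_srai_epi16(b, 8);
    __m256i b_low  = _mm256_srai_epi16(_mm256_bslli_epi128(b, 1), 8);

    __m256i prod_hi = _mm256_madd_epi16(a_high, b_high);   // pair-sums of odd-byte products
    __m256i prod_lo = _mm256_madd_epi16(a_low, b_low);     // pair-sums of even-byte products
    return _mm256_add_epi32(prod_lo, prod_hi);             // four products per 32-bit element
}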
As an alternative to vpslldq by 1 byte, vpsllw by 8 bits (__m256i a_low = _mm256_slli_epi16(a, 8);) is the more "obvious" way to shift low to high within each word, and may be better if the surrounding code bottlenecks on shuffles. But normally it's worse, because this code heavily bottlenecks on shift + vector-integer multiply.
On KNL, you could use AVX512 vprold z,z,i (Agner Fog doesn't show a timing for AVX512 vpslld z,z,i) because it doesn't matter what you shift or shuffle into the low byte of each word; this is just setup for an arithmetic right shift.
Haswell runs vector shifts and vector-integer multiply only on port 0, so this badly bottlenecks on that. (Skylake is better: p0/p1). http://agner.org/optimize/.
We can use a shuffle (port 5) instead of the left shift as setup for an arithmetic right shift. This improves throughput and even reduces latency by reducing resource conflicts.
But we can avoid the shuffle-control vector by using vpslldq to do a vector byte shift. It's still an in-lane shuffle (shifting in zeros at the end of each lane), so it still has single-cycle latency. (My first idea was vpshufb with a control vector like 14,14, 12,12, 10,10, ..., then vpalignr, then I remembered that simple old pslldq has an AVX2 version. There are two intrinsic names for the same instruction; I like _mm256_bslli_epi128 because the b for byte-shift distinguishes it as a shuffle, unlike the within-element bit-shifts. I didn't check which compilers support which name for the 128-bit or 256-bit versions of the intrinsic.)
This should also help on AMD Ryzen, where vector shifts only run on one execution unit (P2), but shuffles can run on P1 or P2. I haven't looked closely at Ryzen port conflicts, but I'm pretty sure this won't be worse on any CPU (except KNL Xeon Phi, where AVX2 ops on elements smaller than a dword are all super slow): shifts and in-lane shuffles are the same number of uops and the same latency.
Zero-extending (valid when that input is known non-negative) is cheaper than manually sign-extending, and avoids port bottlenecks. a_low and/or b_low can be created with _mm256_and_si256(a, _mm256_set1_epi16(0x00ff)).
a_high and/or b_high can be created with a shuffle instead of a shift. (pshufb zeros an element when the shuffle-control vector has its high bit set.)
const __m256i pshufb_emulate_srl8 = _mm256_set_epi8(
    0x80,15, 0x80,13, 0x80,11, ...,
    0x80,15, 0x80,13, 0x80,11, ...);
__m256i a_high = _mm256_shuffle_epi8(a, pshufb_emulate_srl8); // zero-extend
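For example, if a is the input known to be non-negative, the zero-extension setup could be spelled out like this (a sketch; the full shuffle constant is my expansion of the elided pattern above, so double-check the byte order before relying on it):

// Zero-extension is only equivalent to sign-extension if a's bytes are non-negative.
const __m256i pshufb_emulate_srl8 = _mm256_set_epi8(
    0x80,15, 0x80,13, 0x80,11, 0x80,9, 0x80,7, 0x80,5, 0x80,3, 0x80,1,   // high lane
    0x80,15, 0x80,13, 0x80,11, 0x80,9, 0x80,7, 0x80,5, 0x80,3, 0x80,1);  // low lane (same in-lane pattern)
__m256i a_high = _mm256_shuffle_epi8(a, pshufb_emulate_srl8);            // high byte of each word, zero-extended
__m256i a_low  = _mm256_and_si256(a, _mm256_set1_epi16(0x00ff));         // low byte of each word, zero-extended
// b_high / b_low still need the arithmetic-shift version if b can be negative.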
Shuffle throughput is also limited to 1 per clock on mainstream Intel, so you could bottleneck on shuffles if you go overboard. But at least it's not the same port as the multiply. If only the high bytes are known non-negative, replacing vpsraw/vpsrlw with vpshufb could help. Unaligned loads (offset by one byte, so those high bytes land in low-byte positions) could be even more helpful, setting up for vpand for a_low and/or b_low.
pmaddubsw: I think this is usable if at least one input is non-negative (and thus can be treated as unsigned). It treats one input as signed, the other as unsigned, and does i8 x u8 => i16, then adds horizontal pairs to make 16-bit integers (with signed saturation because the sums can overflow; that might also rule it out for your use-case).
But possibly just use it and then add horizontal pairs with pmaddwd against a constant 1:
__m256i sum16 = _mm256_maddubs_epi16(a, b);
__m256i sum32 = _mm256_madd_epi16(sum16, _mm256_set1_epi16(1));
(pmaddwd for a horizontal 16=>32-bit sum is maybe higher latency than shift / and / add, but it does treat everything as signed. And it's only a single uop, so it's good for throughput.)
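As a usage sketch (my own example, not from the question: it assumes arr1 is known to hold non-negative bytes so pmaddubsw's unsigned operand is valid, and that n is a multiple of 32), a whole dot product could accumulate those quad sums like this:

#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

// Hypothetical dot product: sum of arr1[i] * arr2[i] for i in [0, n).
// Note the i32 accumulator can itself overflow for very large n.
static int32_t dot_product_u8_i8(const uint8_t *arr1, const int8_t *arr2, size_t n)
{
    __m256i acc = _mm256_setzero_si256();
    for (size_t i = 0; i < n; i += 32) {
        __m256i a = _mm256_loadu_si256((const __m256i*)&arr1[i]);
        __m256i b = _mm256_loadu_si256((const __m256i*)&arr2[i]);
        __m256i sum16 = _mm256_maddubs_epi16(a, b);                      // u8 x s8 pairs -> i16 (saturating)
        __m256i sum32 = _mm256_madd_epi16(sum16, _mm256_set1_epi16(1));  // hsum pairs -> i32
        acc = _mm256_add_epi32(acc, sum32);
    }
    // horizontal sum of the 8 accumulator lanes
    __m128i s = _mm_add_epi32(_mm256_castsi256_si128(acc),
                              _mm256_extracti128_si256(acc, 1));
    s = _mm_add_epi32(s, _mm_shuffle_epi32(s, _MM_SHUFFLE(1, 0, 3, 2)));
    s = _mm_add_epi32(s, _mm_shuffle_epi32(s, _MM_SHUFFLE(2, 3, 0, 1)));
    return _mm_cvtsi128_si32(s);
}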