The following intrinsic does not seem to be available for AVX-512:
__m512i _mm512_sign_epi16 (__m512i a, __m512i b)
Will it be available soon, or is there an alternative?
If you don't need the zeroing part, you only need 2 instructions (and a zeroed register): you can _mm512_movepi16_mask() the sign bits into a mask (the AVX-512 version of pmovmskb), and do a merge-masked subtract from zero to negate a vector based on the signs of another.
#ifdef __AVX512BW__
// does *not* do anything special for signs[i] == 0, just negative / non-negative
__m512i conditional_negate(__m512i target, __m512i signs) {
    __mmask32 negmask = _mm512_movepi16_mask(signs);
    // vpsubw target{negmask}, 0, target
    __m512i neg = _mm512_mask_sub_epi16(target, negmask, _mm512_setzero_si512(), target);
    return neg;
}
#endif
vector -> mask has 3 cycle latency on Skylake-X (with vpmovw2m, vptestmw, or vpcmpw), but using the mask only adds another 1 cycle of latency. So the latencies from inputs to outputs are:
signs -> result: 4 cycles on SKX (3 for vector -> mask, plus 1 for the masked sub)
target -> result: 1 cycle on SKX (just the masked vpsubw from zero)
To also apply the is-zero condition: you may be able to zero-mask or merge-mask the next operation you do with the vector, so the elements that should have been zero are simply unused.
You need an extra compare to create another mask, but you probably don't need to waste a 2nd extra instruction to apply it right away.
If you really want to build a self-contained vpsignw this way, we can do that final zero-masking, but it takes 4 intrinsics that compile to 4 instructions and is probably worse for throughput than @wim's min/max/multiply. It does have good critical-path latency, though: about 5 cycles total on SKX (or 4 if you can fold the final masking into something else). The critical path is signs -> mask, then the masked sub; the signs -> non-zero-mask compare can run in parallel with either of those.
__m512i mm512_psignw(__m512i target, __m512i signs) {
    __mmask32 negmask = _mm512_movepi16_mask(signs);
    // vpsubw target{negmask}, 0, target : merge masking to only modify elements that need negating
    __m512i neg = _mm512_mask_sub_epi16(target, negmask, _mm512_setzero_si512(), target);
    __mmask32 nonzeromask = _mm512_test_epi16_mask(signs, signs); // per-element non-zero?
    return _mm512_maskz_mov_epi16(nonzeromask, neg); // zero elements where signs was zero
}
Possibly the compiler can fold this zero-masking vmovdqu16 intrinsic into merge-masking for a following add/or/xor, or into zero-masking for a following multiply/and. But it's probably a good idea to do that yourself.
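As a concrete example of doing the folding yourself, here is a minimal sketch under an assumed use case: the next thing done with the result is an add with some vector c (the helper name psignw_then_add and the operand c are made up for illustration, not from the original answer). The merge-masked add keeps c[i] wherever signs[i] == 0, which is the same as adding a zeroed element, so the separate vmovdqu16 disappears.
#include <immintrin.h>
#ifdef __AVX512BW__
// Hypothetical helper: computes psignw(target, signs) + c without a
// separate zero-masking vmovdqu16.  Where signs[i] == 0 the merge-masked
// add just keeps c[i], which equals 0 + c[i].
__m512i psignw_then_add(__m512i target, __m512i signs, __m512i c) {
    __mmask32 negmask = _mm512_movepi16_mask(signs);          // sign bits -> mask
    __m512i neg = _mm512_mask_sub_epi16(target, negmask,      // negate only where signs < 0
                                        _mm512_setzero_si512(), target);
    __mmask32 nonzeromask = _mm512_test_epi16_mask(signs, signs); // per-element non-zero?
    return _mm512_mask_add_epi16(c, nonzeromask, neg, c);     // vpaddw with merge-masking from c
}
#endif
The same idea works with zero-masking for a following multiply or and, since a zeroed element produces 0 in those operations anyway.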
A possible solution is:
__m512i mm512_sign_epi16(__m512i a, __m512i b){
    /* Emulate _mm512_sign_epi16() with instructions */
    /* that exist in the AVX-512 instruction set     */
    b = _mm512_min_epi16(b, _mm512_set1_epi16(1));   /* clamp b between -1 and 1 */
    b = _mm512_max_epi16(b, _mm512_set1_epi16(-1));  /* now b = -1, 0 or 1       */
    a = _mm512_mullo_epi16(a, b);                    /* apply the sign of b to a */
    return a;
}
This solution should have reasonable throughput, but the latency might not be optimal because of the integer multiply. A good alternative is Peter Cordes' solution, which has better latency; in practice, though, high throughput is usually of more interest than low latency.
Anyway, the actual performance of the different alternatives (the solution here, Peter Cordes' answer, and the splitting idea in chtz's comment) depends on the surrounding code and on the type of CPU that executes the instructions. You'll have to benchmark the alternatives to see which one is fastest in your particular case.
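For reference, one way to read the splitting idea (this is only a sketch of that interpretation, not necessarily what chtz had in mind) is to run the existing AVX2 _mm256_sign_epi16 on each 256-bit half of the vectors and recombine; the function name mm512_sign_epi16_split is made up here.
#include <immintrin.h>
#if defined(__AVX512BW__) && defined(__AVX2__)
// Sketch: emulate _mm512_sign_epi16 by splitting into two 256-bit halves,
// using the AVX2 vpsignw on each half, and reassembling the result.
__m512i mm512_sign_epi16_split(__m512i a, __m512i b) {
    __m256i lo = _mm256_sign_epi16(_mm512_castsi512_si256(a),
                                   _mm512_castsi512_si256(b));
    __m256i hi = _mm256_sign_epi16(_mm512_extracti64x4_epi64(a, 1),
                                   _mm512_extracti64x4_epi64(b, 1));
    return _mm512_inserti64x4(_mm512_castsi256_si512(lo), hi, 1);
}
#endif
Whether this beats the single-register versions depends on the cost of the extra extract/insert shuffles in the surrounding code, which is exactly the kind of thing benchmarking would settle.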