Is there a function in AVX512 like _mm512_sign_epi16 (__m512i a, __m512i b)?

The following intrinsic does not seem to be available in AVX512:

__m512i _mm512_sign_epi16 (__m512i a, __m512i b)

Will it be available soon, or is there an alternative?

asked Dec 18 '22 by yueluojieying

2 Answers

If you don't need the zeroing part, you only need 2 instructions (and a zeroed register):

You can _mm512_movepi16_mask() the sign bits into a mask (AVX512 version of pmovmskb), and do a merge-masked subtract from zero to negate a vector based on the signs of another.

#ifdef __AVX512BW__
// does *not* do anything special for signs[i] == 0, just negative / non-negative
__m512i  conditional_negate(__m512i target, __m512i signs) {
    __mmask32 negmask = _mm512_movepi16_mask(signs);
      // vpsubw target{k1}, 0, target
    __m512i neg = _mm512_mask_sub_epi16(target, negmask, _mm512_setzero_si512(), target);
    return neg;
}
#endif

vector -> mask has 3-cycle latency on Skylake-X (with vpmovw2m, vptestmw, or vpcmpw), but using the mask adds only 1 more cycle of latency. So the latencies from inputs to the result are:

  • 4 cycles from signs -> result on SKX
  • 1 cycle from target -> result on SKX (just the masked vpsubw from zero.)

To also apply the is-zero condition: you may be able to zero-mask or merge-mask the next operation you do with the vector, so the elements that should have been zero are unused.

You need an extra compare to create another mask, but you probably don't need to waste a 2nd extra instruction to apply it right away.

If you really want to build a self-contained vpsignw this way, we can do that final zero-masking, but this is 4 intrinsics that compile to 4 instructions, and probably worse for throughput than @wim's min/max/multiply. But this has good critical-path latency, with about 5 cycles total on SKX (or 4 if you can fold the final masking into something else). The critical path is signs->mask, then masked sub. The signs->nonzeromask can run in parallel with either of those.

__m512i  mm512_psignw(__m512i target, __m512i signs) {
    __mmask32 negmask = _mm512_movepi16_mask(signs);
      // vpsubw target{negmask}, 0, target  merge masking to only modify elements that need negating
    __m512i neg = _mm512_mask_sub_epi16(target, negmask, _mm512_setzero_si512(), target);

    __mmask32 nonzeromask = _mm512_test_epi16_mask(signs,signs);  // per-element non-zero?
    return  _mm512_maskz_mov_epi16(nonzeromask, neg);        // zero elements where signs was zero
}

Possibly the compiler can fold this zero-masking vmovdqu16 intrinsic into merge-masking for add/or/xor, or zero-masking for multiply/and. But it's probably a good idea to do that yourself, as in the sketch below.
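
For example, a minimal sketch assuming the result feeds a multiply (the consumer operation, the extra operand `other`, and the function name are illustrative, not from the original answer):

#include <immintrin.h>

#ifdef __AVX512BW__
// Sketch: apply the is-zero condition by zero-masking the next operation
// instead of a separate vmovdqu16.  "other" and the multiply are assumed.
__m512i psignw_then_mul(__m512i target, __m512i signs, __m512i other) {
    __mmask32 negmask = _mm512_movepi16_mask(signs);
    __m512i neg = _mm512_mask_sub_epi16(target, negmask, _mm512_setzero_si512(), target);

    __mmask32 nonzeromask = _mm512_test_epi16_mask(signs, signs);  // per-element signs != 0?
    // zero-masked multiply: elements where signs[i] == 0 come out as 0,
    // so the separate maskz_mov is no longer needed
    return _mm512_maskz_mullo_epi16(nonzeromask, neg, other);
}
#endif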

answered Dec 20 '22 by Peter Cordes


A possible solution is:

__m512i mm512_sign_epi16(__m512i a, __m512i b){
    /* Emulate _mm512_sign_epi16() with instructions  */
    /* that exist in the AVX-512 instruction set      */
    b = _mm512_min_epi16(b, _mm512_set1_epi16(1));     /* clamp b between -1 and 1 */
    b = _mm512_max_epi16(b, _mm512_set1_epi16(-1));    /* now b = -1, 0 or 1       */
    a = _mm512_mullo_epi16(a, b);                      /* apply the sign of b to a */
    return a;
}

This solution should have reasonable throughput, but the latency might not be optimal due to the integer multiply. A good alternative is Peter Cordes' solution, which has better latency. In practice, though, high throughput is usually of more interest than low latency.

Anyway, the actual performance of the different alternatives (the solution here, Peter Cordes' answer, and the splitting idea in chtz' comment) depends on the surrounding code and the type of CPU that executes the instructions. You'll have to benchmark the alternatives to see which one is fastest in your particular case.
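
In case it helps, here is a minimal sketch of the splitting idea, based on my reading of chtz' comment (which isn't quoted here): process the 512-bit vector as two 256-bit halves with the existing AVX2 _mm256_sign_epi16, at the cost of extra extract/insert shuffles. The function name and structure are illustrative:

#include <immintrin.h>

__m512i mm512_sign_epi16_split(__m512i a, __m512i b){
    /* Split into 256-bit halves, use the AVX2 vpsignw, then recombine */
    __m256i alo = _mm512_castsi512_si256(a);          /* low 256 bits, no instruction  */
    __m256i ahi = _mm512_extracti64x4_epi64(a, 1);     /* high 256 bits                 */
    __m256i blo = _mm512_castsi512_si256(b);
    __m256i bhi = _mm512_extracti64x4_epi64(b, 1);
    __m256i lo  = _mm256_sign_epi16(alo, blo);         /* AVX2 vpsignw on each half     */
    __m256i hi  = _mm256_sign_epi16(ahi, bhi);
    return _mm512_inserti64x4(_mm512_castsi256_si512(lo), hi, 1);
}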

answered Dec 20 '22 by wim