Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to achieve the effect of vpmovmskb on ZMM registers?

The quirky instruction (v)pmovmskb, dating back to SSE, takes the most significant bits of the bytes in an mm, xmm, or ymm register and moves them into a general purpose register. This is very useful for classifying vector elements or performing SWAR operations on individual bits. Specifically, I have used this instruction in a previous answer to compute a positional population count.

Unfortunately, this instruction has not been extended to ZMM registers and is surprisingly absent from the AVX-512 roster. How can I emulate its effect efficiently for ZMM registers? What similar/other options do I have?

like image 570
fuz Avatar asked Aug 07 '20 11:08

fuz


1 Answers

There is an instruction for that in AVX512BW, just with a different name. _mm512_movepi8_mask / vpmovb2m k, zmm, available in every element size from byte to qword.
(AVX512DQ for the D and Q versions, AVX512BW for the B and W versions).

There's also the mask->vector inverse movemask, vpmovm2b (again available in all element sizes).


AVX512 of course also has various cmp and test into mask instructions, so with a set1_epi8(1<<n) vector, you can grab any bit-position into a mask register with vptestmb k2{k1}, zmm2, zmm3/m512 ; _mm512_test_epi8_mask. Note that unlike vpmov2bm, it supports zero-masking into the destination to effectively AND with another k mask for free, so it might be worth using even if you just want the high bit.

There's also a NAND version vptestnmb. The D and Q versions of these support broadcast-memory source operands, but B and W versions don't.

With 8 different mask constants, you can extract different bits in an unrolled loop without spending any shift instructions. Or you can extract different bits from different elements.

These are all AVX512BW, available on AVX512 CPUs since Skylake-AVX512, but not Xeon Phi (KNL / KNM).

like image 93
Peter Cordes Avatar answered Oct 13 '22 00:10

Peter Cordes