
Why is masking needed before using a pshufb shuffle as a lookup table for nibbles?

Tags: c++, avx, simd, sse, avx2

This code comes from https://github.com/WojciechMula/sse-popcount/blob/master/popcnt-avx2-lookup.cpp.

std::uint64_t popcnt_AVX2_lookup(const uint8_t* data, const size_t n) {

    size_t i = 0;

    const __m256i lookup = _mm256_setr_epi8(
        /* 0 */ 0, /* 1 */ 1, /* 2 */ 1, /* 3 */ 2,
        /* 4 */ 1, /* 5 */ 2, /* 6 */ 2, /* 7 */ 3,
        /* 8 */ 1, /* 9 */ 2, /* a */ 2, /* b */ 3,
        /* c */ 2, /* d */ 3, /* e */ 3, /* f */ 4,

        /* 0 */ 0, /* 1 */ 1, /* 2 */ 1, /* 3 */ 2,
        /* 4 */ 1, /* 5 */ 2, /* 6 */ 2, /* 7 */ 3,
        /* 8 */ 1, /* 9 */ 2, /* a */ 2, /* b */ 3,
        /* c */ 2, /* d */ 3, /* e */ 3, /* f */ 4
    );

    const __m256i low_mask = _mm256_set1_epi8(0x0f);

    __m256i acc = _mm256_setzero_si256();

#define ITER { \
        const __m256i vec = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(data + i)); \
        const __m256i lo  = _mm256_and_si256(vec, low_mask); \
\\\     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ why do we need this? 
        const __m256i hi  = _mm256_and_si256(_mm256_srli_epi16(vec, 4), low_mask); \
        const __m256i popcnt1 = _mm256_shuffle_epi8(lookup, lo); \
        const __m256i popcnt2 = _mm256_shuffle_epi8(lookup, hi); \
        local = _mm256_add_epi8(local, popcnt1); \
        local = _mm256_add_epi8(local, popcnt2); \
        i += 32; \
    }

    while (i + 8*32 <= n) {
        __m256i local = _mm256_setzero_si256();
        ITER ITER ITER ITER
        ITER ITER ITER ITER
        acc = _mm256_add_epi64(acc, _mm256_sad_epu8(local, _mm256_setzero_si256()));
    }

...the rest is unrelated to the question

The code replaces the builtin_popcnt function, which counts the number of 1 bits in its input. What bothers me are these two lines:

const __m256i lo  = _mm256_and_si256(vec, low_mask); \
const __m256i hi  = _mm256_and_si256(_mm256_srli_epi16(vec, 4), low_mask); \

According to the Intel intrinsics guide (https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#techs=AVX,AVX2&ig_expand=6392,305,6221,6389,6389,6221,6188,6769,6389,124,6050,6389&text=mm256_shuffle), the _mm256_shuffle_epi8 intrinsic only looks at the lower 4 bits of each byte of the control vector b:

__m256i _mm256_shuffle_epi8 (__m256i a, __m256i b)
FOR j := 0 to 15
    i := j*8
    IF b[i+7] == 1
        dst[i+7:i] := 0
    ELSE
        index[3:0] := b[i+3:i] 
\\\     ^^^^^^^^^^^^^^^^^^^^^^ only look at lower 4 bits
        dst[i+7:i] := a[index*8+7:index*8]
    FI
    IF b[128+i+7] == 1
        dst[128+i+7:128+i] := 0
    ELSE
        index[3:0] := b[128+i+3:128+i]
        dst[128+i+7:128+i] := a[128+index*8+7:128+index*8]
    FI
ENDFOR
dst[MAX:256] := 0

So if I'm not mistaken, you can just do

const __m256i lo  = vec; \
const __m256i hi  = _mm256_srli_epi16(vec, 4); \

I'm fairly new to AVX, so please tell me if there's anything wrong here.

asked Oct 20 '25 by phosphorylation

1 Answer

Unfortunately, [v]pshufb also looks at the high bit of each control byte and zeroes that output element when it's set. In the pseudocode you quoted:

    IF b[i+7] == 1                 # if high-bit set
        dst[i+7:i] := 0            # zero that output element
    ELSE
         ... the part you were looking at   # else index the source

The intrinsics guide only covers this in the pseudocode, not the text.
As usual, the asm manual entry's description is more explicit:

If the most significant bit (bit[7]) of each byte of the shuffle control mask is set, then constant zero is written in the result byte

That zeroing is useful for some problems, but using pshufb as a nibble LUT does require 2 [v]pand instructions — including for the high nibbles, because x86 doesn't have a SIMD byte shift. The narrowest is psrlw with 16-bit elements, so every other byte (the low byte of each 16-bit word) gets its neighbour's bits shifted into its high bit, unless your input data is known to always have those bit positions clear.
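
Here's a minimal demo of the failure mode (hypothetical, not from the linked repo): with an input byte like 0x93, the masked lookup returns popcount of the low nibble 3 = 2, while feeding vec straight to vpshufb returns 0 because the control byte's high bit is set.

#include <immintrin.h>
#include <cstdint>
#include <cstdio>

int main() {
    const __m256i lookup = _mm256_setr_epi8(
        0,1,1,2, 1,2,2,3, 1,2,2,3, 2,3,3,4,
        0,1,1,2, 1,2,2,3, 1,2,2,3, 2,3,3,4);
    const __m256i low_mask = _mm256_set1_epi8(0x0f);

    const __m256i vec = _mm256_set1_epi8((char)0x93);  // low nibble = 3, so the LUT should give 2

    __m256i masked   = _mm256_shuffle_epi8(lookup, _mm256_and_si256(vec, low_mask));
    __m256i unmasked = _mm256_shuffle_epi8(lookup, vec);  // control byte 0x93 has bit 7 set -> output zeroed

    uint8_t a[32], b[32];
    _mm256_storeu_si256(reinterpret_cast<__m256i*>(a), masked);
    _mm256_storeu_si256(reinterpret_cast<__m256i*>(b), unmasked);
    printf("masked: %u  unmasked: %u\n", a[0], b[0]);   // prints "masked: 2  unmasked: 0"
}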


AVX-512VBMI (Ice Lake and newer) vpermb doesn't have this downside, but it's lane-crossing, so it has 3c latency instead of 1 on CPUs that support it. Luckily it's still only 1 uop on Ice Lake, unlike vpermt2w and vpermt2b, which are multiple uops even on Ice Lake (https://uops.info). Unfortunately, even vpermw is still slow on Intel CPUs with fast vpermb, so you can't use the backwards-compatible instruction and just get higher performance on newer CPUs.

But vpermb could be slower on any future CPUs that handle AVX-512 by decoding into 2x 256-bit halves, like some future Intel Efficiency cores, if they don't special-case the 128/256-bit (xmm/ymm) versions of vpermb. At least for latency, if not for throughput. (Alder Lake E-cores have 128-bit-wide execution units and already split 256-bit vectors into two halves; supporting AVX-512 with 4 uops per instruction would start to get silly, I guess. And unfortunately Intel didn't design a way to expose the new AVX-512 functionality at only 256-bit width, like masking, better shuffles, vpternlogd, etc. Update: Intel eventually defined AVX10 for this, but didn't get AVX10.1 out the door before dropping the 256-bit-only option.)

Zen 4 has efficient handling of 512-bit instructions: still single-uop, with at worst half the throughput of 256-bit ops, the same uop occupying an execution unit for 2 cycles. (Update: Zen 5 has full-width 512-bit execution units.)

So unlike Zen 1, where lane-crossing AVX1/2 shuffles like vpermq and vperm2f128 were several uops because the shuffle units were truly only 128 bits wide, Zen 4 has 1/clock throughput for vpermb zmm vs. 2/clock for vpermb ymm/xmm. The 512-bit version has 6-cycle latency, up from 4 cycles for ymm and 2 for xmm. (https://uops.info/)


Using vpermb as a drop-in replacement for vpshufb, the LUT can still be broadcast-loaded from a 16-byte source, since it just repeats in each lane. Then you can leave the bits above the low 4 unzeroed, as long as indices 0, 16, 32, and 48 all read the same value, and so on.
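
For example, here's a sketch of the inner step with vpermb (my own, not from the repo; it assumes AVX-512VBMI and AVX-512BW, and the function name is made up). The 16-byte LUT is broadcast to all four lanes, vpermb zmm only uses index bits [5:0], and bits 4-5 select an identical copy of the table, so neither nibble needs a vpand:

#include <immintrin.h>
#include <cstdint>

__m512i popcnt_bytes_vbmi(__m512i vec) {
    alignas(16) static const uint8_t lut16[16] = {0,1,1,2, 1,2,2,3, 1,2,2,3, 2,3,3,4};
    const __m512i lookup = _mm512_broadcast_i32x4(
        _mm_load_si128(reinterpret_cast<const __m128i*>(lut16)));

    const __m512i lo = vec;                        // bits 6-7 ignored; bits 4-5 hit a duplicate copy of the LUT
    const __m512i hi = _mm512_srli_epi16(vec, 4);  // garbage in the high bits is harmless for the same reason

    return _mm512_add_epi8(_mm512_permutexvar_epi8(lo, lookup),
                           _mm512_permutexvar_epi8(hi, lookup));
}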

Or of course it opens up the possibility of a wider LUT, like for extremely efficient base64 encoding with vpmultishiftqb for parallel bitfield extraction. (https://github.com/aklomp/base64/blob/master/lib/arch/avx512/enc_reshuffle_translate.c or https://github.com/WojciechMula/base64simd)

answered Oct 22 '25 by Peter Cordes


