SIMD unpack 12-bit fields to 16-bit

I need to unpack each 24 bits of input into two 16-bit values (3 bytes -> 4 bytes), i.e. each 12-bit field zero-extended to 16 bits. I already did it the naïve way, but I'm not happy with the performance.

For example, if InBuffer is a __m128i:

value1 = (uint16_t)InBuffer[0:11]        // bit ranges
value2 = (uint16_t)InBuffer[12:23]

value3 = (uint16_t)InBuffer[24:35]
value4 = (uint16_t)InBuffer[36:47]
... and so on for all 128 bits.

After the unpacking, the values should be stored in a __m256i variable.
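To make the intended bit order explicit, here is a minimal scalar sketch of the kind of naïve per-triplet loop I'd like to beat (the names and the little-endian field packing are just my illustration):

#include <stdint.h>
#include <stddef.h>

// Unpack n_pairs groups of 3 bytes into 2*n_pairs uint16_t values.
// Each 3-byte group holds two little-endian 12-bit fields.
void unpack12to16_scalar(const uint8_t *in, uint16_t *out, size_t n_pairs)
{
    for (size_t i = 0; i < n_pairs; i++) {
        uint32_t chunk = in[3*i]
                       | (uint32_t)in[3*i + 1] << 8
                       | (uint32_t)in[3*i + 2] << 16;
        out[2*i]     = chunk & 0xFFF;           // bits [0:11]
        out[2*i + 1] = (chunk >> 12) & 0xFFF;   // bits [12:23]
    }
}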

How can I solve this with AVX2? Probably using unpack / shuffle / permute intrinsics?

asked Oct 20 '25 by OC87

1 Answer

I'm assuming you're doing this in a loop over a large array. If you only used __m128i loads, you'd have 15 useful bytes, which would only produce 20 output bytes in your __m256i output. (Well, I guess the 21st byte of output would be present, since the 16th byte of the input vector holds the first 8 bits of a new bitfield. But then your next vector would need to shuffle differently.)

Much better to use 24 bytes of input, producing 32 bytes of output. Ideally with a load that splits down the middle, so the low 12 bytes are in the low 128-bit "lane", avoiding the need for a lane-crossing shuffle like _mm256_permutevar8x32_epi32. Instead you can just _mm256_shuffle_epi8 to put bytes where you want them, setting up for some shift/and.

// uses 24 bytes starting at p by doing a 32-byte load from p-4.
// Don't use this for the first vector of a page-aligned array, or the last
inline
__m256i unpack12to16(const char *p)
{
    __m256i v = _mm256_loadu_si256( (const __m256i*)(p-4) );
   // v= [ x H G F E | D C B A x ]   where each letter is a 3-byte pair of two 12-bit fields, and x is 4 bytes of garbage we load but ignore

    const __m256i bytegrouping =
        _mm256_setr_epi8(4,5, 5,6,  7,8, 8,9,  10,11, 11,12,  13,14, 14,15, // low half uses last 12B
                         0,1, 1,2,  3,4, 4,5,   6, 7,  7, 8,   9,10, 10,11); // high half uses first 12B
    v = _mm256_shuffle_epi8(v, bytegrouping);
    // each 16-bit chunk has the bits it needs, but not in the right position

    // in each chunk of 8 nibbles (4 bytes): [ f e d c | d c b a ]
    __m256i hi = _mm256_srli_epi16(v, 4);                              // [ 0 f e d | xxxx ]
    __m256i lo  = _mm256_and_si256(v, _mm256_set1_epi32(0x00000FFF));  // [ 0000 | 0 c b a ]

    return _mm256_blend_epi16(lo, hi, 0b10101010);
      // nibbles in each pair of epi16: [ 0 f e d | 0 c b a ] 
}

// Untested: I *think* I got my shuffle and blend controls right, but didn't check.
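For context, here's the kind of caller loop I'm assuming (untested sketch; unpack12to16_buffer and its parameter names are hypothetical). Because the 32-byte load overhangs the 24 useful bytes on both sides, handle the first and last 24-byte group separately, e.g. with scalar code or a copy into a padded buffer:

#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

// Process n_groups groups of 24 input bytes -> 16 uint16_t outputs each,
// calling unpack12to16 from above.  Groups 0 and n_groups-1 are skipped here
// because their loads would read before / past the buffer.
void unpack12to16_buffer(const char *in, uint16_t *out, size_t n_groups)
{
    for (size_t i = 1; i + 1 < n_groups; i++) {
        __m256i v = unpack12to16(in + 24*i);
        _mm256_storeu_si256( (__m256i*)(out + 16*i), v );
    }
}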

The standalone unpack12to16 compiles like this (Godbolt) with clang -O3 -march=znver2. Of course, when inlined into a loop, the vector constants would be loaded once, outside the loop.

unpack12to16(char const*):                    # @unpack12to16(char const*)
        vmovdqu ymm0, ymmword ptr [rdi - 4]
        vpshufb ymm0, ymm0, ymmword ptr [rip + .LCPI0_0] # ymm0 = ymm0[4,5,5,6,7,8,8,9,10,11,11,12,13,14,14,15,16,17,17,18,19,20,20,21,22,23,23,24,25,26,26,27]
        vpsrlw  ymm1, ymm0, 4
        vpand   ymm0, ymm0, ymmword ptr [rip + .LCPI0_1]
        vpblendw        ymm0, ymm0, ymm1, 170           # ymm0 = ymm0[0],ymm1[1],ymm0[2],ymm1[3],ymm0[4],ymm1[5],ymm0[6],ymm1[7],ymm0[8],ymm1[9],ymm0[10],ymm1[11],ymm0[12],ymm1[13],ymm0[14],ymm1[15]
        ret

On Intel CPUs (before Ice Lake) vpblendw only runs on port 5 (https://uops.info/), competing with vpshufb (_mm256_shuffle_epi8). But it's a single uop (unlike the vpblendvb variable-blend) with an immediate control. Still, that means a back-end ALU bottleneck of at best one vector per 2 cycles on Intel. If your src and dst are hot in L2 cache (or maybe only L1d), port 5 might be the limiting factor, but this is already 5 uops for the front end, so with loop overhead and a store you're already close to a front-end bottleneck.

Blending with another vpand / vpor would cost more front-end uops but would mitigate the back-end bottleneck on Intel (before Ice Lake). It would be worse on AMD, where vpblendw can run on any of the 4 FP execution ports, and worse on Ice Lake where vpblendw can run on p1 or p5. And like I said, cache load/store throughput might be a bigger bottleneck than port 5 anyway, so fewer front-end uops are definitely better to let out-of-order exec see farther.
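An untested sketch of that variant (the name unpack12to16_andor is mine; using vpandn lets it reuse the same 0x00000FFF constant):

#include <immintrin.h>

inline
__m256i unpack12to16_andor(const char *p)
{
    // Same overhanging load and byte-grouping shuffle as unpack12to16 above.
    __m256i v = _mm256_loadu_si256( (const __m256i*)(p-4) );
    const __m256i bytegrouping =
        _mm256_setr_epi8(4,5, 5,6,  7,8, 8,9,  10,11, 11,12,  13,14, 14,15,
                         0,1, 1,2,  3,4, 4,5,   6, 7,  7, 8,   9,10, 10,11);
    v = _mm256_shuffle_epi8(v, bytegrouping);

    const __m256i low12 = _mm256_set1_epi32(0x00000FFF);
    __m256i hi = _mm256_srli_epi16(v, 4);      // odd elements correct, even elements garbage
    __m256i lo = _mm256_and_si256(v, low12);   // even elements correct, odd elements zero
    hi = _mm256_andnot_si256(low12, hi);       // zero the even elements of hi (their top nibble is already 0 from the shift)
    return _mm256_or_si256(lo, hi);            // vpor: any vector ALU port, not just p5
}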


This may not be optimal; perhaps there's some way to set up for vpunpcklwd by getting the even (low) and odd (high) bit fields into the bottom 8 bytes of two separate input vectors even more cheaply? Or set up so we can merge with OR, without first needing to clear garbage in one input, instead of using vpblendw, which only runs on port 5 on Skylake?

Or something we can do with vpsrlvd? (But not vpsrlvw: per-element 16-bit shifts would require AVX-512BW.)


If you have AVX512VBMI, vpmultishiftqb is a parallel bitfield-extract. You'd just need to shuffle the right 3-byte pairs into the right 64-bit SIMD elements; then one _mm256_multishift_epi64_epi8 to put the good bits where you want them, plus a _mm256_and_si256 to zero the high 4 bits of each 16-bit field, will do the trick. (You can't quite take care of everything with zero-masking, or by shuffling some zeros into the input for multishift, because there won't be any zeros contiguous with the low 12-bit field.) Or you could set up for just an srli_epi16 that works for both low and high fields, instead of needing an AND constant, by having the multishift bitfield-extract line up both output fields with the bits you want at the top of each 16-bit element.

This may also allow a shuffle with larger granularity than bytes, although vpermb is actually fast on CPUs with AVX512VBMI, and unfortunately Ice Lake's vpermw is slower than vpermb.
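Here's a hedged, untested sketch of that AVX512VBMI idea; the function name and the exact vpermb / multishift control constants are my guesses at what the setup would look like, not something I've verified:

#include <immintrin.h>

// Requires AVX512VBMI + AVX512VL.  Uses 24 bytes starting at p, but the load
// reads 32, so don't use this for the last group of an array.
inline
__m256i unpack12to16_vbmi(const char *p)
{
    __m256i v = _mm256_loadu_si256( (const __m256i*)p );

    // vpermb: put each 6-byte group (two 3-byte pairs = four 12-bit fields)
    // into the low 6 bytes of one 64-bit element.  -1 indices are don't-care
    // bytes (vpermb doesn't zero them; the final AND hides them).
    const __m256i grp = _mm256_setr_epi8( 0, 1, 2, 3, 4, 5, -1,-1,
                                          6, 7, 8, 9,10,11, -1,-1,
                                         12,13,14,15,16,17, -1,-1,
                                         18,19,20,21,22,23, -1,-1);
    v = _mm256_permutexvar_epi8(grp, v);

    // vpmultishiftqb: each destination byte takes an 8-bit window from its
    // qword, starting at the bit offset given by the control byte.
    // Offsets {0,8, 12,20, 24,32, 36,44} drop each 12-bit field at bit 0
    // of its 16-bit element (with 4 garbage bits on top of the odd bytes).
    const __m256i ctrl = _mm256_set1_epi64x(0x2C242018140C0800);
    v = _mm256_multishift_epi64_epi8(ctrl, v);

    // Zero the high 4 bits of each 16-bit field.
    return _mm256_and_si256(v, _mm256_set1_epi16(0x0FFF));
}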

With AVX-512 but not AVX512VBMI, working in 256-bit chunks lets us do the same thing as AVX2 but avoiding the blend. Instead, use merge-masking for the right shift, or vpsrlvw with a control vector to only shift the odd elements. For 256-bit vectors, this is probably as good as vpmultishiftqb.
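An untested sketch of the merge-masked version (assuming AVX512BW + AVX512VL, and reusing the same load and vpshufb setup as the AVX2 function; the name is hypothetical):

#include <immintrin.h>

inline
__m256i unpack12to16_avx512bw(const char *p)
{
    __m256i v = _mm256_loadu_si256( (const __m256i*)(p-4) );
    const __m256i bytegrouping =
        _mm256_setr_epi8(4,5, 5,6,  7,8, 8,9,  10,11, 11,12,  13,14, 14,15,
                         0,1, 1,2,  3,4, 4,5,   6, 7,  7, 8,   9,10, 10,11);
    v = _mm256_shuffle_epi8(v, bytegrouping);

    // Even 16-bit elements: just mask to the low 12 bits.
    __m256i lo = _mm256_and_si256(v, _mm256_set1_epi32(0x00000FFF));
    // Odd 16-bit elements (mask 0xAAAA): shift right by 4; the rest merge from lo.
    return _mm256_mask_srli_epi16(lo, 0xAAAA, v, 4);
}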

answered Oct 23 '25 by Peter Cordes


