As the question says, I have a normal int that holds 8 packed values of 4 bits each, and I would like to zero-extend it into a 256-bit vector register. Is that possible with SSE/AVX/AVX2?
The solution by chtz (called cvt_nib_epi32_chtz in the remainder) is very suitable
for general purposes. However, in some specific cases, the solutions presented below might
be slightly more efficient:
/* gcc -O3 -m64 -Wall -march=skylake cvt_nib_epi32.c */
#include <immintrin.h>
#include <stdio.h>
#include <stdint.h>

__m256i cvt_nib_epi32_SKL(uint32_t x) {     /* Efficient on Intel Skylake and newer */
    /* Broadcast x to all 8 elements */
    __m256i input = _mm256_set1_epi32(x);
    /* Shift each nibble to its right position */
    __m256i shifted = _mm256_srlv_epi32(input, _mm256_set_epi32(28, 24, 20, 16, 12, 8, 4, 0));
    /* Mask off the unwanted bits and return */
    return _mm256_and_si256(shifted, _mm256_set1_epi32(0xF));
}

__m256i cvt_nib_epi32_HSW(uint32_t x) {     /* Efficient on Intel Haswell and Broadwell, */
                                            /* but very inefficient on AMD Zen!          */
    uint64_t x_b = _pdep_u64(x, 0x0F0F0F0F0F0F0F0F);  /* Expand the nibbles to bytes              */
    __m128i  x_v = _mm_cvtsi64_si128(x_b);            /* Move x_b from a GPR to a vector register */
    return _mm256_cvtepu8_epi32(x_v);                 /* Zero-extend the bytes to 32-bit elements */
}
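A small test driver (my own addition, not part of the original answer): with x = 0x76543210, element i of the result should equal nibble i, so the program prints 0 1 2 3 4 5 6 7.
int main() {
    uint32_t x = 0x76543210;                              /* nibble i has value i */
    uint32_t out[8];
    _mm256_storeu_si256((__m256i *)out, cvt_nib_epi32_SKL(x));
    for (int i = 0; i < 8; i++) printf("%u ", out[i]);    /* expected: 0 1 2 3 4 5 6 7 */
    printf("\n");
    return 0;
}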
The following assembly is generated by gcc:
cvt_nib_epi32_SKL:
        vmovd           xmm0, edi
        vpbroadcastd    ymm0, xmm0
        vpsrlvd         ymm0, ymm0, YMMWORD PTR .LC0[rip]
        vpand           ymm0, ymm0, YMMWORD PTR .LC1[rip]
        ret

cvt_nib_epi32_HSW:
        movabs          rax, 1085102592571150095
        mov             edi, edi
        pdep            rdi, rdi, rax
        vmovq           xmm0, rdi
        vpmovzxbd       ymm0, xmm0
        ret

cvt_nib_epi32_chtz:
        vmovd           xmm0, edi
        vpsrld          xmm1, xmm0, 4
        vpunpcklbw      xmm0, xmm0, xmm1
        vpand           xmm0, xmm0, XMMWORD PTR .LC2[rip]
        vpmovzxbd       ymm0, xmm0
        ret
Function cvt_nib_epi32_chtz is very suitable for the AMD Zen microarchitecture, because it avoids the pdep and vpsrlvd instructions, which are slow on those processors. On Intel processors, cvt_nib_epi32_chtz may suffer from high port 5 (p5) pressure, depending on the surrounding code, because vmovd, vpunpcklbw, and vpmovzxbd all execute on p5. The other two functions decode to only 2 p5 uops each.
The Skylake solution cvt_nib_epi32_SKL uses vpsrlvd, which is slow on Intel Haswell and Broadwell. For those processors cvt_nib_epi32_HSW is suitable; it uses the BMI2 instruction pdep, which is very(!) slow on the AMD Zen microarchitecture. Note that cvt_nib_epi32_HSW should also work well on Intel Skylake, but (again) the actual performance depends on the surrounding code.
Note that in a loop context the constant loads, such as YMMWORD PTR .LC0[rip] and movabs rax, 1085102592571150095, are likely hoisted out of the loop. In that case cvt_nib_epi32_SKL and cvt_nib_epi32_HSW need only 4 uops each.
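To illustrate the loop context, here is a minimal sketch (my own addition, assuming the cvt_nib_epi32_SKL function above and an AVX2 target) that converts an array of packed-nibble uint32_t values; the shift counts and the 0xF mask are loop invariant, so the compiler can hoist their loads out of the loop:
void cvt_nib_array(const uint32_t *src, uint32_t *dst, size_t n) {
    for (size_t i = 0; i < n; i++) {
        __m256i v = cvt_nib_epi32_SKL(src[i]);              /* 8 zero-extended nibbles    */
        _mm256_storeu_si256((__m256i *)(dst + 8 * i), v);   /* store 8 uint32_t per input */
    }
}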
Here is a solution that should keep the order:
#include <immintrin.h>

__m256i foo(int x) {
    __m128i input = _mm_cvtsi32_si128(x);
    __m128i even = input;
    // move the odd nibbles to even positions:
    __m128i odd = _mm_srli_epi32(input, 4);
    // interleave even and odd nibbles (only the lower 64 bits are used):
    __m128i inter = _mm_unpacklo_epi8(even, odd);
    // mask out the unwanted high nibbles:
    __m128i masked = _mm_and_si128(inter, _mm_set1_epi32(0x0f0f0f0f));
    // zero-extend the bytes to 32-bit elements:
    return _mm256_cvtepu8_epi32(masked);
}
Godbolt link: https://godbolt.org/z/8RLUVE
You could make this slightly more efficient by loading two or four int32 values at once and doing the interleaving and masking of the even and odd nibbles on the wider data. (This would of course result in multiple __m256i vectors.) A sketch of the two-at-once variant follows.
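Here is a minimal sketch of that idea (my own, not from the answer): it assumes the two packed-nibble int32 values are passed together as one uint64_t, and the name cvt_nib_epi32_x2 is made up here. Because _mm_srli_epi32 shifts each 32-bit lane independently, the shift, interleave, and mask handle all 16 nibbles at once, and two vpmovzxbd conversions produce the two result vectors:
#include <immintrin.h>
#include <stdint.h>

void cvt_nib_epi32_x2(uint64_t two_x, __m256i *lo, __m256i *hi) {
    __m128i input  = _mm_cvtsi64_si128((int64_t)two_x);       // 16 nibbles in the low 64 bits
    __m128i odd    = _mm_srli_epi32(input, 4);                // odd nibbles to even positions, per 32-bit lane
    __m128i inter  = _mm_unpacklo_epi8(input, odd);           // one nibble per byte, in order
    __m128i masked = _mm_and_si128(inter, _mm_set1_epi32(0x0f0f0f0f));
    *lo = _mm256_cvtepu8_epi32(masked);                       // nibbles of the first int32
    *hi = _mm256_cvtepu8_epi32(_mm_srli_si128(masked, 8));    // nibbles of the second int32
}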