As the question says, I have a normal int that holds 8 packed values of 4 bits each, and I would like to zero-extend it into a 256-bit vector register. Is that possible with SSE/AVX/AVX2?
The solution by chtz (called cvt_nib_epi32_chtz in the remainder) is very suitable
for general purposes. However, in some specific cases, the solutions presented below might
be slightly more efficient:
/* gcc -O3 -m64 -Wall -march=skylake cvt_nib_epi32.c */
#include <immintrin.h>
#include <stdio.h>
#include <stdint.h>

__m256i cvt_nib_epi32_SKL(uint32_t x) {     /* Efficient on Intel Skylake and newer */
    /* Broadcast x to all 8 elements */
    __m256i input = _mm256_set1_epi32(x);
    /* Shift each nibble to its right position */
    __m256i shifted = _mm256_srlv_epi32(input, _mm256_set_epi32(28, 24, 20, 16, 12, 8, 4, 0));
    /* Mask off the unwanted bits and return */
    return _mm256_and_si256(shifted, _mm256_set1_epi32(0xF));
}

__m256i cvt_nib_epi32_HSW(uint32_t x) {     /* Efficient on Intel Haswell and Broadwell, */
                                            /* but very inefficient on AMD Zen!          */
    uint64_t x_b = _pdep_u64(x, 0x0F0F0F0F0F0F0F0F);  /* Expand the nibbles to bytes              */
    __m128i  x_v = _mm_cvtsi64_si128(x_b);            /* Move x_b from a GPR to a vector register */
    return _mm256_cvtepu8_epi32(x_v);                 /* Zero-extend the bytes to 32-bit elements */
}
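A small test driver (my own addition, not part of the original answer): with x = 0x76543210, element i of the result should equal nibble i, so the program prints 0 1 2 3 4 5 6 7.
int main() {
    uint32_t x = 0x76543210;                              /* nibble i has value i */
    uint32_t out[8];
    _mm256_storeu_si256((__m256i *)out, cvt_nib_epi32_SKL(x));
    for (int i = 0; i < 8; i++) printf("%u ", out[i]);    /* expected: 0 1 2 3 4 5 6 7 */
    printf("\n");
    return 0;
}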
The following assembly is generated by gcc:
cvt_nib_epi32_SKL:
        vmovd           xmm0, edi
        vpbroadcastd    ymm0, xmm0
        vpsrlvd         ymm0, ymm0, YMMWORD PTR .LC0[rip]
        vpand           ymm0, ymm0, YMMWORD PTR .LC1[rip]
        ret

cvt_nib_epi32_HSW:
        movabs          rax, 1085102592571150095
        mov             edi, edi
        pdep            rdi, rdi, rax
        vmovq           xmm0, rdi
        vpmovzxbd       ymm0, xmm0
        ret

cvt_nib_epi32_chtz:
        vmovd           xmm0, edi
        vpsrld          xmm1, xmm0, 4
        vpunpcklbw      xmm0, xmm0, xmm1
        vpand           xmm0, xmm0, XMMWORD PTR .LC2[rip]
        vpmovzxbd       ymm0, xmm0
        ret
Function cvt_nib_epi32_chtz is very suitable for the AMD Zen microarchitecture, because it avoids the pdep and vpsrlvd instructions, which are slow on those processors. On Intel processors, cvt_nib_epi32_chtz may suffer from high port 5 (p5) pressure, depending on the surrounding code, because vmovd, vpunpcklbw, and vpmovzxbd all execute on p5. The other two functions decode to only 2 p5 uops each.
The Skylake solution cvt_nib_epi32_SKL uses vpsrlvd, which is slow on Intel Haswell and Broadwell. For those processors cvt_nib_epi32_HSW is suitable; it uses the BMI2 instruction pdep, which is very(!) slow on the AMD Zen microarchitecture. Note that cvt_nib_epi32_HSW should also work well on Intel Skylake, but (again) the actual performance depends on the surrounding code.
Note that in a loop context the constant loads, such as YMMWORD PTR .LC0[rip] and movabs rax, 1085102592571150095, are likely hoisted out of the loop. In that case cvt_nib_epi32_SKL and cvt_nib_epi32_HSW need only 4 uops each.
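To illustrate the loop context, here is a minimal sketch (my own addition, assuming the cvt_nib_epi32_SKL function above and an AVX2 target) that converts an array of packed-nibble uint32_t values; the shift counts and the 0xF mask are loop invariant, so the compiler can hoist their loads out of the loop:
void cvt_nib_array(const uint32_t *src, uint32_t *dst, size_t n) {
    for (size_t i = 0; i < n; i++) {
        __m256i v = cvt_nib_epi32_SKL(src[i]);              /* 8 zero-extended nibbles    */
        _mm256_storeu_si256((__m256i *)(dst + 8 * i), v);   /* store 8 uint32_t per input */
    }
}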
Here is a solution that should keep the order:
#include <immintrin.h>

__m256i foo(int x) {
    __m128i input = _mm_cvtsi32_si128(x);
    __m128i even = input;
    // move the odd nibbles to even positions:
    __m128i odd = _mm_srli_epi32(input, 4);
    // interleave even and odd nibbles (only the lower 64 bits are used):
    __m128i inter = _mm_unpacklo_epi8(even, odd);
    // mask out the unwanted high nibbles:
    __m128i masked = _mm_and_si128(inter, _mm_set1_epi32(0x0f0f0f0f));
    // zero-extend the bytes to 32-bit elements:
    return _mm256_cvtepu8_epi32(masked);
}
Godbolt link: https://godbolt.org/z/8RLUVE
You could make this slightly more efficient by loading two or four int32 values at once and doing the interleaving and masking of the even and odd nibbles on the wider data. (This would of course result in multiple __m256i vectors.) A sketch of the two-at-once variant follows.
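Here is a minimal sketch of that idea (my own, not from the answer): it assumes the two packed-nibble int32 values are passed together as one uint64_t, and the name cvt_nib_epi32_x2 is made up here. Because _mm_srli_epi32 shifts each 32-bit lane independently, the shift, interleave, and mask handle all 16 nibbles at once, and two vpmovzxbd conversions produce the two result vectors:
#include <immintrin.h>
#include <stdint.h>

void cvt_nib_epi32_x2(uint64_t two_x, __m256i *lo, __m256i *hi) {
    __m128i input  = _mm_cvtsi64_si128((int64_t)two_x);       // 16 nibbles in the low 64 bits
    __m128i odd    = _mm_srli_epi32(input, 4);                // odd nibbles to even positions, per 32-bit lane
    __m128i inter  = _mm_unpacklo_epi8(input, odd);           // one nibble per byte, in order
    __m128i masked = _mm_and_si128(inter, _mm_set1_epi32(0x0f0f0f0f));
    *lo = _mm256_cvtepu8_epi32(masked);                       // nibbles of the first int32
    *hi = _mm256_cvtepu8_epi32(_mm_srli_si128(masked, 8));    // nibbles of the second int32
}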