Having 32 bits stored in a uint32_t
in memory, what's the fastest way to unpack each bit to a separate byte element of an AVX register? The bits can be in any position within their respective byte.
Edit: to clarify, I mean bit 0 goes to byte 0, bit 1 to byte 1. Obviously all other bits within the byte on zero. Best I could at the moment is 2 PSHUFB
and having a mask register for each position.
If the uint32_t
is a bitmap, then the corresponding vector elements should be 0 or non-0. (i.e. so we could get a vector mask with a vpcmpeqb
against a vector of all-zero).
https://software.intel.com/en-us/forums/topic/283382
To "broadcast" the 32 bits of a 32-bit integer x
to 32 bytes of a 256-bit YMM register z
or 16 bytes of a two 128-bit XMM registers z_low
and z_high
you can do the following.
With AVX2:
__m256i y = _mm256_set1_epi32(x);
__m256i z = _mm256_shuffle_epi8(y,mask1);
z = _mm256_and_si256(z,mask2);
Without AVX2 it's best to do this with SSE:
__m128i y = _mm_set1_epi32(x);
__m128i z_low = _mm_shuffle_epi8(y,mask_low);
__m128i z_high = _mm_shuffle_epi8(y,mask_high);
z_low = _mm_and_si128(z_low ,mask2);
z_high = _mm_and_si128(z_high,mask2);
The masks and a working example are shown below. If you plan to do this several times you should probably define the masks outside of the main loop.
#include <immintrin.h>
#include <stdio.h>
int main() {
int x = 0x87654321;
static const char mask1a[32] = {
0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00,
0x01, 0x01, 0x01, 0x01,
0x01, 0x01, 0x01, 0x01,
0x02, 0x02, 0x02, 0x02,
0x02, 0x02, 0x02, 0x02,
0x03, 0x03, 0x03, 0x03,
0x03, 0x03, 0x03, 0x03
};
static const char mask2a[32] = {
0x01, 0x02, 0x04, 0x08,
0x10, 0x20, 0x40, 0x80,
0x01, 0x02, 0x04, 0x08,
0x10, 0x20, 0x40, 0x80,
0x01, 0x02, 0x04, 0x08,
0x10, 0x20, 0x40, 0x80,
0x01, 0x02, 0x04, 0x08,
0x10, 0x20, 0x40, 0x80,
};
char out[32];
#if defined ( __AVX2__ )
__m256i mask2 = _mm256_loadu_si256((__m256i*)mask2a);
__m256i mask1 = _mm256_loadu_si256((__m256i*)mask1a);
__m256i y = _mm256_set1_epi32(x);
__m256i z = _mm256_shuffle_epi8(y,mask1);
z = _mm256_and_si256(z,mask2);
_mm256_storeu_si256((__m256i*)out,z);
#else
__m128i mask2 = _mm_loadu_si128((__m128i*)mask2a);
__m128i mask_low = _mm_loadu_si128((__m128i*)&mask1a[ 0]);
__m128i mask_high = _mm_loadu_si128((__m128i*)&mask1a[16]);
__m128i y = _mm_set1_epi32(x);
__m128i z_low = _mm_shuffle_epi8(y,mask_low);
__m128i z_high = _mm_shuffle_epi8(y,mask_high);
z_low = _mm_and_si128(z_low,mask2);
z_high = _mm_and_si128(z_high,mask2);
_mm_storeu_si128((__m128i*)&out[ 0],z_low);
_mm_storeu_si128((__m128i*)&out[16],z_high);
#endif
for(int i=0; i<8; i++) {
for(int j=0; j<4; j++) {
printf("%x ", out[4*i+j]);
}printf("\n");
} printf("\n");
}
It takes one extra step _mm256_cmpeq_epi8
against all-zeros. Any non-zero turns into 0, and zero turns into -1. If we don't want this inversion, use andnot
instead of and
. It inverts its first operand.
__m256i expand_bits_to_bytes(uint32_t x)
{
__m256i xbcast = _mm256_set1_epi32(x); // we only use the low 32bits of each lane, but this is fine with AVX2
// Each byte gets the source byte containing the corresponding bit
__m256i shufmask = _mm256_set_epi64x(
0x0303030303030303, 0x0202020202020202,
0x0101010101010101, 0x0000000000000000);
__m256i shuf = _mm256_shuffle_epi8(xbcast, shufmask);
__m256i andmask = _mm256_set1_epi64x(0x8040201008040201); // every 8 bits -> 8 bytes, pattern repeats.
__m256i isolated_inverted = _mm256_andnot_si256(shuf, andmask);
// this is the extra step: compare each byte == 0 to produce 0 or -1
return _mm256_cmpeq_epi8(isolated_inverted, _mm256_setzero_si256());
// alternative: compare against the AND mask to get 0 or -1,
// avoiding the need for a vector zero constant.
}
See it on the Godbolt Compiler Explorer.
Also see is there an inverse instruction to the movemask instruction in intel avx2? for other element sizes.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With