
Optimal uint8_t bitmap into an 8 x 32-bit SIMD "bool" vector

Tags: c++11, avx, simd, avx2

As part of a compression algorithm, I am looking for the optimal way to achieve the following:

I have a simple bitmap in a uint8_t, for example 01010011.

What I want is a __m256i of the form: (0, maxint, 0, maxint, 0, 0, maxint, maxint)

One way to achieve this is by shuffling a vector of 8 x maxint into a vector of zeros. But that first requires me to expand my uint8_t to the right shuffle bitmap.

I am wondering if there is a better way?
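
For reference, the plain scalar version of the expansion I want looks something like this (a minimal sketch, mapping bit i of the byte to 32-bit lane i; flipping the order is trivial, and the function name is just for illustration):

#include <cstdint>
#include <immintrin.h>

// Scalar reference: expand each bit of the byte into a full 32-bit lane.
__m256i expand_bitmap_scalar(uint8_t u)
{
    uint32_t lanes[8];
    for (int i = 0; i < 8; ++i)
        lanes[i] = (u >> i) & 1 ? 0xFFFFFFFFu : 0u;   // bit i -> lane i
    return _mm256_loadu_si256(reinterpret_cast<const __m256i*>(lanes));
}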

Asked by Thomas Kejser

1 Answer

I think I'd probably go for the "brute force and ignorance" approach initially, maybe something like this:

#include <cstdint>
#include <immintrin.h>

uint8_t u = 0x53; // 01010011

// 16-entry LUT: entry i has 32-bit lane j set to all-ones iff bit j of i is set.
// (int32_t elements so that the -1 initialisers are not narrowing conversions.)
const union {
    int32_t a[4];
    __m128i v;
} kLUT[16] = { { {  0,  0,  0,  0 } },
               { { -1,  0,  0,  0 } },
               { {  0, -1,  0,  0 } },
               { { -1, -1,  0,  0 } },
               { {  0,  0, -1,  0 } },
               { { -1,  0, -1,  0 } },
               { {  0, -1, -1,  0 } },
               { { -1, -1, -1,  0 } },
               { {  0,  0,  0, -1 } },
               { { -1,  0,  0, -1 } },
               { {  0, -1,  0, -1 } },
               { { -1, -1,  0, -1 } },
               { {  0,  0, -1, -1 } },
               { { -1,  0, -1, -1 } },
               { {  0, -1, -1, -1 } },
               { { -1, -1, -1, -1 } } };
// Low nibble selects the low 128 bits, high nibble the high 128 bits.
__m256i v = _mm256_set_m128i(kLUT[u >> 4].v, kLUT[u & 15].v);

Using clang -O3 this compiles to:

movl    %ebx, %eax                ;; eax = ebx = u
andl    $15, %eax                 ;; get low offset = (u & 15) * 16
shlq    $4, %rax
leaq    _main.kLUT(%rip), %rcx    ;; rcx = kLUT
vmovaps (%rax,%rcx), %xmm0        ;; load low half of ymm0 from kLUT
andl    $240, %ebx                ;; get high offset = (u >> 4) * 16
vinsertf128 $1, (%rbx,%rcx), %ymm0, %ymm0
                                  ;; load high half of ymm0 from kLUT
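
Wrapped up as a function (a sketch; the names are illustrative and it assumes kLUT above is visible at this scope):

// Hypothetical wrapper around the LUT lookup above.
__m256i expand_bitmap_lut(uint8_t u)
{
    return _mm256_set_m128i(kLUT[u >> 4].v, kLUT[u & 15].v);   // high half, low half
}

// Quick sanity check: lane i should be all-ones exactly when bit i of u is set.
bool check_expand(uint8_t u)
{
    alignas(32) uint32_t lanes[8];
    _mm256_store_si256(reinterpret_cast<__m256i*>(lanes), expand_bitmap_lut(u));
    for (int i = 0; i < 8; ++i)
        if (lanes[i] != ((u >> i) & 1 ? 0xFFFFFFFFu : 0u))
            return false;
    return true;
}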

FWIW I threw together a simple test harness for five implementations: (i) a simple scalar reference implementation, (ii) the above code, (iii) an implementation based on @Zboson's answer, (iv) a slightly improved version of (iii), and (v) a further improvement on (iv) using a suggestion from @MarcGlisse. I got the following results on a 2.6 GHz Haswell CPU (compiled with clang -O3):

scalar code:                                 7.55336 ns / vector
Paul R:                                      1.36016 ns / vector
Z boson:                                     1.24863 ns / vector
Z boson (improved):                          1.07590 ns / vector
Z boson (improved + @MarcGlisse suggestion): 1.08195 ns / vector

So @Zboson's solutions win by around 10% - 20%, presumably because they need only one load, versus two for mine.

If any other implementations come in, I'll add them to the test harness and update the results.


Slightly improved version of @Zboson's implementation:
__m256i v = _mm256_set1_epi8(u);                        // broadcast the byte to all 32 lanes
v = _mm256_and_si256(v, mask);                          // isolate bit i within 32-bit lane i
v = _mm256_xor_si256(v, mask);                          // lane i is now zero iff bit i was set
return _mm256_cmpeq_epi32(v, _mm256_setzero_si256());   // turn those zero lanes into all-ones


Further improved version of @Zboson's implementation, incorporating a suggestion from @MarcGlisse:
__m256i v = _mm256_set1_epi8(u);       // broadcast the byte to all 32 lanes
v = _mm256_and_si256(v, mask);         // isolate bit i within 32-bit lane i
return _mm256_cmpeq_epi32(v, mask);    // lane i matches the mask iff bit i was set

(Note that mask needs to contain replicated 8-bit values in each 32-bit element, i.e. 0x01010101, 0x02020202, ..., 0x80808080.)
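
Putting it together, a self-contained version of that last variant might look like this (a sketch; the function name is illustrative, and bit i of the byte maps to 32-bit lane i, matching the mask values above):

#include <cstdint>
#include <immintrin.h>

// Broadcast the byte, AND with a per-lane bit mask, then compare equal to the
// mask: lane i becomes all-ones exactly when bit i of u is set.
__m256i expand_bitmap(uint8_t u)
{
    const __m256i mask = _mm256_setr_epi32(0x01010101, 0x02020202, 0x04040404, 0x08080808,
                                           0x10101010, 0x20202020, 0x40404040, int(0x80808080));
    __m256i v = _mm256_set1_epi8(static_cast<char>(u));
    v = _mm256_and_si256(v, mask);
    return _mm256_cmpeq_epi32(v, mask);
}

In a hot loop the mask constant would naturally be hoisted out and kept in a register rather than rebuilt per call.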


Answered by Paul R


