
What's the fastest way to pack 32 0/1 values into the bits of a single 32-bit variable?

I'm working on an x86 or x86_64 machine. I have an array unsigned int a[32], all of whose elements are either 0 or 1. I want to set a single variable unsigned int b so that ((b >> i) & 1) == a[i] holds for all 32 elements of a. I'm working with GCC on Linux (which shouldn't matter much, I guess).

What's the fastest way to do this in C?
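For reference, the requirement can be spelled out as a plain loop. This is only a baseline sketch to compare faster versions against; the name pack32_naive is made up for illustration and is not from the question.

```c
#include <stdint.h>

/* Reference implementation: bit i of the result is a[i] (each a[i] is 0 or 1). */
uint32_t pack32_naive(const unsigned int a[32])
{
    uint32_t b = 0;
    for (int i = 0; i < 32; i++)
        b |= (uint32_t)(a[i] & 1u) << i;   /* a[0] lands in bit 0 */
    return b;
}
```

Any faster variant should return the same value as this loop for every 0/1 input.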

einpoklum asked Oct 05 '14 08:10


2 Answers

The fastest way on recent x86 processors is probably to use the PMOVMSKB family of instructions, which extract the most significant bit of each byte in a SIMD register and pack those bits into a general-purpose integer register.

I fear SIMD intrinsics are not really my thing, but something along these lines ought to work if you've got an AVX2-equipped processor:

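#include <immintrin.h>  /* AVX2 intrinsics; compile with -mavx2 */
#include <stdbool.h>
#include <stdint.h>

uint32_t bitpack(const bool array[32]) {
    __m256i tmp = _mm256_loadu_si256((const __m256i *) array);
    tmp = _mm256_cmpgt_epi8(tmp, _mm256_setzero_si256()); /* 0xFF where nonzero */
    return _mm256_movemask_epi8(tmp); /* bit i = MSB of byte i */
}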

This assumes sizeof(bool) == 1. For older SSE2-only systems you will have to string together a pair of 128-bit operations instead. Aligning the array on a 32-byte boundary should save another cycle or so.
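A rough sketch of what that pair of 128-bit operations might look like with SSE2 intrinsics. The function name bitpack_sse2 is invented here, and the same sizeof(bool) == 1 assumption applies:

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdbool.h>
#include <stdint.h>

uint32_t bitpack_sse2(const bool array[32]) {
    const __m128i zero = _mm_setzero_si128();
    __m128i lo = _mm_loadu_si128((const __m128i *) array);        /* bools 0..15  */
    __m128i hi = _mm_loadu_si128((const __m128i *) (array + 16)); /* bools 16..31 */
    lo = _mm_cmpgt_epi8(lo, zero);  /* 0xFF in every byte that holds a nonzero bool */
    hi = _mm_cmpgt_epi8(hi, zero);
    /* each movemask yields 16 bits; the upper half goes into bits 16..31 */
    return (uint32_t)_mm_movemask_epi8(lo) |
           ((uint32_t)_mm_movemask_epi8(hi) << 16);
}
```

Like the AVX2 version, this puts array[0] in bit 0 of the result.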

doynax answered Nov 13 '22 01:11


If sizeof(bool) == 1, then on a machine with fast multiplication you can pack 8 bools at a time into 8 bits (more with 128-bit multiplications) by multiplying with a magic constant, like this:

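#include <stdbool.h>
#include <stdint.h>

/* Packs a[0..7] into one byte, with a[0] in the most significant bit. */
static inline uint32_t pack8b(const bool* a)
{
    uint64_t t = *((const uint64_t*)a);
    return (uint32_t)((0x8040201008040201ULL*t >> 56) & 0xFF);
}

/* Unsigned return types keep the << 24 below from overflowing a signed int. */
uint32_t pack32b(const bool* a)
{
    return (pack8b(a +  0) << 24) | (pack8b(a +  8) << 16) |
           (pack8b(a + 16) <<  8) | (pack8b(a + 24) <<  0);
}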

Explanation:

Suppose the bools a[0] to a[7] have their least significant bits named a-h respectively. Loading those 8 consecutive bools as one 64-bit word on a little-endian machine puts a[0] in the lowest byte, so the bits appear in reversed order when the word is written most significant bit first. Now we do a multiplication (dots are zero bits):

  |  a7  ||  a6  ||  a5  ||  a4  ||  a3  ||  a2  ||  a1  ||  a0  |
  .......h.......g.......f.......e.......d.......c.......b.......a
× 1000000001000000001000000001000000001000000001000000001000000001
  ────────────────────────────────────────────────────────────────
  ↑......h.↑.....g..↑....f...↑...e....↑..d.....↑.c......↑b.......a
  ↑.....g..↑....f...↑...e....↑..d.....↑.c......↑b.......a
  ↑....f...↑...e....↑..d.....↑.c......↑b.......a
+ ↑...e....↑..d.....↑.c......↑b.......a
  ↑..d.....↑.c......↑b.......a
  ↑.c......↑b.......a
  ↑b.......a
  a       
  ────────────────────────────────────────────────────────────────
= abcdefghxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

The arrows are added so it's easier to see the positions of the set bits in the magic number. At this point the 8 least significant bits have been gathered in the top byte, with a in the most significant position; we just need to shift them down and mask out the remaining bits.

So using the magic number 0b1000000001000000001000000001000000001000000001000000001000000001, or 0x8040201008040201, gives the code above.

Of course you need to make sure that the bool array is correctly 8-byte aligned. You can also unroll the code and optimize it, for example by masking each top byte in place and shifting only once, instead of shifting right by 56 bits and then left again.
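Note that the multiplication above puts a[0] in the most significant bit, whereas the question wants a[0] in bit 0. A possible variant, sketched here as an assumption rather than taken from the answer, mirrors the magic constant to 0x0102040810204080 so that a[i] lands in bit 56 + i of the product. The names pack8_lsb and pack32_lsb are hypothetical, and memcpy is used for the load so no particular alignment is needed:

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Packs a[0..7] into one byte with a[0] in bit 0.
   Assumes a little-endian host, sizeof(bool) == 1, and values of exactly 0 or 1. */
static inline uint32_t pack8_lsb(const bool *a)
{
    uint64_t t;
    memcpy(&t, a, sizeof t);   /* avoids alignment and aliasing problems */
    /* a[i] sits at bit 8*i; the constant has bits 7, 14, ..., 56 set,
       so the term 8*i + 7*(8 - i) = 56 + i carries a[i] into the top byte */
    return (uint32_t)((0x0102040810204080ULL * t) >> 56);
}

uint32_t pack32_lsb(const bool *a)
{
    return  pack8_lsb(a)             | (pack8_lsb(a +  8) <<  8) |
           (pack8_lsb(a + 16) << 16) | (pack8_lsb(a + 24) << 24);
}
```

As with the original constant, no two partial products hit the same bit position, so there are no carries to disturb the top byte.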


Sorry, I read the question too quickly: I saw doynax's bool array, misread "32 0/1 values", and thought they were 32 bools. Of course the same technique can also be used to pack multiple uint32_t or uint16_t values (or another distribution of bits) at the same time, but it's a lot less efficient than packing bytes.
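For the question's actual unsigned int a[32], one option that is not the multiplication trick is to combine a 32-bit lane shift with the movemask idea from the other answer. This is a sketch under the assumptions that unsigned int is 32 bits wide and every element is exactly 0 or 1; the name pack32_u32_sse2 is invented for illustration:

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

uint32_t pack32_u32_sse2(const unsigned int a[32])
{
    uint32_t b = 0;
    for (int i = 0; i < 32; i += 4) {
        __m128i v = _mm_loadu_si128((const __m128i *)(a + i));
        v = _mm_slli_epi32(v, 31);   /* move each 0/1 into its lane's sign bit */
        /* movemask_ps collects the 4 sign bits; lane 0 becomes bit 0 */
        b |= (uint32_t)_mm_movemask_ps(_mm_castsi128_ps(v)) << i;
    }
    return b;
}
```

Each iteration handles four elements, so the whole array takes eight loads.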

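On newer x86 CPUs with BMI2, the PEXT instruction can be used. The pack8b function above can be replaced with the expression below. Note that PEXT puts a[0] in bit 0, the reverse of pack8b, so the shift amounts in pack32b have to be mirrored as well: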

_pext_u64(*((uint64_t*)a), 0x0101010101010101ULL);

And to pack 2 uint32_t values at a time, as the question requires, use:

_pext_u64(*((uint64_t*)a), (1ULL << 32) | 1ULL);
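Putting the byte-wise PEXT call together for 32 bools might look like the sketch below. The names pack8_pext and pack32_pext are hypothetical, and the loop in the #else branch is only a stand-in with the same result for builds without -mbmi2 (an assumption added here so the code compiles everywhere, not something the instruction needs):

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>
#ifdef __BMI2__
#include <immintrin.h>
#endif

/* Extracts bit 0 of each of 8 bytes; a[0] ends up in bit 0 (little-endian host). */
static inline uint32_t pack8_pext(const bool *a)
{
    uint64_t t;
    memcpy(&t, a, sizeof t);
#ifdef __BMI2__
    return (uint32_t)_pext_u64(t, 0x0101010101010101ULL);
#else
    /* portable stand-in: gather bit 8*i of t into bit i */
    uint32_t r = 0;
    for (int i = 0; i < 8; i++)
        r |= (uint32_t)((t >> (8 * i)) & 1) << i;
    return r;
#endif
}

uint32_t pack32_pext(const bool *a)
{
    return  pack8_pext(a)             | (pack8_pext(a +  8) <<  8) |
           (pack8_pext(a + 16) << 16) | (pack8_pext(a + 24) << 24);
}
```

Because PEXT fills the result from bit 0 upward, the groups are shifted by 0, 8, 16 and 24 here, the mirror image of pack32b.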
phuclv answered Nov 13 '22 01:11