I have an unsigned 32-bit integer value that contains multiple groups of bits. All bits within one group are always the same. Each group has the same size n
(from 1 to 32 and is unknown at compile time). There are no gaps between the groups.
I need to convert each group of these bits into one bit. Next, I need to pack these bits as depicted below:
I have come up with the following C function that iterates over these bits and packs them for an arbitrary n
between 1 and 32.
uint32_t PackBits(uint32_t value, uint8_t n)
{
if (n<2) return value;
uint32_t res = 0;
uint32_t mask = 1; // Also works for: mask= 1 << n-1; mask= 1 << n-2; mask= 1 << n-3; ... ; mask= 1 << n-n;
uint8_t s = 0;
do {
res |= (value & mask) >> s;
mask <<= n;
s += n-1;
} while (mask);
return res;
}
Q: Is there a way to accomplish this goal in C without looping over these bits, with some clever sequence of multiplications, De Bruijn1, carryless additions, xor, negations, etc... ?
For example, by taking advantage of the following bit-gathering property of multiplication, where d
,c
,b
,a
denote the bits representatively chosen from each bit group of the input value:
byte 3 byte 2 byte 1 byte 0
├──────┤├──────┤├──────┤├──────┤
ddddddddccccccccbbbbbbbbaaaaaaaa (input value)
& 00000001100000000000000110000000 (bitwise "and" mask 0x1800180 for extracting the chosen bits)
────────────────────────────────
= 0000000dc00000000000000ba0000000 (chosen bits generated by the bitwise "and" operation)
× 00000000000000000100000000000001 (a constant magic multiplier 0x4001. Works only for n==8)
────────────────────────────────
= 0000000dcba000000000000ba0000000 (the product modulo 2^32)
Note that the "product modulo 2^32" can be right shifted by 21 to yield the correct output value
Similar bit-gathering operations can be made with different constant multipliers for n
between 6
and 31
.
Maybe a similar bit-gathering operations can be made with division, too.
Footnote 1: De Bruijn sequences as perfect hash functions:
For n==2
, the operation (value*0x2AAB & 0x7FFF8000) >> 15
generates a UNIQUE 16-bit number which could index a 64K element LUT, or have something done to map it to the desired bits. 64K of uint16_t
s would fit in L2 cache on modern CPUs, but would only be reasonable if we were doing this for a lot of values in a row so it would pay for the demand misses to get data into L2 / L1d cache in the first place, and pay for the later cache misses from stuff that wouldn't have been evicted by other strategies that run more instructions but avoid a LUT.
For n==3
, (value*0x40024B7 & 0x0FFC0000) >> 18
generates a UNIQUE 10-bit value (which can index a 1024 element LUT).
For n==4
, (value*0x22AAF & 0x1FE00000) >> 21
generates a UNIQUE 8-bit value (which can index a 256 element LUT). A 256 element LUT is actually reasonable, especially since its entries are only uint8_t
in size.
For n==5
, (value*0x84ADF & 0x03F00000) >> 20
generates a UNIQUE 6-bit value (which can index a 64 element LUT). A 64 element LUT is actually reasonable, especially since its entries are only uint8_t
in size.
(These numbers haven't been checked carefully for uniqueness)
Probably it will be more efficient than looping as loops are not pipeline-friendly.
#define MASK(n, x) ((1U << (n + 1)) - 1) << ((x) * (n))
#define BITVAL(val, n , x) ((((val) & MASK(n,x)) == MASK(n,x)) << (x))
#define PACKBITS4(val) (BITVAL(val, 4, 0) + BITVAL(val, 4, 1) + BITVAL(val, 4, 2) + BITVAL(val, 4, 3))
#define PACKBITS5(val) (BITVAL(val, 5, 0) + BITVAL(val, 5, 1) + BITVAL(val, 5, 2))
it is for 16bits integers but you can hardcode all possible n
s. When n
is a constant expression it will generate code with no jumps, if not it will jump but not loop
https://godbolt.org/z/nzqYsee3Y
It can be potentially faster but it has to be tested.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With