I need to shuffle a 16 bit unsigned integer in a way that the even indexes land in the lower byte, and the odd indexes land in the upper byte. <pre class="prettyprint"><code>input: fedcba98 76543210 (contiguously numbered) output: fdb97531 eca86420 (even and odd separated) </code></pre> My code looks like this at the moment: <pre class="prettyprint"><code>typedef unsigned short u16; u16 segregate(u16 x) { u16 g = (x & 0x0001); u16 h = (x & 0x0004) >> 1; u16 i = (x & 0x0010) >> 2; u16 j = (x & 0x0040) >> 3; u16 k = (x & 0x0100) >> 4; u16 l = (x & 0x0400) >> 5; u16 m = (x & 0x1000) >> 6; u16 n = (x & 0x4000) >> 7; u16 o = (x & 0x0002) << 7; u16 p = (x & 0x0008) << 6; u16 q = (x & 0x0020) << 5; u16 r = (x & 0x0080) << 4; u16 s = (x & 0x0200) << 3; u16 t = (x & 0x0800) << 2; u16 u = (x & 0x2000) << 1; u16 v = (x & 0x8000); return g | h | i | j | k | l | m | n | o | p | q | r | s | t | u | v; } </code></pre> I wonder if there is a more elegant solution than simply extracting and shifting each individual bit?

There is a very convenient web resource that helps solving many bit permutation problems: Code generator for bit permutations. In this particular case feeding "0 2 4 6 8 10 12 14 1 3 5 7 9 11 13 15" to this page produces pretty fast code. Unfortunately this code generator cannot produce 64-bit code (though anyone could download sources and add this option). So if we need to perform 4 permutations in parallel using 64-bit instructions, we have to extend all involved bitmasks to 64 bits manually: <pre class="prettyprint"><code>uint64_t bit_permute_step(uint64_t x, uint64_t m, unsigned shift) { uint64_t t; t = ((x >> shift) ^ x) & m; x = (x ^ t) ^ (t << shift); return x; } uint64_t segregate4(uint64_t x) { // generated by http://programming.sirrida.de/calcperm.php, extended to 64-bit x = bit_permute_step(x, 0x2222222222222222ull, 1); x = bit_permute_step(x, 0x0c0c0c0c0c0c0c0cull, 2); x = bit_permute_step(x, 0x00f000f000f000f0ull, 4); return x; } </code></pre> Level of parallelism could be increased even more (8 or 16 permutations at once) with SSE instructions. (And recent versions of gcc can vectorize this code automatically). If parallelism is not required and data cache is not extensively used by other parts of the program, better alternative would be to use lookup table. Various LUT approacehes are already discussed in other answers, still some more could be said here: <ol> <li>The first and the last bits of 16-bit word are never permuted, we need to shuffle only bits 1..14. So (if we want to perform the task with single LUT access) it is enough to have a LUT with 16K entries which means 32K of memory.</li> <li>We could combine table lookup and computation approaches. Two lookups in a single 256-byte table could shuffle each source byte separately. After this we only need to exchange two middle 4-bit nibbles. This allows to keep lookup table small, uses only 2 memory accesses, and needs not too much calculations (i.e. balances calculations and memory accesses).</li> </ol> Here is implementation of second approach: <pre class="prettyprint"><code>#define B10(x) x+0x00, x+0x10, x+0x01, x+0x11 #define B32(x) B10(x+0x00), B10(x+0x20), B10(x+0x02), B10(x+0x22) #define B54(x) B32(x+0x00), B32(x+0x40), B32(x+0x04), B32(x+0x44) uint8_t lut[256] = {B54( 0x00), B54( 0x80), B54( 0x08), B54( 0x88)}; #undef B54 #undef B32 #undef B10 uint_fast16_t segregateLUT(uint_fast16_t x) { uint_fast16_t low = lut[x & 0x00ff]; low |= low << 4; uint_fast16_t high = lut[x >> 8] << 4; high |= high << 4; return (low & 0x0f0f) | (high & 0xf0f0); } </code></pre> But fastest approach (if portability is not an issue) is using <code>pext</code> instruction from BMI2 instruction set as noted by Nils Pipenbrinck. With a pair of 64-bit <code>pext</code> we could perform 4 16-bit shuffles in parallel. Since <code>pext</code> instruction is intended exactly for this kind of bit permutations, this approach easily outperforms all others.

How can I shuffle bits efficiently?

Tags:

I need to shuffle a 16 bit unsigned integer in a way that the even indexes land in the lower byte, and the odd indexes land in the upper byte.

input: fedcba98 76543210 (contiguously numbered)  output: fdb97531 eca86420 (even and odd separated)

My code looks like this at the moment:

typedef unsigned short u16;  u16 segregate(u16 x) {     u16 g = (x & 0x0001);     u16 h = (x & 0x0004) >> 1;     u16 i = (x & 0x0010) >> 2;     u16 j = (x & 0x0040) >> 3;     u16 k = (x & 0x0100) >> 4;     u16 l = (x & 0x0400) >> 5;     u16 m = (x & 0x1000) >> 6;     u16 n = (x & 0x4000) >> 7;      u16 o = (x & 0x0002) << 7;     u16 p = (x & 0x0008) << 6;     u16 q = (x & 0x0020) << 5;     u16 r = (x & 0x0080) << 4;     u16 s = (x & 0x0200) << 3;     u16 t = (x & 0x0800) << 2;     u16 u = (x & 0x2000) << 1;     u16 v = (x & 0x8000);      return g | h | i | j | k | l | m | n | o | p | q | r | s | t | u | v; }

I wonder if there is a more elegant solution than simply extracting and shifting each individual bit?

954

asked Sep 09 '14 17:09

fredoverflow

2 Answers

The table approach shown by others is the most portable version and is probably quite fast.

If you want to take advantage of special instruction sets there are some other options as well. For Intel Haswell and later for example the following approach can be used (requires the BMI2 instruction set extension):

unsigned segregate_bmi (unsigned arg) {   unsigned oddBits  = _pext_u32(arg,0x5555);   unsigned evenBits = _pext_u32(arg,0xaaaa);   return (oddBits | (evenBits << 8)); }

answered Oct 08 '22 06:10

Nils Pipenbrinck

There is a very convenient web resource that helps solving many bit permutation problems: Code generator for bit permutations. In this particular case feeding "0 2 4 6 8 10 12 14 1 3 5 7 9 11 13 15" to this page produces pretty fast code.

Unfortunately this code generator cannot produce 64-bit code (though anyone could download sources and add this option). So if we need to perform 4 permutations in parallel using 64-bit instructions, we have to extend all involved bitmasks to 64 bits manually:

uint64_t bit_permute_step(uint64_t x, uint64_t m, unsigned shift) {   uint64_t t;   t = ((x >> shift) ^ x) & m;   x = (x ^ t) ^ (t << shift);   return x; }  uint64_t segregate4(uint64_t x) { // generated by http://programming.sirrida.de/calcperm.php, extended to 64-bit   x = bit_permute_step(x, 0x2222222222222222ull, 1);   x = bit_permute_step(x, 0x0c0c0c0c0c0c0c0cull, 2);   x = bit_permute_step(x, 0x00f000f000f000f0ull, 4);   return x; }

Level of parallelism could be increased even more (8 or 16 permutations at once) with SSE instructions. (And recent versions of gcc can vectorize this code automatically).

If parallelism is not required and data cache is not extensively used by other parts of the program, better alternative would be to use lookup table. Various LUT approacehes are already discussed in other answers, still some more could be said here:

The first and the last bits of 16-bit word are never permuted, we need to shuffle only bits 1..14. So (if we want to perform the task with single LUT access) it is enough to have a LUT with 16K entries which means 32K of memory.
We could combine table lookup and computation approaches. Two lookups in a single 256-byte table could shuffle each source byte separately. After this we only need to exchange two middle 4-bit nibbles. This allows to keep lookup table small, uses only 2 memory accesses, and needs not too much calculations (i.e. balances calculations and memory accesses).

Here is implementation of second approach:

#define B10(x)          x+0x00,      x+0x10,      x+0x01,      x+0x11 #define B32(x)      B10(x+0x00), B10(x+0x20), B10(x+0x02), B10(x+0x22) #define B54(x)      B32(x+0x00), B32(x+0x40), B32(x+0x04), B32(x+0x44) uint8_t lut[256] = {B54(  0x00), B54(  0x80), B54(  0x08), B54(  0x88)}; #undef B54 #undef B32 #undef B10  uint_fast16_t segregateLUT(uint_fast16_t x) {   uint_fast16_t low = lut[x & 0x00ff];   low |= low << 4;   uint_fast16_t high = lut[x >> 8] << 4;   high |= high << 4;   return (low & 0x0f0f) | (high & 0xf0f0); }

But fastest approach (if portability is not an issue) is using pext instruction from BMI2 instruction set as noted by Nils Pipenbrinck. With a pair of 64-bit pext we could perform 4 16-bit shuffles in parallel. Since pext instruction is intended exactly for this kind of bit permutations, this approach easily outperforms all others.

answered Oct 08 '22 07:10

Evgeny Kluev

Related questions
                            
                                Resolving circular imports in celery and django
                            
                                bash script execute commands after ssh
                            
                                OpenCV for Python 3.x under Windows [duplicate]
                            
                                Does SqlBulkCopy automatically start a transaction?
                            
                                Can named arguments be used with Python enums?
                            
                                java.lang.NullPointerException at android.support.v7.widget.RecyclerView.onMeasure
                            
                                How to animate UITableViewCell height using auto-layout?
                            
                                Angular js Nesting custom directives
                            
                                Why is the function argument not overwritten on creating variable of same name inside the function?
                            
                                Asset Catalog Compiler Error - none of the input catalogs contained a matching app icon set named "AppIcon"
                            
                                How to send Laravel error responses as JSON
                            
                                iOS8/Swift: Tab Bar Item getting smaller on click

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With