Is there a more efficient way of expanding a char to a uint64_t?

I want to inflate an unsigned char to a uint64_t by repeating each bit 8 times. For example:

char -> uint64_t
0x00 -> 0x00
0x01 -> 0xFF
0x02 -> 0xFF00
0x03 -> 0xFFFF
0xAA -> 0xFF00FF00FF00FF00

I currently have the following implementation, which uses bit shifts to test whether each bit is set:

#include <stdint.h>
#include <inttypes.h>   

#define BIT_SET(var, pos) ((var) & (1 << (pos)))

static uint64_t inflate(unsigned char a)
{
    uint64_t MASK = 0xFF;
    uint64_t result = 0;
    for (int i = 0; i < 8; i++) {
        if (BIT_SET(a, i))
            result |= (MASK << (8 * i));    
    }

    return result;
} 

However, I'm fairly new to C, so this fiddling with individual bits makes me a little wary that there might be a better (i.e. more efficient) way of doing this.

EDIT TO ADD
OK, so after trying out the table-lookup solution, here are the results. Keep in mind that I didn't test the routine directly, but rather as part of a bigger function (a multiplication of binary matrices, to be precise), so this might have affected how the results turned out. On my computer, multiplying a million 8x8 matrices, compiled with:

  gcc -O2 -Wall -std=c99 foo.c

I got

./a.out original
real    0m0.127s
user    0m0.124s
sys     0m0.000s

./a.out table_lookup
real    0m0.012s
user    0m0.012s
sys     0m0.000s

So at least on my machine (a 64-bit Linux Mint virtual machine, I should mention), the table-lookup approach seems to provide roughly a 10x speed-up, so I will accept that as the answer.
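For reference, a stripped-down harness that times the routine on its own (rather than inside the matrix multiplication I actually used, so the absolute numbers will differ) might look roughly like this, assuming inflate() — or whichever variant is being measured — is defined in the same file:

#include <stdint.h>
#include <inttypes.h>
#include <stdio.h>

/* Standalone timing sketch: call inflate() with varying arguments many times
   and fold the results into a sink so the compiler cannot optimise the calls
   away. Run under `time ./a.out`. */
int main(void)
{
    uint64_t sink = 0;
    for (int iter = 0; iter < 1000000; iter++)
        for (int b = 0; b < 256; b++)
            sink += inflate((unsigned char) (b ^ iter));
    printf("%016" PRIx64 "\n", sink);
    return 0;
}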

asked Jan 11 '13 by hakoja


3 Answers

If you're looking for efficiency, use a lookup table: a static array of 256 entries, each already holding the required result. You can use your code above to generate it.
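A minimal sketch of what that could look like (the names are mine; the table is filled once at startup with the same bit loop as in the question, though it could just as well be a hard-coded initializer):

#include <stdint.h>

/* One precomputed 64-bit result per possible input byte (256 * 8 bytes = 2 kB). */
static uint64_t inflate_table[256];

/* Fill the table once, e.g. at program start, using the straightforward
   bit-by-bit method from the question. */
static void init_inflate_table(void)
{
    for (int b = 0; b < 256; b++) {
        uint64_t result = 0;
        for (int i = 0; i < 8; i++) {
            if (b & (1 << i))
                result |= (uint64_t) 0xFF << (8 * i);
        }
        inflate_table[b] = result;
    }
}

/* The hot path is then a single array load. */
static uint64_t inflate_lookup(unsigned char a)
{
    return inflate_table[a];
}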

answered by Nitzan Shaked


On some architectures (SSE, NEON) there are fast vector operations that can speed up this task, or that are designed to do exactly this. Without special instructions, the suggested look-up-table approach is both the fastest and the most portable.

If the 2 kB table size (256 entries of 8 bytes each) is an issue, the parallel vector arithmetic can be simulated:

static uint64_t inflate_parallel(unsigned char a) {
  uint64_t vector = a * 0x0101010101010101ULL;
  // replicate the byte into every byte of the qword:
  // 0xA5 becomes A5 A5 A5 A5 A5 A5 A5 A5
  vector &= 0x8040201008040201ULL;  // keep a different bit in each byte: 80 00 20 00 00 04 00 01
  vector += 0x00406070787c7e7fULL;  // becomes 80 40 80 70 78 80 7e 80
                                    // the MSB of each byte is now correct
  vector = (vector >> 7) & 0x0101010101010101ULL;  // the LSB of each byte is now correct
  return vector * 255;                             // spread the LSB to all bits of each byte
}
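A quick way to convince yourself that the per-byte additions never carry into a neighbouring byte is to check the routine against the examples from the question; a throwaway test of my own along these lines:

#include <assert.h>

/* Sanity check against the values listed in the question. */
static void check_inflate_parallel(void)
{
    assert(inflate_parallel(0x00) == 0x0000000000000000ULL);
    assert(inflate_parallel(0x01) == 0x00000000000000FFULL);
    assert(inflate_parallel(0x02) == 0x000000000000FF00ULL);
    assert(inflate_parallel(0x03) == 0x000000000000FFFFULL);
    assert(inflate_parallel(0xAA) == 0xFF00FF00FF00FF00ULL);
}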

EDIT: 2^31 iterations (unrolled four times to mitigate loop overhead)

time ./parallel            time ./original            time ./lookup
real        0m2.038s       real       0m14.161s       real      0m1.436s
user        0m2.030s       user       0m14.120s       user      0m1.430s
sys         0m0.000s       sys        0m0.000s        sys       0m0.000s

That's about a 7x speedup, while the lookup table gives a ~10x speedup.

answered by Aki Suihkonen


You should profile what your code does before worrying about optimising it.

On my compiler, your code gets entirely inlined and unrolled: it turns into eight constant test-and-OR instructions when the value is unknown, and into a single constant when the value is known at compile time. I could probably improve it marginally by removing a few branches, but the compiler is doing a reasonable job on its own.

Optimising the loop is then a bit pointless. A table lookup might be more efficient, but would probably prevent the compiler from making optimisations itself.
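For illustration, a branch-free version of the same loop could look like the sketch below (the name is mine; whether it actually beats the original depends entirely on what the compiler already emits for the branchy version):

#include <stdint.h>

/* Branch-free variant: turn each bit into 0x00 or 0xFF by multiplication
   instead of a conditional OR. */
static uint64_t inflate_branchless(unsigned char a)
{
    uint64_t result = 0;
    for (int i = 0; i < 8; i++) {
        uint64_t bit = (a >> i) & 1u;         /* 0 or 1 */
        result |= (bit * 0xFF) << (8 * i);    /* 0x00 or 0xFF in byte i */
    }
    return result;
}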

answered by JasonD