Update: Please read the code; this is NOT about counting the bits in a single int.
Is it possible to improve the performance of the following code with some clever assembler?
#include <stdint.h>

uint32_t bit_counter[64];

void Count(uint64_t bits) {
    bit_counter[0] += (bits >> 0) & 1;
    bit_counter[1] += (bits >> 1) & 1;
    // ...
    bit_counter[63] += (bits >> 63) & 1;
}
Count is in the innermost loop of my algorithm.
Update: Architecture: x86-64, Sandy Bridge, so SSE4.2, AVX1 and older tech can be used, but not AVX2 or BMI1/2.
The bits variable has almost random bits (close to half zeros and half ones).
You could try doing it with SSE, incrementing 4 elements per iteration.
Warning: untested code follows...
#include <stdint.h>
#include <emmintrin.h>

// make sure bit_counter array is 16 byte aligned for SSE
uint32_t bit_counter[64] __attribute__ ((aligned(16)));

void Count_SSE(uint64_t bits)
{
const __m128i inc_table[16] = {
_mm_set_epi32(0, 0, 0, 0),
_mm_set_epi32(0, 0, 0, 1),
_mm_set_epi32(0, 0, 1, 0),
_mm_set_epi32(0, 0, 1, 1),
_mm_set_epi32(0, 1, 0, 0),
_mm_set_epi32(0, 1, 0, 1),
_mm_set_epi32(0, 1, 1, 0),
_mm_set_epi32(0, 1, 1, 1),
_mm_set_epi32(1, 0, 0, 0),
_mm_set_epi32(1, 0, 0, 1),
_mm_set_epi32(1, 0, 1, 0),
_mm_set_epi32(1, 0, 1, 1),
_mm_set_epi32(1, 1, 0, 0),
_mm_set_epi32(1, 1, 0, 1),
_mm_set_epi32(1, 1, 1, 0),
_mm_set_epi32(1, 1, 1, 1)
};
for (int i = 0; i < 64; i += 4)
{
    // load 4 ints from bit_counter
    __m128i vbit_counter = _mm_load_si128((const __m128i *)&bit_counter[i]);
    int index = (bits >> i) & 15;                     // get next 4 bits
    __m128i vinc = inc_table[index];                  // look up 4 increments from LUT
    vbit_counter = _mm_add_epi32(vbit_counter, vinc); // increment 4 elements of bit_counter
    // store 4 updated ints
    _mm_store_si128((__m128i *)&bit_counter[i], vbit_counter);
}
}
How it works: essentially all we are doing here is vectorizing the original loop so that we process 4 bits per iteration instead of 1, giving 16 loop iterations instead of 64. For each iteration we take the next 4 bits from bits and use them as an index into a LUT which contains all possible combinations of 4 increments for those 4 bits. We then add these 4 increments to the current 4 elements of bit_counter.
The number of loads and stores and adds is reduced by a factor of 4, but this will be offset somewhat by the LUT load and other housekeeping. You may still see a 2x speed up though. I'd be interested to know the result if you do decide to try it.
Maybe you can do 8 at once, by taking 8 bits spaced 8 apart and keeping 8 uint64s for the counts. That's only 1 byte per single counter though, so you can only accumulate 255 invocations of Count before you'd have to unpack those uint64s.
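The byte-lane idea above might be sketched like this (untested sketch; the names acc, pending, and Flush are mine, not from the original answer):

```cpp
#include <cstdint>

uint32_t bit_counter[64];

// one 64-bit accumulator per bit offset j; byte m of acc[j] counts bit 8*m + j
static uint64_t acc[8];
static unsigned pending; // Count() calls folded into acc since the last flush

// fold the byte-sized partial counts into bit_counter (call once at the end too)
void Flush()
{
    for (int j = 0; j < 8; ++j) {
        for (int m = 0; m < 8; ++m)
            bit_counter[8 * m + j] += uint32_t((acc[j] >> (8 * m)) & 0xFF);
        acc[j] = 0;
    }
    pending = 0;
}

void Count(uint64_t bits)
{
    // each masked add drops bit 8*m+j of 'bits' into the low bit of byte m of acc[j]
    for (int j = 0; j < 8; ++j)
        acc[j] += (bits >> j) & 0x0101010101010101ULL;
    if (++pending == 255) // a 256th add could overflow a byte lane
        Flush();
}
```

Note that Flush must also be called once at the very end, to fold any remaining partial counts into bit_counter.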
Look at Bit Twiddling Hacks
Edit: As for the 'bit position bucket accumulation' (bit_counter[]), I have a feeling that this might be a good case for valarrays + masking. That'd be a fair bit of coding + testing + profiling though. Let me know if you are really interested.
You could, these days, come very close to valarray behaviour using tied tuples (TR1, boost or C++11); I have a feeling it would come out being simpler to read and slower to compile.
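For what it's worth, a minimal (and unprofiled) sketch of the valarray shape might look like the following; Count_valarray is a hypothetical name, and the unpacking loop here is still scalar, so any real speedup would have to come from masking tricks plus profiling:

```cpp
#include <valarray>
#include <cstdint>

// per-bit counters held in a valarray so the increment is element-wise
std::valarray<uint32_t> bit_counter(uint32_t(0), 64);

void Count_valarray(uint64_t bits)
{
    // unpack the 64 bits into a 0/1 increment vector (still a scalar loop)
    std::valarray<uint32_t> inc(uint32_t(0), 64);
    for (size_t i = 0; i < 64; ++i)
        inc[i] = uint32_t((bits >> i) & 1u);
    bit_counter += inc; // element-wise add; the compiler may vectorize this
}
```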