Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to quickly count bits into separate bins in a series of ints on Sandy Bridge? [duplicate]

Update: Please read the code, it is NOT about counting bits in one int

Is it possible to improve performance of the following code with some clever assembler?

uint bit_counter[64];

void Count(uint64 bits) {
  bit_counter[0] += (bits >> 0) & 1;
  bit_counter[1] += (bits >> 1) & 1;
  // ..
  bit_counter[63] += (bits >> 63) & 1;
}

Count is in the inner-most loop of my algorithm.

Update: Architecture: x86-64, Sandy Bridge, so SSE4.2, AVX1 and older tech can be used, but not AVX2 or BMI1/2.

bits variable has almost random bits (close to half zeros and half ones)

like image 544
Łukasz Lew Avatar asked Oct 17 '11 12:10

Łukasz Lew


3 Answers

You could try doing it with SSE, incrementing 4 elements per iteration.

Warning: untested code follows...

#include <stdint.h>
#include <emmintrin.h>

uint32_t bit_counter[64] __attribute__ ((aligned(16)));
                     // make sure bit_counter array is 16 byte aligned for SSE

void Count_SSE(uint64 bits)
{
    const __m128i inc_table[16] = {
        _mm_set_epi32(0, 0, 0, 0),
        _mm_set_epi32(0, 0, 0, 1),
        _mm_set_epi32(0, 0, 1, 0),
        _mm_set_epi32(0, 0, 1, 1),
        _mm_set_epi32(0, 1, 0, 0),
        _mm_set_epi32(0, 1, 0, 1),
        _mm_set_epi32(0, 1, 1, 0),
        _mm_set_epi32(0, 1, 1, 1),
        _mm_set_epi32(1, 0, 0, 0),
        _mm_set_epi32(1, 0, 0, 1),
        _mm_set_epi32(1, 0, 1, 0),
        _mm_set_epi32(1, 0, 1, 1),
        _mm_set_epi32(1, 1, 0, 0),
        _mm_set_epi32(1, 1, 0, 1),
        _mm_set_epi32(1, 1, 1, 0),
        _mm_set_epi32(1, 1, 1, 1)
    };

    for (int i = 0; i < 64; i += 4)
    {
        __m128i vbit_counter = _mm_load_si128(&bit_counter[i]);
                                          // load 4 ints from bit_counter
        int index = (bits >> i) & 15;     // get next 4 bits
        __m128i vinc = inc_table[index];  // look up 4 increments from LUT
        vbit_counter = _mm_add_epi32(vbit_counter, vinc);
                                          // increment 4 elements of bit_counter
        _mm_store_si128(&bit_counter[i], vbit_counter);
    }                                     // store 4 updated ints
}

How it works: essentially all we are doing here is vectorizing the original loop so that we process 4 bits per loop iteration instead of 1. So we now have 16 loop iterations instead of 64. For each iteration we load 4 bits from bits, then use them as an index into a LUT which contains all possible combinations of 4 increments for the current 4 bits. We then add these 4 increments to the current 4 elements of bit_counter.

The number of loads and stores and adds is reduced by a factor of 4, but this will be offset somewhat by the LUT load and other housekeeping. You may still see a 2x speed up though. I'd be interested to know the result if you do decide to try it.

like image 153
Paul R Avatar answered Nov 08 '22 22:11

Paul R


Maybe you can do 8 at once, by taking 8 bits spaced 8 apart and keeping 8 uint64's for the counts. That's only 1 byte per single counter though, so you can only accumulate 255 invocations of count before you'd have to unpack those uint64's.

like image 7
harold Avatar answered Nov 08 '22 22:11

harold


Look at Bit Twiddling Hacks

  • Counting bits set
    • Counting bits set, naive way
    • Counting bits set by lookup table
    • Counting bits set, Brian Kernighan's way
    • Counting bits set in 12, 24, or 32-bit words using 64-bit instructions
    • Counting bits set, in parallel
    • Count bits set (rank) from the most-significant bit upto a given position
    • Select the bit position (from the most-significant bit) with the given count (rank)

Edit As for the 'bit position bucket accumulation' (bit_counter[]) I have a feeling that this might be a good case for valarrays + masking. That'd be a fair bit of coding+testing+profiling though. Let me know if you are really interested.

You could, these days, come very close to valarray behaviour using tied tuples (TR1, boost or C++11); I have a feeling it would come out being simpler to read and slower to compile.

like image 5
sehe Avatar answered Nov 08 '22 22:11

sehe