Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to speed up bit testing

I'm pondering at how to speed up bit testing in the following routine:

void histSubtractFromBits(uint64* cursor, uint16* hist){
    //traverse each bit of the 256-bit-long bitstring by splitting up into 4 bitsets
    std::bitset<64> a(*cursor);
    std::bitset<64> b(*(cursor+1));
    std::bitset<64> c(*(cursor+2));
    std::bitset<64> d(*(cursor+3));
    for(int bit = 0; bit < 64; bit++){
        hist[bit] -= a.test(bit);
    }
    for(int bit = 0; bit < 64; bit++){
        hist[bit+64] -= b.test(bit);
    }
    for(int bit = 0; bit < 64; bit++){
        hist[bit+128] -= c.test(bit);
    }
    for(int bit = 0; bit < 64; bit++){
        hist[bit+192] -= d.test(bit);
    }
}

The actual gcc implementation does a range-check for the bit argument, then &-s with a bitmask. I could do it without the bitsets and with my own bit-shifting / masking, but I'm fairly certain that won't yield any significant speedup (tell me if I'm wrong and why).

I'm not really familiar with the x86-64 assembly, but I am aware of a certain bit test instruction, and I am aware that it's theoretically possible to do inline assembly with gcc.

1) Do you think it at all worthwhile to write an inline-assembly analogue for the above code?

2) If yes, then how would I go about doing it, i.e. could you show me some basic starter code / samples to point me in the right direction?

like image 425
Greg Kramida Avatar asked Feb 13 '23 02:02

Greg Kramida


2 Answers

As far as I can tell, you basically iterate over each bit. As such, I'd imagine simply shifting and masking off the LSB every time should provide good performance. Something like:

uint64_t a = *cursor;
for(int bit = 0; a != 0; bit++, a >>= 1) {
    hist[bit] -= (a & 1);
}

Alternatively, if you expect only very few bits to be set and are happy with gcc specific stuff, you could use __builtin_ffsll

uint64_t a = *cursor;
int next;
for(int bit = 0; (next = __builtin_ffsll(a)) != 0; ) {
    bit += next;
    hist[bit - 1] -= 1;
    a >>= next;
}

The idea should be fine, but no warranty for the actual code :)

Update: code using vector extensions:

typedef short v8hi __attribute__ ((vector_size (16)));

static v8hi table[256];

void histSubtractFromBits(uint64_t* cursor, uint16_t* hist)
{
    uint8_t* cursor_tmp = (uint8_t*)cursor;
    v8hi* hist_tmp = (v8hi*)hist;
    for(int i = 0; i < 32; i++, cursor_tmp++, hist_tmp++)
    {
        *hist_tmp -= table[*cursor_tmp];
    }
}

void setup_table()
{
    for(int i = 0; i < 256; i++)
    {
        for(int j = 0; j < 8; j++)
        {
            table[i][j] = (i >> j) & 1;
        }
    }
}

This will be compiled to SSE instructions if available, for example I get:

        leaq    32(%rdi), %rdx
        .p2align 4,,10
        .p2align 3
.L2:
        movzbl  (%rdi), %eax
        addq    $1, %rdi
        movdqa  (%rsi), %xmm0
        salq    $4, %rax
        psubw   table(%rax), %xmm0
        movdqa  %xmm0, (%rsi)
        addq    $16, %rsi
        cmpq    %rdx, %rdi
        jne     .L2

Of course this approach relies on the table being in cache.

like image 108
Jester Avatar answered Feb 23 '23 02:02

Jester


Another suggestion is to combine data caching, registers and loop unrolling:

// Assuming your processor has 64-bit words
void histSubtractFromBits(uint64_t const * cursor, uint16* hist)
{
    register uint64_t a = *cursor++;
    register uint64_t b = *cursor++;
    register uint64_t c = *cursor++;
    register uint64_t d = *cursor++;
    register unsigned int i = 0;
    for (i = 0; i < (sizeof(*cursor) * CHAR_BIT; ++i)
    {
        hist[i +   0] += a & 1;
        hist[i +  64] += b & 1;
        hist[i + 128] += c & 1;
        hist[i + 192] += d & 1;
        a >>= 1;
        b >>= 1;
        c >>= 1;
        d >>= 1;
    }
}

I'm not sure if you gain any more performance by reordering the instructions like this:

    hist[i +   0] += a & 1;
    a >>= 1;

You could try both ways and compare the assembly language for both.

One of the ideas here is to maximize the register usage. The values to test are loaded into registers and then the testing begins.

like image 22
Thomas Matthews Avatar answered Feb 23 '23 00:02

Thomas Matthews