Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Fastest way to count bits [duplicate]

Tags:

c

assembly

Possible Duplicate:
How to count the number of set bits in a 32-bit integer?

Give a unsigned char type value,count the total bits in it.What's the fastest way? I wrote three function as below,what's the best way,and can someone come up with a faster one?(I just want the extremely fast one)

const int tbl[] =
{
#define B2(n)   n, n+1, n+1, n+2
#define B4(n) B2(n), B2(n+1), B2(n+1), B2(n+2)
#define B6(n) B4(n), B4(n+1), B4(n+1), B4(n+2)
    B6(0), B6(1), B6(1), B6(2)
};

char naivecount (unsigned char val)
{
    char cnt = 0;
    while (val)
    {
        cnt += (val & 1);
        val = val >> 1;
    }
    return cnt;
}

inline tableLookUp(int val)
{
    assert(val >= 0 && val <= 255);
    return tbl[val];
}

int asmCount(int val)
{
    int res = 0;
    asm volatile("xor %0, %0\n\t"
            "begin:\n\t"
            "cmp $0x0, %1\n\t"
            "jle end\n\t"
            "movl %1, %%ecx\n\t"
            "and $0x1, %%ecx\n\t"
            "addl %%ecx, %0\n\t"
            "shrl %1\n\t"
            "jmp begin\n\t"
            "end:"
            : "=r"(res)
            : "r" (val));
    return res;
}

EDIT:

I have test all the method,the fastest one is to use the popcntl instruction.In platform without the instruction,I will use table look-up.

like image 745
prehistoricpenguin Avatar asked Dec 23 '12 08:12

prehistoricpenguin


People also ask

How do you efficiently count the number of 1 bits in a 32 bit number?

We can use binary logics to remove the loop, which is a lot faster. The first line counts the number of ones every two bits and store them using two bits. For example, 01101100 -> 01011000. The original can be grouped to 01 10 11 00, the number of ones is 1, 1, 2, 0 then written in binary is 01011000.

How can you tell if multiple bits are set?

Check if a number has two adjacent set bits in C++ To check this type of number, the idea is very simple. We will shift the number 1 bit, then do bitwise AND. If the bitwise AND result is non-zero, then there must be some consecutive 1s.

How do you make all bits 1?

Setting all bits can be done by using the | (OR) bit operator with 1s for each of the bits. This is because 1 OR with any number sets the number as 1. We show how to do this with 8-bit, 16-bit, and 32-bit registers.


2 Answers

If you want to code it by hand, try this:

#include <stdint.h>

int popcnt8(uint8_t x) {

    x = (x & 0x55) + (x >> 1 & 0x55);
    x = (x & 0x33) + (x >> 2 & 0x33);
    x = (x & 0x0f) + (x >> 4 & 0x0f);

    return x;
}

on x86, this compiles to (AT&T-syntax):

popcnt8:
    movl    %edi, %eax
    shrb    %dil
    andl    $85, %eax
    andl    $85, %edi
    addl    %eax, %edi
    movl    %edi, %eax
    shrb    $2, %dil
    andl    $51, %eax
    andl    $51, %edi
    addl    %eax, %edi
    movl    %edi, %eax
    shrb    $4, %dil
    andl    $15, %eax
    addl    %edi, %eax
    movzbl  %al, %eax
    ret

Compare this to what gcc generates with the intrinsic:

#include <stdint.h>

int popcnt8_intrin(uint8_t x) { return __builtin_popcount(x); }

On x86 with SSE 4.2:

popcnt8_intrin:
movzbl  %dil, %eax
popcntl %eax, %eax
ret

which is not optimal; clang generates:

popcnt8_intrin:
    popcntl %edi,%eax
    ret

reducing the calculation to one (!) instruction.

On x86 without SSE 4.2:

popcnt8_intrin:
subq    $8, %rsp
movzbl  %dil, %edi
call    __popcountdi2
addq    $8, %rsp
ret

gcc essentially calls its library here. Not quite optimal. clang does a little better:

popcnt8_intrin:                         # @popcnt8_intrin
movl    %edi, %eax
shrl    %eax
andl    $85, %eax
subl    %eax, %edi
movl    %edi, %eax
andl    $858993459, %eax        # imm = 0x33333333
shrl    $2, %edi
andl    $858993459, %edi        # imm = 0x33333333
addl    %eax, %edi
movl    %edi, %eax
shrl    $4, %eax
addl    %edi, %eax
andl    $252645135, %eax        # imm = 0xF0F0F0F
imull   $16843009, %eax, %eax   # imm = 0x1010101
shrl    $24, %eax
ret

clang calculates popcnt for a whole 32 bit number. This is not optimal imho.

like image 153
fuz Avatar answered Oct 05 '22 03:10

fuz


Your assembler code would be faster if you didn't do so many compares and branches that vary from taken and not taken.

But clearly, the fastest method is to do a byte lookup, particularly as you are only dealing with 256 values (you can use the naive method to write a list of the values, then just have a static const table[256] = { ... }; return table[value]; in your function.

Benchmark the different solutions.

I wouldn't be surprised if your assembler code is slower than the compiler generated code!

Edit: Your assembler code would be slight faster by doing:

int asmCount(int val)
{
    int res = 0;
    asm volatile("begin:\n\t"
            "movl %1, %%ecx\n\t"
            "and $0x1, %%ecx\n\t"
            "addl %%ecx, %0\n\t"
            "shrl %1\n\t"
            "jnz begin\n\t"
            "end:"
            : "=r"(res)
            : "r" (val)
            : "ecx");      // Important: clobbers ecx!
    return res;
}

I removed the xor (res = 0 should do that anyways), and compare (sure, if val is zero, we execute a few extra instructions, but for anything with high bits set, it is much worse, since it's two extra instructions for every iteration, meaning potentially 16 extra instructions - and one of them a branch!), and changed the jump to a jnz at the end of the loop. It probably is roughly what the compiler generates in your first case. Trying to beat the compiler on simple code isn't that easy!

like image 43
Mats Petersson Avatar answered Oct 05 '22 01:10

Mats Petersson