Fastest way to count bits [duplicate]

Tags:

assembly

Possible Duplicate:
How to count the number of set bits in a 32-bit integer?

Give a unsigned char type value,count the total bits in it.What's the fastest way? I wrote three function as below,what's the best way,and can someone come up with a faster one?(I just want the extremely fast one)

const int tbl[] =
{
#define B2(n)   n, n+1, n+1, n+2
#define B4(n) B2(n), B2(n+1), B2(n+1), B2(n+2)
#define B6(n) B4(n), B4(n+1), B4(n+1), B4(n+2)
    B6(0), B6(1), B6(1), B6(2)
};

char naivecount (unsigned char val)
{
    char cnt = 0;
    while (val)
    {
        cnt += (val & 1);
        val = val >> 1;
    }
    return cnt;
}

inline tableLookUp(int val)
{
    assert(val >= 0 && val <= 255);
    return tbl[val];
}

int asmCount(int val)
{
    int res = 0;
    asm volatile("xor %0, %0\n\t"
            "begin:\n\t"
            "cmp $0x0, %1\n\t"
            "jle end\n\t"
            "movl %1, %%ecx\n\t"
            "and $0x1, %%ecx\n\t"
            "addl %%ecx, %0\n\t"
            "shrl %1\n\t"
            "jmp begin\n\t"
            "end:"
            : "=r"(res)
            : "r" (val));
    return res;
}

EDIT:

I have test all the method,the fastest one is to use the popcntl instruction.In platform without the instruction,I will use table look-up.

745

asked Dec 23 '12 08:12

prehistoricpenguin

2 Answers

If you want to code it by hand, try this:

#include <stdint.h>

int popcnt8(uint8_t x) {

    x = (x & 0x55) + (x >> 1 & 0x55);
    x = (x & 0x33) + (x >> 2 & 0x33);
    x = (x & 0x0f) + (x >> 4 & 0x0f);

    return x;
}

on x86, this compiles to (AT&T-syntax):

popcnt8:
    movl    %edi, %eax
    shrb    %dil
    andl    $85, %eax
    andl    $85, %edi
    addl    %eax, %edi
    movl    %edi, %eax
    shrb    $2, %dil
    andl    $51, %eax
    andl    $51, %edi
    addl    %eax, %edi
    movl    %edi, %eax
    shrb    $4, %dil
    andl    $15, %eax
    addl    %edi, %eax
    movzbl  %al, %eax
    ret

Compare this to what gcc generates with the intrinsic:

#include <stdint.h>

int popcnt8_intrin(uint8_t x) { return __builtin_popcount(x); }

On x86 with SSE 4.2:

popcnt8_intrin:
movzbl  %dil, %eax
popcntl %eax, %eax
ret

which is not optimal; clang generates:

popcnt8_intrin:
    popcntl %edi,%eax
    ret

reducing the calculation to one (!) instruction.

On x86 without SSE 4.2:

popcnt8_intrin:
subq    $8, %rsp
movzbl  %dil, %edi
call    __popcountdi2
addq    $8, %rsp
ret

gcc essentially calls its library here. Not quite optimal. clang does a little better:

popcnt8_intrin:                         # @popcnt8_intrin
movl    %edi, %eax
shrl    %eax
andl    $85, %eax
subl    %eax, %edi
movl    %edi, %eax
andl    $858993459, %eax        # imm = 0x33333333
shrl    $2, %edi
andl    $858993459, %edi        # imm = 0x33333333
addl    %eax, %edi
movl    %edi, %eax
shrl    $4, %eax
addl    %edi, %eax
andl    $252645135, %eax        # imm = 0xF0F0F0F
imull   $16843009, %eax, %eax   # imm = 0x1010101
shrl    $24, %eax
ret

clang calculates popcnt for a whole 32 bit number. This is not optimal imho.

153

answered Oct 05 '22 03:10

fuz

Your assembler code would be faster if you didn't do so many compares and branches that vary from taken and not taken.

But clearly, the fastest method is to do a byte lookup, particularly as you are only dealing with 256 values (you can use the naive method to write a list of the values, then just have a static const table[256] = { ... }; return table[value]; in your function.

Benchmark the different solutions.

I wouldn't be surprised if your assembler code is slower than the compiler generated code!

Edit: Your assembler code would be slight faster by doing:

int asmCount(int val)
{
    int res = 0;
    asm volatile("begin:\n\t"
            "movl %1, %%ecx\n\t"
            "and $0x1, %%ecx\n\t"
            "addl %%ecx, %0\n\t"
            "shrl %1\n\t"
            "jnz begin\n\t"
            "end:"
            : "=r"(res)
            : "r" (val)
            : "ecx");      // Important: clobbers ecx!
    return res;
}

I removed the xor (res = 0 should do that anyways), and compare (sure, if val is zero, we execute a few extra instructions, but for anything with high bits set, it is much worse, since it's two extra instructions for every iteration, meaning potentially 16 extra instructions - and one of them a branch!), and changed the jump to a jnz at the end of the loop. It probably is roughly what the compiler generates in your first case. Trying to beat the compiler on simple code isn't that easy!

answered Oct 05 '22 01:10

Mats Petersson

Related questions
                            
                                Undefined reference to getaddrinfo
                            
                                Unexpected result in C program
                            
                                How to get the AST from YACC?
                            
                                How in portable C to seek forward when reading from a pipe
                            
                                scanf() curious behaviour!
                            
                                Why won't LD_PRELOAD work with Python?
                            
                                Why doesn't %zd printf format work in VS2010?
                            
                                Atomic reads in C
                            
                                Memory pools implementation in C
                            
                                Example of waitpid, WNOHANG, and SIGCHLD
                            
                                why "int main(anything_you_type)" doesnt produce any error?
                            
                                A mechanism to ensure exception propagation when mixing C and C++ code
                            
                                Visual studio indentaion of function arguments
                            
                                Write binary data with Haskell to be read by C?
                            
                                undefined reference to `gzopen` error
                            
                                Why does mode_t use 4 byte?
                            
                                cross platform sound API [closed]
                            
                                scanf dynamic allocation
                            
                                what is ## in c?
                            
                                Fastest way to find up to 6 consecutive 0 bits in a char array

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With