Possible Duplicate:
How to count the number of set bits in a 32-bit integer?
Give a unsigned char type value,count the total bits in it.What's the fastest way? I wrote three function as below,what's the best way,and can someone come up with a faster one?(I just want the extremely fast one)
const int tbl[] =
{
#define B2(n) n, n+1, n+1, n+2
#define B4(n) B2(n), B2(n+1), B2(n+1), B2(n+2)
#define B6(n) B4(n), B4(n+1), B4(n+1), B4(n+2)
B6(0), B6(1), B6(1), B6(2)
};
char naivecount (unsigned char val)
{
char cnt = 0;
while (val)
{
cnt += (val & 1);
val = val >> 1;
}
return cnt;
}
inline tableLookUp(int val)
{
assert(val >= 0 && val <= 255);
return tbl[val];
}
int asmCount(int val)
{
int res = 0;
asm volatile("xor %0, %0\n\t"
"begin:\n\t"
"cmp $0x0, %1\n\t"
"jle end\n\t"
"movl %1, %%ecx\n\t"
"and $0x1, %%ecx\n\t"
"addl %%ecx, %0\n\t"
"shrl %1\n\t"
"jmp begin\n\t"
"end:"
: "=r"(res)
: "r" (val));
return res;
}
I have test all the method,the fastest one is to use the popcntl
instruction.In platform without the instruction,I will use table look-up.
We can use binary logics to remove the loop, which is a lot faster. The first line counts the number of ones every two bits and store them using two bits. For example, 01101100 -> 01011000. The original can be grouped to 01 10 11 00, the number of ones is 1, 1, 2, 0 then written in binary is 01011000.
Check if a number has two adjacent set bits in C++ To check this type of number, the idea is very simple. We will shift the number 1 bit, then do bitwise AND. If the bitwise AND result is non-zero, then there must be some consecutive 1s.
Setting all bits can be done by using the | (OR) bit operator with 1s for each of the bits. This is because 1 OR with any number sets the number as 1. We show how to do this with 8-bit, 16-bit, and 32-bit registers.
If you want to code it by hand, try this:
#include <stdint.h>
int popcnt8(uint8_t x) {
x = (x & 0x55) + (x >> 1 & 0x55);
x = (x & 0x33) + (x >> 2 & 0x33);
x = (x & 0x0f) + (x >> 4 & 0x0f);
return x;
}
on x86, this compiles to (AT&T-syntax):
popcnt8:
movl %edi, %eax
shrb %dil
andl $85, %eax
andl $85, %edi
addl %eax, %edi
movl %edi, %eax
shrb $2, %dil
andl $51, %eax
andl $51, %edi
addl %eax, %edi
movl %edi, %eax
shrb $4, %dil
andl $15, %eax
addl %edi, %eax
movzbl %al, %eax
ret
Compare this to what gcc generates with the intrinsic:
#include <stdint.h>
int popcnt8_intrin(uint8_t x) { return __builtin_popcount(x); }
On x86 with SSE 4.2:
popcnt8_intrin:
movzbl %dil, %eax
popcntl %eax, %eax
ret
which is not optimal; clang generates:
popcnt8_intrin:
popcntl %edi,%eax
ret
reducing the calculation to one (!) instruction.
On x86 without SSE 4.2:
popcnt8_intrin:
subq $8, %rsp
movzbl %dil, %edi
call __popcountdi2
addq $8, %rsp
ret
gcc essentially calls its library here. Not quite optimal. clang does a little better:
popcnt8_intrin: # @popcnt8_intrin
movl %edi, %eax
shrl %eax
andl $85, %eax
subl %eax, %edi
movl %edi, %eax
andl $858993459, %eax # imm = 0x33333333
shrl $2, %edi
andl $858993459, %edi # imm = 0x33333333
addl %eax, %edi
movl %edi, %eax
shrl $4, %eax
addl %edi, %eax
andl $252645135, %eax # imm = 0xF0F0F0F
imull $16843009, %eax, %eax # imm = 0x1010101
shrl $24, %eax
ret
clang calculates popcnt for a whole 32 bit number. This is not optimal imho.
Your assembler code would be faster if you didn't do so many compares and branches that vary from taken and not taken.
But clearly, the fastest method is to do a byte lookup, particularly as you are only dealing with 256 values (you can use the naive method to write a list of the values, then just have a static const table[256] = { ... }; return table[value];
in your function.
Benchmark the different solutions.
I wouldn't be surprised if your assembler code is slower than the compiler generated code!
Edit: Your assembler code would be slight faster by doing:
int asmCount(int val)
{
int res = 0;
asm volatile("begin:\n\t"
"movl %1, %%ecx\n\t"
"and $0x1, %%ecx\n\t"
"addl %%ecx, %0\n\t"
"shrl %1\n\t"
"jnz begin\n\t"
"end:"
: "=r"(res)
: "r" (val)
: "ecx"); // Important: clobbers ecx!
return res;
}
I removed the xor (res = 0 should do that anyways), and compare (sure, if val is zero, we execute a few extra instructions, but for anything with high bits set, it is much worse, since it's two extra instructions for every iteration, meaning potentially 16 extra instructions - and one of them a branch!), and changed the jump to a jnz at the end of the loop. It probably is roughly what the compiler generates in your first case. Trying to beat the compiler on simple code isn't that easy!
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With