I stumbled on __builtin_popcount for gcc after I had written my own bit count routines. But when I switched to __builtin_popcount my software actually ran slower. I'm on Unbutu on an Intel Core i3-4130T CPU @ 2.90GHz. I built a performance test to see what gives. It looks like this:
#include <iostream>
#include <sys/time.h>
#include <stdint.h>
using namespace std;
const int bitCount[256] = {
0,1,1,2,1,2,2,3, 1,2,2,3,2,3,3,4, 1,2,2,3,2,3,3,4, 2,3,3,4,3,4,4,5,
1,2,2,3,2,3,3,4, 2,3,3,4,3,4,4,5, 2,3,3,4,3,4,4,5, 3,4,4,5,4,5,5,6,
1,2,2,3,2,3,3,4, 2,3,3,4,3,4,4,5, 2,3,3,4,3,4,4,5, 3,4,4,5,4,5,5,6,
2,3,3,4,3,4,4,5, 3,4,4,5,4,5,5,6, 3,4,4,5,4,5,5,6, 4,5,5,6,5,6,6,7,
1,2,2,3,2,3,3,4, 2,3,3,4,3,4,4,5, 2,3,3,4,3,4,4,5, 3,4,4,5,4,5,5,6,
2,3,3,4,3,4,4,5, 3,4,4,5,4,5,5,6, 3,4,4,5,4,5,5,6, 4,5,5,6,5,6,6,7,
2,3,3,4,3,4,4,5, 3,4,4,5,4,5,5,6, 3,4,4,5,4,5,5,6, 4,5,5,6,5,6,6,7,
3,4,4,5,4,5,5,6, 4,5,5,6,5,6,6,7, 4,5,5,6,5,6,6,7, 5,6,6,7,6,7,7,8
};
const uint32_t m32_0001 = 0x000000ffu;
const uint32_t m32_0010 = 0x0000ff00u;
const uint32_t m32_0100 = 0x00ff0000u;
const uint32_t m32_1000 = 0xff000000u;
inline int countBits(uint32_t bitField)
{
return
bitCount[(bitField & m32_0001) ] +
bitCount[(bitField & m32_0010) >> 8] +
bitCount[(bitField & m32_0100) >> 16] +
bitCount[(bitField & m32_1000) >> 24];
}
inline long long currentTime() {
struct timeval ct;
gettimeofday(&ct, NULL);
return ct.tv_sec * 1000000LL + ct.tv_usec;
}
int main() {
long long start, delta, sum;
start = currentTime();
sum = 0;
for(unsigned i = 0; i < 100000000; ++i)
sum += countBits(i);
delta = currentTime() - start;
cout << "countBits : sum=" << sum << ": time (usec)=" << delta << endl;
start = currentTime();
sum = 0;
for(unsigned i = 0; i < 100000000; ++i)
sum += __builtin_popcount(i);
delta = currentTime() - start;
cout << "__builtin_popcount: sum=" << sum << ": time (usec)=" << delta << endl;
start = currentTime();
sum = 0;
for(unsigned i = 0; i < 100000000; ++i) {
int count;
asm("popcnt %1,%0" : "=r"(count) : "rm"(i) : "cc");
sum += count;
}
delta = currentTime() - start;
cout << "assembler : sum=" << sum << ": time (usec)=" << delta << endl;
return 0;
}
At first I ran this with an older compiler:
> g++ --version | head -1
g++ (Ubuntu 4.8.4-2ubuntu1~14.04.3) 4.8.4
> cat /proc/cpuinfo | grep 'model name' | head -1
model name : Intel(R) Core(TM) i3-4130T CPU @ 2.90GHz
> g++ -O3 popcountTest.cpp
> ./a.out
countBits : sum=1314447104: time (usec)=148506
__builtin_popcount: sum=1314447104: time (usec)=345122
assembler : sum=1314447104: time (usec)=138036
As you can see, the table-based countBits is almost as fast as the assembler and far-faster than __builtin_popcount. Then I tried a newer compiler on a different machine type (same processor -- and I think the mother board's the same too):
> g++ --version | head -1
g++ (Ubuntu 7.3.0-16ubuntu3) 7.3.0
> cat /proc/cpuinfo | grep 'model name' | head -1
model name : Intel(R) Core(TM) i3-4130T CPU @ 2.90GHz
> g++ -O3 popcountTest.cpp
> ./a.out
countBits : sum=1314447104: time (usec)=164247
__builtin_popcount: sum=1314447104: time (usec)=345167
assembler : sum=1314447104: time (usec)=138028
Curiously, the older compiler optimized my countBits function better than the newer compiler, but it still compares favorably with the assembler. Clearly since the assembler line compiles and runs, my processor supports popcount, but why then is __builtin_popcount more than two times slower? And how can my own routine possibly compete with the silicon-based popcount? I'm having the same experience with other routines for finding the first set bit, etc. My routines are all significantly faster than the GNU "builtin" equivalents.
(BTW, I have no clue how to write assembler. I just found that line on some web page and it miraculously seemed to work.)
In this comment, it's mentioned that the complexity of __builtin__popcount for any integer j with j = O(2 N) is O(N) (i.e ) instead of O(1).
__builtin_popcount(x): This function is used to count the number of one's(set bits) in an integer.
The number of odd integers in the nth row of Pascal's triangle equals 2b where b is the number of 1's in the binary representation of n. The function that takes a bit stream and returns the number of 1's is commonly called popcount , short for population count.
Without specifying an appropriate "-march" on the command line gcc generates a call to the __popcountdi2
function rather than the popcnt
instruction. See: https://godbolt.org/z/z1BihM
POPCNT is supported by Intel since Nehalem and AMD since Barcelona according to wikipedia: https://en.wikipedia.org/wiki/SSE4#POPCNT_and_LZCNT
I thought it might be useful to share the new performance results after adding -march=native to the compile line (as suggested by Mat and Alan Birtles) which enables use of the popcount machine instruction. The results are different depending on compiler version. Here's the older compiler:
> g++ --version | head -1
g++ (Ubuntu 4.8.4-2ubuntu1~14.04.3) 4.8.4
> cat /proc/cpuinfo | grep 'model name' | head -1
model name : Intel(R) Core(TM) i3-4130T CPU @ 2.90GHz
> g++ -march=native -O3 popcountTest.cpp
> ./a.out
countBits : sum=1314447104: time (usec)=163947
__builtin_popcount: sum=1314447104: time (usec)=138046
assembler : sum=1314447104: time (usec)=138036
And here's the newer compiler:
> g++ --version | head -1
g++ (Ubuntu 7.3.0-16ubuntu3) 7.3.0
> cat /proc/cpuinfo | grep 'model name' | head -1
model name : Intel(R) Core(TM) i3-4130T CPU @ 2.90GHz
> g++ -march=native -O3 popcountTest.cpp
> ./a.out
countBits : sum=1314447104: time (usec)=163133
__builtin_popcount: sum=1314447104: time (usec)=73987
assembler : sum=1314447104: time (usec)=138036
Observations:
Adding -march=native to the command line of the older g++ compiler improved the performance of __builtin_popcount to equal that of the assembler, and SLOWED my countbits routine by about 15%.
Adding -march=native to the command line of the newer g++ compiler caused the performance of __builtin_popcount to surpass that of the assembler. I presume this has something to do with the stack variable I used with the assembler, though I'm not sure. There was no effect on my countBits performance (which as stated in my question, was already slower with this newer compiler.)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With