I stumbled on __builtin_popcount for gcc after I had written my own bit count routines. But when I switched to __builtin_popcount my software actually ran slower. I'm on Unbutu on an Intel Core i3-4130T CPU @ 2.90GHz. I built a performance test to see what gives. It looks like this: <pre class="prettyprint"><code>#include <iostream> #include <sys/time.h> #include <stdint.h> using namespace std; const int bitCount[256] = { 0,1,1,2,1,2,2,3, 1,2,2,3,2,3,3,4, 1,2,2,3,2,3,3,4, 2,3,3,4,3,4,4,5, 1,2,2,3,2,3,3,4, 2,3,3,4,3,4,4,5, 2,3,3,4,3,4,4,5, 3,4,4,5,4,5,5,6, 1,2,2,3,2,3,3,4, 2,3,3,4,3,4,4,5, 2,3,3,4,3,4,4,5, 3,4,4,5,4,5,5,6, 2,3,3,4,3,4,4,5, 3,4,4,5,4,5,5,6, 3,4,4,5,4,5,5,6, 4,5,5,6,5,6,6,7, 1,2,2,3,2,3,3,4, 2,3,3,4,3,4,4,5, 2,3,3,4,3,4,4,5, 3,4,4,5,4,5,5,6, 2,3,3,4,3,4,4,5, 3,4,4,5,4,5,5,6, 3,4,4,5,4,5,5,6, 4,5,5,6,5,6,6,7, 2,3,3,4,3,4,4,5, 3,4,4,5,4,5,5,6, 3,4,4,5,4,5,5,6, 4,5,5,6,5,6,6,7, 3,4,4,5,4,5,5,6, 4,5,5,6,5,6,6,7, 4,5,5,6,5,6,6,7, 5,6,6,7,6,7,7,8 }; const uint32_t m32_0001 = 0x000000ffu; const uint32_t m32_0010 = 0x0000ff00u; const uint32_t m32_0100 = 0x00ff0000u; const uint32_t m32_1000 = 0xff000000u; inline int countBits(uint32_t bitField) { return bitCount[(bitField & m32_0001) ] + bitCount[(bitField & m32_0010) >> 8] + bitCount[(bitField & m32_0100) >> 16] + bitCount[(bitField & m32_1000) >> 24]; } inline long long currentTime() { struct timeval ct; gettimeofday(&ct, NULL); return ct.tv_sec * 1000000LL + ct.tv_usec; } int main() { long long start, delta, sum; start = currentTime(); sum = 0; for(unsigned i = 0; i < 100000000; ++i) sum += countBits(i); delta = currentTime() - start; cout << "countBits : sum=" << sum << ": time (usec)=" << delta << endl; start = currentTime(); sum = 0; for(unsigned i = 0; i < 100000000; ++i) sum += __builtin_popcount(i); delta = currentTime() - start; cout << "__builtin_popcount: sum=" << sum << ": time (usec)=" << delta << endl; start = currentTime(); sum = 0; for(unsigned i = 0; i < 100000000; ++i) { int count; asm("popcnt %1,%0" : "=r"(count) : "rm"(i) : "cc"); sum += count; } delta = currentTime() - start; cout << "assembler : sum=" << sum << ": time (usec)=" << delta << endl; return 0; } </code></pre> At first I ran this with an older compiler: <pre class="prettyprint"><code>> g++ --version | head -1 g++ (Ubuntu 4.8.4-2ubuntu1~14.04.3) 4.8.4 > cat /proc/cpuinfo | grep 'model name' | head -1 model name : Intel(R) Core(TM) i3-4130T CPU @ 2.90GHz > g++ -O3 popcountTest.cpp > ./a.out countBits : sum=1314447104: time (usec)=148506 __builtin_popcount: sum=1314447104: time (usec)=345122 assembler : sum=1314447104: time (usec)=138036 </code></pre> As you can see, the table-based countBits is almost as fast as the assembler and far-faster than __builtin_popcount. Then I tried a newer compiler on a different machine type (same processor -- and I think the mother board's the same too): <pre class="prettyprint"><code>> g++ --version | head -1 g++ (Ubuntu 7.3.0-16ubuntu3) 7.3.0 > cat /proc/cpuinfo | grep 'model name' | head -1 model name : Intel(R) Core(TM) i3-4130T CPU @ 2.90GHz > g++ -O3 popcountTest.cpp > ./a.out countBits : sum=1314447104: time (usec)=164247 __builtin_popcount: sum=1314447104: time (usec)=345167 assembler : sum=1314447104: time (usec)=138028 </code></pre> Curiously, the older compiler optimized my countBits function better than the newer compiler, but it still compares favorably with the assembler. Clearly since the assembler line compiles and runs, my processor supports popcount, but why then is __builtin_popcount more than two times slower? And how can my own routine possibly compete with the silicon-based popcount? I'm having the same experience with other routines for finding the first set bit, etc. My routines are all significantly faster than the GNU "builtin" equivalents. (BTW, I have no clue how to write assembler. I just found that line on some web page and it miraculously seemed to work.)

Without specifying an appropriate "-march" on the command line gcc generates a call to the <code>__popcountdi2</code> function rather than the <code>popcnt</code> instruction. See: https://godbolt.org/z/z1BihM POPCNT is supported by Intel since Nehalem and AMD since Barcelona according to wikipedia: https://en.wikipedia.org/wiki/SSE4#POPCNT_and_LZCNT

Why is __builtin_popcount slower than my own bit counting function?

Tags:

c++

gcc

I stumbled on __builtin_popcount for gcc after I had written my own bit count routines. But when I switched to __builtin_popcount my software actually ran slower. I'm on Unbutu on an Intel Core i3-4130T CPU @ 2.90GHz. I built a performance test to see what gives. It looks like this:

#include <iostream>
#include <sys/time.h>
#include <stdint.h>

using namespace std;

const int bitCount[256] = {
    0,1,1,2,1,2,2,3,  1,2,2,3,2,3,3,4,  1,2,2,3,2,3,3,4,  2,3,3,4,3,4,4,5,
    1,2,2,3,2,3,3,4,  2,3,3,4,3,4,4,5,  2,3,3,4,3,4,4,5,  3,4,4,5,4,5,5,6,
    1,2,2,3,2,3,3,4,  2,3,3,4,3,4,4,5,  2,3,3,4,3,4,4,5,  3,4,4,5,4,5,5,6,
    2,3,3,4,3,4,4,5,  3,4,4,5,4,5,5,6,  3,4,4,5,4,5,5,6,  4,5,5,6,5,6,6,7,
    1,2,2,3,2,3,3,4,  2,3,3,4,3,4,4,5,  2,3,3,4,3,4,4,5,  3,4,4,5,4,5,5,6,
    2,3,3,4,3,4,4,5,  3,4,4,5,4,5,5,6,  3,4,4,5,4,5,5,6,  4,5,5,6,5,6,6,7,
    2,3,3,4,3,4,4,5,  3,4,4,5,4,5,5,6,  3,4,4,5,4,5,5,6,  4,5,5,6,5,6,6,7,
    3,4,4,5,4,5,5,6,  4,5,5,6,5,6,6,7,  4,5,5,6,5,6,6,7,  5,6,6,7,6,7,7,8
};

const uint32_t m32_0001 = 0x000000ffu;
const uint32_t m32_0010 = 0x0000ff00u;
const uint32_t m32_0100 = 0x00ff0000u;
const uint32_t m32_1000 = 0xff000000u;

inline int countBits(uint32_t bitField)
{
    return
        bitCount[(bitField & m32_0001)      ] +
        bitCount[(bitField & m32_0010) >>  8] +
        bitCount[(bitField & m32_0100) >> 16] +
        bitCount[(bitField & m32_1000) >> 24];
}

inline long long currentTime() {
    struct timeval ct;
    gettimeofday(&ct, NULL);
    return ct.tv_sec * 1000000LL + ct.tv_usec;
}

int main() {
    long long start, delta, sum;

    start = currentTime();
    sum = 0;
    for(unsigned i = 0; i < 100000000; ++i)
        sum += countBits(i);
    delta = currentTime() - start;
    cout << "countBits         : sum=" << sum << ": time (usec)=" << delta << endl;

    start = currentTime();
    sum = 0;
    for(unsigned i = 0; i < 100000000; ++i)
        sum += __builtin_popcount(i);
    delta = currentTime() - start;
    cout << "__builtin_popcount: sum=" << sum << ": time (usec)=" << delta << endl;

    start = currentTime();
    sum = 0;
    for(unsigned i = 0; i < 100000000; ++i) {
        int count;
        asm("popcnt %1,%0" : "=r"(count) : "rm"(i) : "cc");
        sum += count;
    }
    delta = currentTime() - start;
    cout << "assembler         : sum=" << sum << ": time (usec)=" << delta << endl;

    return 0;
}

At first I ran this with an older compiler:

> g++ --version | head -1
g++ (Ubuntu 4.8.4-2ubuntu1~14.04.3) 4.8.4
> cat /proc/cpuinfo | grep 'model name' | head -1
model name      : Intel(R) Core(TM) i3-4130T CPU @ 2.90GHz
> g++ -O3 popcountTest.cpp
> ./a.out
countBits         : sum=1314447104: time (usec)=148506
__builtin_popcount: sum=1314447104: time (usec)=345122
assembler         : sum=1314447104: time (usec)=138036

As you can see, the table-based countBits is almost as fast as the assembler and far-faster than __builtin_popcount. Then I tried a newer compiler on a different machine type (same processor -- and I think the mother board's the same too):

> g++ --version | head -1
g++ (Ubuntu 7.3.0-16ubuntu3) 7.3.0
> cat /proc/cpuinfo | grep 'model name' | head -1
model name      : Intel(R) Core(TM) i3-4130T CPU @ 2.90GHz
> g++ -O3 popcountTest.cpp
> ./a.out
countBits         : sum=1314447104: time (usec)=164247
__builtin_popcount: sum=1314447104: time (usec)=345167
assembler         : sum=1314447104: time (usec)=138028

Curiously, the older compiler optimized my countBits function better than the newer compiler, but it still compares favorably with the assembler. Clearly since the assembler line compiles and runs, my processor supports popcount, but why then is __builtin_popcount more than two times slower? And how can my own routine possibly compete with the silicon-based popcount? I'm having the same experience with other routines for finding the first set bit, etc. My routines are all significantly faster than the GNU "builtin" equivalents.

(BTW, I have no clue how to write assembler. I just found that line on some web page and it miraculously seemed to work.)

503

asked Sep 04 '18 08:09

Matthew Busche

2 Answers

Without specifying an appropriate "-march" on the command line gcc generates a call to the __popcountdi2 function rather than the popcnt instruction. See: https://godbolt.org/z/z1BihM

POPCNT is supported by Intel since Nehalem and AMD since Barcelona according to wikipedia: https://en.wikipedia.org/wiki/SSE4#POPCNT_and_LZCNT

139

answered Sep 28 '22 12:09

Alan Birtles

I thought it might be useful to share the new performance results after adding -march=native to the compile line (as suggested by Mat and Alan Birtles) which enables use of the popcount machine instruction. The results are different depending on compiler version. Here's the older compiler:

> g++ --version | head -1
g++ (Ubuntu 4.8.4-2ubuntu1~14.04.3) 4.8.4
> cat /proc/cpuinfo | grep 'model name' | head -1
model name      : Intel(R) Core(TM) i3-4130T CPU @ 2.90GHz
> g++ -march=native -O3 popcountTest.cpp
> ./a.out
countBits         : sum=1314447104: time (usec)=163947
__builtin_popcount: sum=1314447104: time (usec)=138046
assembler         : sum=1314447104: time (usec)=138036

And here's the newer compiler:

> g++ --version | head -1
g++ (Ubuntu 7.3.0-16ubuntu3) 7.3.0
> cat /proc/cpuinfo | grep 'model name' | head -1
model name      : Intel(R) Core(TM) i3-4130T CPU @ 2.90GHz
> g++ -march=native -O3 popcountTest.cpp
> ./a.out
countBits         : sum=1314447104: time (usec)=163133
__builtin_popcount: sum=1314447104: time (usec)=73987
assembler         : sum=1314447104: time (usec)=138036

Observations:

Adding -march=native to the command line of the older g++ compiler improved the performance of __builtin_popcount to equal that of the assembler, and SLOWED my countbits routine by about 15%.
Adding -march=native to the command line of the newer g++ compiler caused the performance of __builtin_popcount to surpass that of the assembler. I presume this has something to do with the stack variable I used with the assembler, though I'm not sure. There was no effect on my countBits performance (which as stated in my question, was already slower with this newer compiler.)

answered Sep 28 '22 12:09

Matthew Busche

Related questions
                            
                                QMetaObject::connectSlotsByName: No matching signal
                            
                                How can one emulate the C++ namespace feature in a C code?
                            
                                C++ - Default arguments cannot be added to an out-of-line definition of a member of a class template
                            
                                QSqlQuery size() always returns -1
                            
                                Why do uninitialized objects of built-in type defined inside a function body have undefined value?
                            
                                Does adding enumerators into enum break ABI?
                            
                                Will two atomic writes to different locations in different threads always be seen in the same order by other threads?
                            
                                Are () and {} always equivalent when used for initialization with "new"?
                            
                                Will there be standardization of scope guard/scope exit idioms?
                            
                                C++ assignment operator in derived class [duplicate]
                            
                                C++ - How to have multiple threads write to a file
                            
                                Make an array an optional parameter for a c++ function
                            
                                Can division by non-zero still create a nan / infinity
                            
                                Linux - moving the console cursor visual
                            
                                Fastest way to produce a mask with n ones starting at position i
                            
                                Horizontal XOR in AVX
                            
                                How to unit test if a private variable has changed
                            
                                C++ const double pointer
                            
                                Difference between norm, normalize and normalized in eigen
                            
                                Why does std::is_same give a different result for the two types?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Why is __builtin_popcount slower than my own bit counting function?

Tags:

c++

gcc

Matthew Busche

People also ask

2 Answers

Alan Birtles

Matthew Busche

Recent Activity

Donate For Us