So with the following code, changing the type of the parameter x
from const ull
to const ull&
(with typedef unsigned long long ull
) results in a roughly 25% speedup when compiled with gcc 4.7.2 and flags -O3 -std=c++11 -g
, and I can't figure out why this would make such a big difference.
static void inline single_mult(const std::vector<ull>::iterator& data, const std::vector<ull>::const_iterator& rbegin, const std::vector<ull>::const_iterator& rend, const ull x) { ull tmp=0, carry=0, i=0; for (auto rhs_it = rbegin; rhs_it != rend; ++rhs_it) { tmp = x*(*rhs_it) + data[i] + carry; if (tmp >= imax) { carry = tmp >> numbits; tmp &= imax - 1; } else { carry = 0; } data[i++] = tmp; } data[i] += carry; }
It is called in the following function (for doing schoolbook long multiplication)
static void naive(std::vector<ull>::iterator data, std::vector<ull>::const_iterator cbegin, std::vector<ull>::const_iterator cend , std::vector<ull>::const_iterator rbegin, std::vector<ull>::const_iterator rend) { for (auto data_it = cbegin; data_it != cend; ++data_it) { if (*data_it != 0) { single_mult(data, rbegin, rend, *data_it); } ++data; } }
The timing is done by calling clock()
around a loop to measure how long it takes. Not the most accurate/precise way, but I figured a consistent 25% difference meant the difference was statistically significant.
Full working code block:
#include <vector> #include <limits> #include <ctime> #include <iostream> typedef unsigned long long ull; typedef unsigned int uint; const ull numbits = (ull) std::numeric_limits<uint>::digits; const ull imax = 1LL << numbits; static void inline single_mult(const std::vector<ull>::iterator& data, const std::vector<ull>::const_iterator& rbegin, const std::vector<ull>::const_iterator& rend, const ull x) { ull tmp=0, carry=0, i=0; for (auto rhs_it = rbegin; rhs_it != rend; ++rhs_it) { tmp = x*(*rhs_it) + data[i] + carry; if (tmp >= imax) { carry = tmp >> numbits; tmp &= imax - 1; } else { carry = 0; } data[i++] = tmp; } data[i] += carry; } static void naive(std::vector<ull>::iterator data, std::vector<ull>::const_iterator cbegin, std::vector<ull>::const_iterator cend , std::vector<ull>::const_iterator rbegin, std::vector<ull>::const_iterator rend) { for (auto data_it = cbegin; data_it != cend; ++data_it) { if (*data_it != 0) { single_mult(data, rbegin, rend, *data_it); } ++data; } } int main() { int N = 10000; std::vector<ull> vec(N); for (int i=0; i<N; ++i) { vec[i] = i; } auto s1 = clock(); int num = 10; std::vector<ull> result(2*N); for (int i=0; i<num; ++i) { //Need to zero out the vector or the result will be different. std::fill(result.begin(), result.end(), 0); naive(result.begin(), vec.cbegin(), vec.cend(), vec.cbegin(), vec.cend()); } s1 = (clock() - s1); std::cout << "Took " << float(s1)/CLOCKS_PER_SEC << "seconds total." << std::endl; }
And runtimes (I named the file that passes the value value.cpp
and reference reference.cpp
)
$ g++ -O3 -std=c++11 -g -o reference reference.cpp $ g++ -O3 -std=c++11 -g -o value value.cpp $ ./reference Took 1.05seconds total. $ ./value Took 1.83seconds total.
I was able to reproduce your observation of the speedup, it was even more noticeable for me (1.75x faster). The problem seems to be when you pass x by value it enables the compiler to perform optimizations that it otherwise doesn't do, and these optimizations backfire, they apparently introduce an unintended stall in the processor. The compiler generates a couple of conditional moves, back to back, rather than compare and branch. The compare and branch runs much faster than the back-to-back conditional moves.
I was able to avoid this by simplifying the code that the compiler had trouble with, namely this
if (tmp >= imax) { carry = tmp >> numbits; tmp &= imax - 1; } else { carry = 0; }
can be reduced to
carry = tmp >> numbits; tmp &= imax - 1;
Here's the version of gcc I'm using
g++ --version g++ (GCC) 4.6.3 20120306 (Red Hat 4.6.3-2)
These are the commands I used, perf record
will profile your code and perf report
will annotate the source and disassembly with the results of the profile
g++ -std=gnu++0x -O3 -g single_mult.cpp -o single_mult perf record ./single_mult perf report
Once in perf report press enter
on main
and select Annotate main
, you'll see a disassembly of your program along with source code and the percent of the time the profiler found your program running at each instruction in the function... actually these numbers must be taken as a hint only, often you will see instructions with big counts when in fact it was the previous instruction that stalled, or missed in the cache, or there was a mis-predicted branch, etc. So when you see a big count look back to see what may have caused it. The reason is the profile is statistical, it interrupts your program at a constant rate and looks to see where the instruction pointer is, and often the interrupt happens while the processor is stalled due to a cache miss or a mis-predicted branch or some internal data dependency.
I increased the number of iterations to allow more time for the profiler
int N = 20000;
When x is passed by value it Took 11.58seconds total
and this is what I see, notice the cmovbe
instructions
: ull tmp=0, carry=0, i=0; : for (auto rhs_it = rbegin; rhs_it != rend; ++rhs_it) { : tmp = x*(*rhs_it) + data[i] + carry; 11.10 : 400b40: 4c 89 c9 mov %r9,%rcx 0.00 : 400b43: 48 89 fa mov %rdi,%rdx 0.01 : 400b46: 48 03 14 c6 add (%rsi,%rax,8),%rdx 11.65 : 400b4a: 48 0f af 0c c3 imul (%rbx,%rax,8),%rcx 0.99 : 400b4f: 48 01 ca add %rcx,%rdx : if (tmp >= imax) { : carry = tmp >> numbits; 2.25 : 400b52: 48 89 d7 mov %rdx,%rdi : tmp &= imax - 1; 10.99 : 400b55: 48 89 d1 mov %rdx,%rcx : const ull x) { : ull tmp=0, carry=0, i=0; : for (auto rhs_it = rbegin; rhs_it != rend; ++rhs_it) { : tmp = x*(*rhs_it) + data[i] + carry; : if (tmp >= imax) { : carry = tmp >> numbits; 0.69 : 400b58: 48 c1 ef 20 shr $0x20,%rdi : tmp &= imax - 1; 9.54 : 400b5c: 83 e1 ff and $0xffffffff,%ecx 9.05 : 400b5f: 4c 39 c2 cmp %r8,%rdx 10.78 : 400b62: 49 0f 46 fb cmovbe %r11,%rdi : } else { : carry = 0; : } : data[i++] = tmp; 20.73 : 400b66: 48 83 c0 01 add $0x1,%rax 0.02 : 400b6a: 4c 39 c2 cmp %r8,%rdx 0.17 : 400b6d: 48 0f 46 ca cmovbe %rdx,%rcx : static void inline single_mult(const std::vector<ull>::iterator& data, : const std::vector<ull>::const_iterator& rbegin, : const std::vector<ull>::const_iterator& rend, : const ull x) { : ull tmp=0, carry=0, i=0; : for (auto rhs_it = rbegin; rhs_it != rend; ++rhs_it) { 11.47 : 400b71: 4c 39 d0 cmp %r10,%rax : carry = tmp >> numbits; : tmp &= imax - 1; : } else { : carry = 0; : } : data[i++] = tmp; 0.01 : 400b74: 48 89 4c c6 f8 mov %rcx,-0x8(%rsi,%rax,8) : static void inline single_mult(const std::vector<ull>::iterator& data, : const std::vector<ull>::const_iterator& rbegin, : const std::vector<ull>::const_iterator& rend, : const ull x) { : ull tmp=0, carry=0, i=0; : for (auto rhs_it = rbegin; rhs_it != rend; ++rhs_it) { 0.53 : 400b79: 75 c5 jne 400b40 <main+0x250> : } else { : carry = 0; : } : data[i++] = tmp; : } : data[i] += carry; 0.00 : 400b7b: 4a 01 3c d6 add %rdi,(%rsi,%r10,8) 0.01 : 400b7f: 48 83 c5 08 add $0x8,%rbp
When x is passed by reference it Took 6.59seconds total
, this is what I see
: ull tmp=0, carry=0, i=0; : for (auto rhs_it = rbegin; rhs_it != rend; ++rhs_it) { 20.90 : 400b30: 48 8b 17 mov (%rdi),%rdx : tmp = x*(*rhs_it) + data[i] + carry; 1.38 : 400b33: 49 0f af 14 c1 imul (%r9,%rax,8),%rdx 4.82 : 400b38: 48 03 0c c6 add (%rsi,%rax,8),%rcx 22.41 : 400b3c: 48 01 ca add %rcx,%rdx : if (tmp >= imax) { : carry = tmp >> numbits; : tmp &= imax - 1; : } else { : carry = 0; 2.95 : 400b3f: 31 c9 xor %ecx,%ecx : const std::vector<ull>::const_iterator& rend, : const ull &x) { : ull tmp=0, carry=0, i=0; : for (auto rhs_it = rbegin; rhs_it != rend; ++rhs_it) { : tmp = x*(*rhs_it) + data[i] + carry; : if (tmp >= imax) { 0.23 : 400b41: 4c 39 d2 cmp %r10,%rdx 0.00 : 400b44: 76 0a jbe 400b50 <main+0x260> : carry = tmp >> numbits; 2.27 : 400b46: 48 89 d1 mov %rdx,%rcx : tmp &= imax - 1; 1.29 : 400b49: 83 e2 ff and $0xffffffff,%edx : const ull &x) { : ull tmp=0, carry=0, i=0; : for (auto rhs_it = rbegin; rhs_it != rend; ++rhs_it) { : tmp = x*(*rhs_it) + data[i] + carry; : if (tmp >= imax) { : carry = tmp >> numbits; 0.26 : 400b4c: 48 c1 e9 20 shr $0x20,%rcx : tmp &= imax - 1; : } else { : carry = 0; : } : data[i++] = tmp; 19.67 : 400b50: 48 83 c0 01 add $0x1,%rax : static void inline single_mult(const std::vector<ull>::iterator& data, : const std::vector<ull>::const_iterator& rbegin, : const std::vector<ull>::const_iterator& rend, : const ull &x) { : ull tmp=0, carry=0, i=0; : for (auto rhs_it = rbegin; rhs_it != rend; ++rhs_it) { 0.53 : 400b54: 4c 39 c0 cmp %r8,%rax : carry = tmp >> numbits; : tmp &= imax - 1; : } else { : carry = 0; : } : data[i++] = tmp; 0.39 : 400b57: 48 89 54 c6 f8 mov %rdx,-0x8(%rsi,%rax,8) : static void inline single_mult(const std::vector<ull>::iterator& data, : const std::vector<ull>::const_iterator& rbegin, : const std::vector<ull>::const_iterator& rend, : const ull &x) { : ull tmp=0, carry=0, i=0; : for (auto rhs_it = rbegin; rhs_it != rend; ++rhs_it) { 22.91 : 400b5c: 75 d2 jne 400b30 <main+0x240> : } else { : carry = 0; : } : data[i++] = tmp; : } : data[i] += carry; 0.00 : 400b5e: 4a 01 0c c6 add %rcx,(%rsi,%r8,8) 0.00 : 400b62: 48 83 c7 08 add $0x8,%rdi
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With