Why does changing `const ull` to `const ull&` in function parameter result in performance gain?

Question

So with the following code, changing the type of the parameter x from const ull to const ull& (with typedef unsigned long long ull) results in a roughly 25% speedup when compiled with gcc 4.7.2 and flags -O3 -std=c++11 -g, and I can't figure out why this would make such a big difference.

static void inline single_mult(const std::vector<ull>::iterator& data,                   const std::vector<ull>::const_iterator& rbegin,                   const std::vector<ull>::const_iterator& rend,                   const ull x) {         ull tmp=0, carry=0, i=0;         for (auto rhs_it = rbegin; rhs_it != rend; ++rhs_it) {                 tmp = x*(*rhs_it) + data[i] + carry;                 if (tmp >= imax) {                         carry = tmp >> numbits;                         tmp &= imax - 1;                 } else {                          carry = 0;                 }                 data[i++] = tmp;         }         data[i] += carry; }

It is called in the following function (for doing schoolbook long multiplication)

static void naive(std::vector<ull>::iterator data,                std::vector<ull>::const_iterator cbegin,               std::vector<ull>::const_iterator cend  ,               std::vector<ull>::const_iterator rbegin,               std::vector<ull>::const_iterator rend) {      for (auto data_it  = cbegin;           data_it != cend; ++data_it) {         if (*data_it != 0) {             single_mult(data, rbegin, rend, *data_it);         }         ++data;     } }

The timing is done by calling clock() around a loop to measure how long it takes. Not the most accurate/precise way, but I figured a consistent 25% difference meant the difference was statistically significant.

Full working code block:

#include <vector> #include <limits> #include <ctime> #include <iostream>  typedef unsigned long long ull; typedef unsigned int uint; const ull numbits = (ull) std::numeric_limits<uint>::digits;         const ull imax = 1LL << numbits;       static void inline single_mult(const std::vector<ull>::iterator& data,               const std::vector<ull>::const_iterator& rbegin,               const std::vector<ull>::const_iterator& rend,               const ull x) {     ull tmp=0, carry=0, i=0;     for (auto rhs_it = rbegin; rhs_it != rend; ++rhs_it) {             tmp = x*(*rhs_it) + data[i] + carry;             if (tmp >= imax) {                     carry = tmp >> numbits;                     tmp &= imax - 1;             } else {                     carry = 0;             }             data[i++] = tmp;     }     data[i] += carry; }  static void naive(std::vector<ull>::iterator data,               std::vector<ull>::const_iterator cbegin,               std::vector<ull>::const_iterator cend  ,               std::vector<ull>::const_iterator rbegin,               std::vector<ull>::const_iterator rend) {      for (auto data_it  = cbegin; data_it != cend; ++data_it) {             if (*data_it != 0) {                     single_mult(data, rbegin, rend, *data_it);             }     ++data;     } }   int main() {     int N = 10000;     std::vector<ull> vec(N);     for (int i=0; i<N; ++i) {         vec[i] = i;     }      auto s1 = clock();     int num = 10;     std::vector<ull> result(2*N);     for (int i=0; i<num; ++i) {     //Need to zero out the vector or the result will be different.         std::fill(result.begin(), result.end(), 0);         naive(result.begin(), vec.cbegin(), vec.cend(), vec.cbegin(), vec.cend());     }     s1 = (clock() - s1);     std::cout << "Took " << float(s1)/CLOCKS_PER_SEC << "seconds total." << std::endl; }

And runtimes (I named the file that passes the value value.cpp and reference reference.cpp)

$ g++ -O3 -std=c++11 -g -o reference reference.cpp $ g++ -O3 -std=c++11 -g -o value value.cpp $ ./reference                                                                                                                                            Took 1.05seconds total.                                                                         $ ./value                                                                  Took 1.83seconds total.

amdn · Accepted Answer

I was able to reproduce your observation of the speedup, it was even more noticeable for me (1.75x faster). The problem seems to be when you pass x by value it enables the compiler to perform optimizations that it otherwise doesn't do, and these optimizations backfire, they apparently introduce an unintended stall in the processor. The compiler generates a couple of conditional moves, back to back, rather than compare and branch. The compare and branch runs much faster than the back-to-back conditional moves.

I was able to avoid this by simplifying the code that the compiler had trouble with, namely this

if (tmp >= imax) {     carry = tmp >> numbits;     tmp &= imax - 1; } else {     carry = 0; }

can be reduced to

carry = tmp >> numbits; tmp &= imax - 1;

Here's the version of gcc I'm using

g++ --version g++ (GCC) 4.6.3 20120306 (Red Hat 4.6.3-2)

These are the commands I used, perf record will profile your code and perf report will annotate the source and disassembly with the results of the profile

 g++ -std=gnu++0x -O3 -g single_mult.cpp -o single_mult  perf record ./single_mult  perf report

Once in perf report press enter on main and select Annotate main, you'll see a disassembly of your program along with source code and the percent of the time the profiler found your program running at each instruction in the function... actually these numbers must be taken as a hint only, often you will see instructions with big counts when in fact it was the previous instruction that stalled, or missed in the cache, or there was a mis-predicted branch, etc. So when you see a big count look back to see what may have caused it. The reason is the profile is statistical, it interrupts your program at a constant rate and looks to see where the instruction pointer is, and often the interrupt happens while the processor is stalled due to a cache miss or a mis-predicted branch or some internal data dependency.

I increased the number of iterations to allow more time for the profiler

int N = 20000;

When x is passed by value it Took 11.58seconds total and this is what I see, notice the cmovbe instructions

       :            ull tmp=0, carry=0, i=0;                                                                                                                                                             :            for (auto rhs_it = rbegin; rhs_it != rend; ++rhs_it) {                                                                                                                               :                    tmp = x*(*rhs_it) + data[i] + carry;                                                                                                                                   11.10 :          400b40:       4c 89 c9                mov    %r9,%rcx                                                                                                                             0.00 :          400b43:       48 89 fa                mov    %rdi,%rdx                                                                                                                            0.01 :          400b46:       48 03 14 c6             add    (%rsi,%rax,8),%rdx                                                                                                                  11.65 :          400b4a:       48 0f af 0c c3          imul   (%rbx,%rax,8),%rcx                                                                                                                   0.99 :          400b4f:       48 01 ca                add    %rcx,%rdx                                                                                                                                 :                    if (tmp >= imax) {                                                                                                                                                           :                            carry = tmp >> numbits;                                                                                                                                         2.25 :          400b52:       48 89 d7                mov    %rdx,%rdi                                                                                                                                 :                            tmp &= imax - 1;                                                                                                                                              10.99 :          400b55:       48 89 d1                mov    %rdx,%rcx                                                                                                                                 :                      const ull x) {                                                                                                                                                             :            ull tmp=0, carry=0, i=0;                                                                                                                                                             :            for (auto rhs_it = rbegin; rhs_it != rend; ++rhs_it) {                                                                                                                               :                    tmp = x*(*rhs_it) + data[i] + carry;                                                                                                                                         :                    if (tmp >= imax) {                                                                                                                                                            :                            carry = tmp >> numbits;                                                                                                                                         0.69 :          400b58:       48 c1 ef 20             shr    $0x20,%rdi                                                                                                                                :                            tmp &= imax - 1;                                                                                                                                                 9.54 :          400b5c:       83 e1 ff                and    $0xffffffff,%ecx                                                                                                                      9.05 :          400b5f:       4c 39 c2                cmp    %r8,%rdx                                                                                                                            10.78 :          400b62:       49 0f 46 fb             cmovbe %r11,%rdi                                                                                                                                  :                    } else {                                                                                                                                                                      :                            carry = 0;                                                                                                                                                            :                    }                                                                                                                                                                             :                    data[i++] = tmp;                                                                                                                                                       20.73 :          400b66:       48 83 c0 01             add    $0x1,%rax                                                                                                                             0.02 :          400b6a:       4c 39 c2                cmp    %r8,%rdx                                                                                                                              0.17 :          400b6d:       48 0f 46 ca             cmovbe %rdx,%rcx                                                                                                                                 :        static void inline single_mult(const std::vector<ull>::iterator& data,                                                                                                                    :                      const std::vector<ull>::const_iterator& rbegin,                                                                                                                            :                      const std::vector<ull>::const_iterator& rend,                                                                                                                              :                      const ull x) {                                                                                                                                                             :            ull tmp=0, carry=0, i=0;                                                                                                                                                             :            for (auto rhs_it = rbegin; rhs_it != rend; ++rhs_it) {                                                                                                                         11.47 :          400b71:       4c 39 d0                cmp    %r10,%rax                                                                                                                                 :                            carry = tmp >> numbits;                                                                                                                                              :                            tmp &= imax - 1;                                                                                                                                                     :                    } else {                                                                                                                                                                     :                            carry = 0;                                                                                                                                                           :                    }                                                                                                                                                                            :                    data[i++] = tmp;                                                                                                                                                        0.01 :          400b74:       48 89 4c c6 f8          mov    %rcx,-0x8(%rsi,%rax,8)                                                                                                                    :        static void inline single_mult(const std::vector<ull>::iterator& data,                                                                                                                   :                      const std::vector<ull>::const_iterator& rbegin,                                                                                                                            :                      const std::vector<ull>::const_iterator& rend,                                                                                                                              :                      const ull x) {                                                                                                                                                             :            ull tmp=0, carry=0, i=0;                                                                                                                                                             :            for (auto rhs_it = rbegin; rhs_it != rend; ++rhs_it) {                                                                                                                          0.53 :          400b79:       75 c5                   jne    400b40 <main+0x250>                                                                                                                       :                    } else {                                                                                                                                                                     :                            carry = 0;                                                                                                                                                           :                    }                                                                                                                                                                            :                    data[i++] = tmp;                                                                                                                                                             :            }                                                                                                                                                                                     :            data[i] += carry;                                                                                                                                                               0.00 :          400b7b:       4a 01 3c d6             add    %rdi,(%rsi,%r10,8)                                                                                                                   0.01 :          400b7f:       48 83 c5 08             add    $0x8,%rbp

When x is passed by reference it Took 6.59seconds total, this is what I see

       :            ull tmp=0, carry=0, i=0;                                                                                                                                                             :            for (auto rhs_it = rbegin; rhs_it != rend; ++rhs_it) {                                                                                                                         20.90 :          400b30:       48 8b 17                mov    (%rdi),%rdx                                                                                                                               :                    tmp = x*(*rhs_it) + data[i] + carry;                                                                                                                                    1.38 :          400b33:       49 0f af 14 c1          imul   (%r9,%rax,8),%rdx                                                                                                                      4.82 :          400b38:       48 03 0c c6             add    (%rsi,%rax,8),%rcx                                                                                                                  22.41 :          400b3c:       48 01 ca                add    %rcx,%rdx                                                                                                                                 :                    if (tmp >= imax) {                                                                                                                                                           :                            carry = tmp >> numbits;                                                                                                                                              :                            tmp &= imax - 1;                                                                                                                                                       :                    } else {                                                                                                                                                                     :                            carry = 0;                                                                                                                                                      2.95 :          400b3f:       31 c9                   xor    %ecx,%ecx                                                                                                                                 :                      const std::vector<ull>::const_iterator& rend,                                                                                                                              :                      const ull &x) {                                                                                                                                                            :            ull tmp=0, carry=0, i=0;                                                                                                                                                             :            for (auto rhs_it = rbegin; rhs_it != rend; ++rhs_it) {                                                                                                                               :                    tmp = x*(*rhs_it) + data[i] + carry;                                                                                                                                         :                    if (tmp >= imax) {                                                                                                                                                      0.23 :          400b41:       4c 39 d2                cmp    %r10,%rdx                                                                                                                            0.00 :          400b44:       76 0a                   jbe    400b50 <main+0x260>                                                                                                                       :                            carry = tmp >> numbits;                                                                                                                                         2.27 :          400b46:       48 89 d1                mov    %rdx,%rcx                                                                                                                                 :                            tmp &= imax - 1;                                                                                                                                                1.29 :          400b49:       83 e2 ff                and    $0xffffffff,%edx                                                                                                                          :                      const ull &x) {                                                                                                                                                            :            ull tmp=0, carry=0, i=0;                                                                                                                                                             :            for (auto rhs_it = rbegin; rhs_it != rend; ++rhs_it) {                                                                                                                               :                    tmp = x*(*rhs_it) + data[i] + carry;                                                                                                                                         :                    if (tmp >= imax) {                                                                                                                                                           :                            carry = tmp >> numbits;                                                                                                                                         0.26 :          400b4c:       48 c1 e9 20             shr    $0x20,%rcx                                                                                                                                :                            tmp &= imax - 1;                                                                                                                                                     :                    } else {                                                                                                                                                                     :                            carry = 0;                                                                                                                                                           :                    }                                                                                                                                                                            :                    data[i++] = tmp;                                                                                                                                                       19.67 :          400b50:       48 83 c0 01             add    $0x1,%rax                                                                                                                                 :        static void inline single_mult(const std::vector<ull>::iterator& data,                                                                                                                   :                      const std::vector<ull>::const_iterator& rbegin,                                                                                                                            :                      const std::vector<ull>::const_iterator& rend,                                                                                                                              :                      const ull &x) {                                                                                                                                                            :            ull tmp=0, carry=0, i=0;                                                                                                                                                             :            for (auto rhs_it = rbegin; rhs_it != rend; ++rhs_it) {                                                                                                                          0.53 :          400b54:       4c 39 c0                cmp    %r8,%rax                                                                                                                                  :                            carry = tmp >> numbits;                                                                                                                                              :                            tmp &= imax - 1;                                                                                                                                                     :                    } else {                                                                                                                                                                     :                            carry = 0;                                                                                                                                                           :                    }                                                                                                                                                                              :                    data[i++] = tmp;                                                                                                                                                        0.39 :          400b57:       48 89 54 c6 f8          mov    %rdx,-0x8(%rsi,%rax,8)                                                                                                                    :        static void inline single_mult(const std::vector<ull>::iterator& data,                                                                                                                   :                      const std::vector<ull>::const_iterator& rbegin,                                                                                                                            :                      const std::vector<ull>::const_iterator& rend,                                                                                                                              :                      const ull &x) {                                                                                                                                                            :            ull tmp=0, carry=0, i=0;                                                                                                                                                             :            for (auto rhs_it = rbegin; rhs_it != rend; ++rhs_it) {                                                                                                                         22.91 :          400b5c:       75 d2                   jne    400b30 <main+0x240>                                                                                                                       :                    } else {                                                                                                                                                                     :                            carry = 0;                                                                                                                                                           :                    }                                                                                                                                                                            :                    data[i++] = tmp;                                                                                                                                                             :            }                                                                                                                                                                                    :            data[i] += carry;                                                                                                                                                               0.00 :          400b5e:       4a 01 0c c6             add    %rcx,(%rsi,%r8,8)                                                                                                                    0.00 :          400b62:       48 83 c7 08             add    $0x8,%rdi

Why does changing `const ull` to `const ull&` in function parameter result in performance gain?

Tags:

c++

performance

optimization

gcc

c++11

0xE6

1 Answers

amdn

Recent Activity

Donate For Us

Why does changing `const ull` to `const ull&` in function parameter result in performance gain?

Tags:

c++

performance

optimization

gcc

c++11

0xE6

1 Answers

amdn

Related questions

Recent Activity

Donate For Us