I am running a performance test and found out that changing the order of the code makes it faster without compromising the result.
Performance is measured by execution time using the chrono library.
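A minimal sketch of that kind of measurement (my own illustration; the actual timing wrapper in the linked program may differ):

#include <chrono>
#include <iostream>

int main() {
    auto start = std::chrono::high_resolution_clock::now();
    // ... run the decomposition here ...
    auto end = std::chrono::high_resolution_clock::now();
    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
    std::cout << "elapsed: " << ms << " ms\n";
}

The loop in question: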
vector< vector<float> > U(matrix_size, vector<float>(matrix_size, 14));
vector< vector<float> > L(matrix_size, vector<float>(matrix_size, 12));
vector< vector<float> > matrix_positive_definite(matrix_size, vector<float>(matrix_size, 23));

for (i = 0; i < matrix_size; ++i) {
    for (j = 0; j < matrix_size; ++j) {
        //Part II : ________________________________________
        float sum2 = 0;
        for (k = 0; k <= (i-1); ++k) {
            float sum2_temp = L[i][k]*U[k][j];
            sum2 += sum2_temp;
        }
        //Part I : _____________________________________________
        float sum1 = 0;
        for (k = 0; k <= (j-1); ++k) {
            float sum1_temp = L[i][k]*U[k][j];
            sum1 += sum1_temp;
        }
        //__________________________________________
        if (i > j) {
            L[i][j] = (matrix_positive_definite[i][j] - sum1) / U[j][j];
        }
        else {
            U[i][j] = matrix_positive_definite[i][j] - sum2;
        }
    }
}
I compile with g++ -O3 (GCC 7.4.0, Intel i5, Windows 10).
I changed the order of Part I and Part II and got a faster result when Part II executes before Part I. What's going on?
This is the link to the whole program.
I would try running both versions with perf stat -d <app> and see where the performance counters differ.
When benchmarking, you may want to fix the CPU frequency so it doesn't affect your scores.
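For example (Linux; the binary names below are placeholders for the two builds, and cpupower may need to be installed separately):

sudo cpupower frequency-set -g performance   # run at max frequency so scaling doesn't skew the numbers
perf stat -d ./lu_part2_first                # Part II before Part I
perf stat -d ./lu_part1_first                # Part I before Part II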
Aligning loops on a 32-byte boundary often increases performance by 8-30%. See "Causes of Performance Instability due to Code Placement in X86" (Zia Ansari, Intel) for more details.
Try compiling your code with -O3 -falign-loops=32 -falign-functions=32 -march=native -mtune=native.
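For example (assuming the source file is called lu.cpp; substitute your own file name):

g++ -O3 -falign-loops=32 -falign-functions=32 -march=native -mtune=native lu.cpp -o lu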
Running perf stat -ddd while playing around with the provided program shows that the major difference between the two versions lies mainly in prefetching.
part II -> part I and part I -> part II (original program):
    73,069,502  L1-dcache-prefetch-misses

part II -> part I and part II -> part I (only the efficient version):
    31,719,117  L1-dcache-prefetch-misses

part I -> part II and part I -> part II (only the less efficient version):
    114,520,949  L1-dcache-prefetch-misses
NB: according to Compiler Explorer, the code generated for part II -> part I is very similar to that for part I -> part II.
I guess that, on the first iterations over i, part II does almost nothing, but the iterations over j make part I access U[k][j] in a pattern that eases prefetching for the next iterations over i.
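To see why the access pattern to U[k][j] matters, here is a small standalone sketch (my own illustration, not part of the linked program): with a row-major vector<vector<float>>, walking along a row is contiguous and prefetch-friendly, while walking down a column touches one element from each separately allocated row.

#include <chrono>
#include <iostream>
#include <vector>

int main() {
    const int n = 2000;
    std::vector<std::vector<float>> M(n, std::vector<float>(n, 1.0f));
    volatile float sink = 0;

    auto t0 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < n; ++i)          // row walk: consecutive elements of one row
        for (int k = 0; k < n; ++k)
            sink = sink + M[i][k];
    auto t1 = std::chrono::high_resolution_clock::now();
    for (int j = 0; j < n; ++j)          // column walk: one element from each row, like U[k][j]
        for (int k = 0; k < n; ++k)
            sink = sink + M[k][j];
    auto t2 = std::chrono::high_resolution_clock::now();

    using std::chrono::duration_cast;
    using std::chrono::milliseconds;
    std::cout << "row walk:    " << duration_cast<milliseconds>(t1 - t0).count() << " ms\n"
              << "column walk: " << duration_cast<milliseconds>(t2 - t1).count() << " ms\n";
}

The order of part I and part II changes which of these patterns runs first, and therefore what is already cached or being prefetched when the later iterations over i need U[k][j].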