I am running a performance test and found out that changing the order of the code makes it faster without compromising the result.
Performance is measured by execution time using the chrono library.
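A minimal sketch of that kind of measurement (my own illustration; the actual timing wrapper in the linked program may differ):

#include <chrono>
#include <iostream>

int main() {
    auto start = std::chrono::high_resolution_clock::now();
    // ... run the decomposition here ...
    auto end = std::chrono::high_resolution_clock::now();
    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
    std::cout << "elapsed: " << ms << " ms\n";
}

The loop in question: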
vector< vector<float> > U(matrix_size, vector<float>(matrix_size, 14));
vector< vector<float> > L(matrix_size, vector<float>(matrix_size, 12));
vector< vector<float> > matrix_positive_definite(matrix_size, vector<float>(matrix_size, 23));

for (i = 0; i < matrix_size; ++i) {
    for (j = 0; j < matrix_size; ++j) {
        //Part II : ________________________________________
        float sum2 = 0;
        for (k = 0; k <= (i-1); ++k) {
            float sum2_temp = L[i][k]*U[k][j];
            sum2 += sum2_temp;
        }
        //Part I : _____________________________________________
        float sum1 = 0;
        for (k = 0; k <= (j-1); ++k) {
            float sum1_temp = L[i][k]*U[k][j];
            sum1 += sum1_temp;
        }
        //__________________________________________
        if (i > j) {
            L[i][j] = (matrix_positive_definite[i][j] - sum1) / U[j][j];
        }
        else {
            U[i][j] = matrix_positive_definite[i][j] - sum2;
        }
    }
}
I compile with g++ -O3 (GCC 7.4.0, Intel i5, Windows 10).
I changed the order of Part I and Part II and got a faster result when Part II executes before Part I. What's going on?
This is the link to the whole program.
I would try running both versions with perf stat -d <app> and see where the performance counters differ.
When benchmarking, you may want to fix the CPU frequency so it doesn't affect your scores.
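For example (Linux; the binary names below are placeholders for the two builds, and cpupower may need to be installed separately):

sudo cpupower frequency-set -g performance   # run at max frequency so scaling doesn't skew the numbers
perf stat -d ./lu_part2_first                # Part II before Part I
perf stat -d ./lu_part1_first                # Part I before Part II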
Aligning loops on a 32-byte boundary often increases performance by 8-30%. See "Causes of Performance Instability due to Code Placement in X86" (Zia Ansari, Intel) for more details.
Try compiling your code with -O3 -falign-loops=32 -falign-functions=32 -march=native -mtune=native.
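For example (assuming the source file is called lu.cpp; substitute your own file name):

g++ -O3 -falign-loops=32 -falign-functions=32 -march=native -mtune=native lu.cpp -o lu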
Running perf stat -ddd while playing around with the provided program shows that the major difference between the two versions lies mainly in prefetching.
part II -> part I and part I -> part II (original program):
    73,069,502  L1-dcache-prefetch-misses

part II -> part I and part II -> part I (only the efficient version):
    31,719,117  L1-dcache-prefetch-misses

part I -> part II and part I -> part II (only the less efficient version):
    114,520,949  L1-dcache-prefetch-misses
NB: according to Compiler Explorer, the code generated for part II -> part I is very similar to that for part I -> part II.
I guess that, on the first iterations over i, part II does almost nothing, but the iterations over j make part I access U[k][j] in a pattern that eases prefetching for the next iterations over i.
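To see why the access pattern to U[k][j] matters, here is a small standalone sketch (my own illustration, not part of the linked program): with a row-major vector<vector<float>>, walking along a row is contiguous and prefetch-friendly, while walking down a column touches one element from each separately allocated row.

#include <chrono>
#include <iostream>
#include <vector>

int main() {
    const int n = 2000;
    std::vector<std::vector<float>> M(n, std::vector<float>(n, 1.0f));
    volatile float sink = 0;

    auto t0 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < n; ++i)          // row walk: consecutive elements of one row
        for (int k = 0; k < n; ++k)
            sink = sink + M[i][k];
    auto t1 = std::chrono::high_resolution_clock::now();
    for (int j = 0; j < n; ++j)          // column walk: one element from each row, like U[k][j]
        for (int k = 0; k < n; ++k)
            sink = sink + M[k][j];
    auto t2 = std::chrono::high_resolution_clock::now();

    using std::chrono::duration_cast;
    using std::chrono::milliseconds;
    std::cout << "row walk:    " << duration_cast<milliseconds>(t1 - t0).count() << " ms\n"
              << "column walk: " << duration_cast<milliseconds>(t2 - t1).count() << " ms\n";
}

The order of part I and part II changes which of these patterns runs first, and therefore what is already cached or being prefetched when the later iterations over i need U[k][j].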