I have an array A
with zeros and ones. I want to find the sum of all numbers in A
. I want to test two functions:
First function
void test1(int curIndex){ if(curIndex == size) return; test1(curIndex+1); s+=A[curIndex]; }
Second function
void test2(int curIndex){ if(curIndex == size) return; s+=A[curIndex]; test2(curIndex+1); }
I used the PAPI library to count the number of instructions, here is the entire experiment:
#include <iostream> #include <fstream> #include "Statistics.h" using namespace std; int size; int *A; int s; void test3(int curIndex){ if(curIndex == size) return; test3(curIndex+1); s+=A[curIndex]; } int main(int argc, char* argv[]){ size = atoi(argv[1]); if(argc!=2){ cout<<"type ./executable size{odd integer}"<<endl; return 1; } if(size%2!=1){ cout<<"size must be an odd number"<<endl; return 1; } A = new int[size]; int i; for(i=0;i<size;i++){ if(i%2==0){ A[i] = false; } else{ A[i] = true; } } Statistics stat(1); stat.start(); test3(0); stat.stop(); stat.printWithHelp(); cout<<s<<endl; return 0; }
Here is the Statistics.h
file:
#ifndef TRIPLETPARSER_STATISTICS_H #define TRIPLETPARSER_STATISTICS_H #include <time.h> #include <unistd.h> #include <fstream> #include <papi.h> #include <iostream> #include <iostream> #define BILLION 1000000000LL using namespace std; class Statistics { private: timespec s, e; /* PAPI_BR_CN Conditional branch instructions PAPI_BR_INS Branch instructions PAPI_BR_MSP Conditional branch instructions mispredicted PAPI_BR_NTK Conditional branch instructions not taken PAPI_BR_PRC Conditional branch instructions correctly predicted PAPI_BR_TKN Conditional branch instructions taken PAPI_BR_UCN Unconditional branch instructions PAPI_BRU_IDL Cycles branch units are idle PAPI_BTAC_M Branch target address cache misses PAPI_TLB_DM Data translation lookaside buffer misses */ int events[10]; // , PAPI_L2_TCA,PAPI_L3_TCM,PAPI_L3_TCA PAPI_BR_CN, PAPI_BR_PRC}; //type of events we are interested in int num_hwcntrs; //total amount of events stored in 'events' array long long values[10]; long long counters[10]; void handle_error(int err){ std::cerr << "PAPI error: " << err << std::endl; } public: Statistics(int papi){ for(size_t i = 0; i< 10; i++) counters[i]=0.0; switch(papi){ case 0: num_hwcntrs = 0; break; case 1: num_hwcntrs = 6; events[0] = PAPI_L2_TCA; events[1] = PAPI_L3_TCA; events[2] = PAPI_L3_TCM; events[3] = PAPI_TOT_INS; events[4] = PAPI_BR_INS; events[5] = PAPI_BR_MSP; break; } } void start(){ for(size_t i = 0; i< 10; i++) counters[i]=0.0; if (num_hwcntrs != 0 && PAPI_start_counters(events, num_hwcntrs) != PAPI_OK) handle_error(1); } void start(float ratio){ if (num_hwcntrs != 0 && PAPI_start_counters(events, num_hwcntrs) != PAPI_OK) handle_error(1); } void stop(){ if (num_hwcntrs != 0 && PAPI_stop_counters(values, num_hwcntrs) != PAPI_OK) handle_error(1); update(); } void stop(float ratio){ if (num_hwcntrs != 0 && PAPI_stop_counters(values, num_hwcntrs) != PAPI_OK) handle_error(1); update(); } void update(){ for(size_t i = 0; i < num_hwcntrs; i++) counters[i] += values[i]; } void print(){ for(int i=0;i<num_hwcntrs;i++) std::cout << counters[i] << "\t"; //cout<<"L2 cache miss ratio: "<<counters[1]/(double)counters[0]<<endl; //cout<<"L3 cache miss ratio: "<<counters[3]/(double)counters[2]<<endl; } void printWithHelp(){ cout<<"L2 accesses: "<<counters[0]<<endl; cout<<"L2 miss/access ratio: "<<(double)counters[1]/counters[0]<<endl; cout<<"L3 accesses: "<<counters[1]<<endl; cout<<"L3 misses: "<<counters[2]<<endl; cout<<"L3 miss/access ratio: "<<(double)counters[2]/counters[1]<<endl; cout<<"Instructions: "<<counters[3]<<endl; cout<<"Branches: "<<counters[4]<<endl; cout<<"Branch mispredictions: "<<counters[5]<<endl; cout<<"Branch miss/predict ratio: "<<(double)counters[5]/counters[4]<<endl; } void print(float avg_ratio){ for (int i = 0; i<num_hwcntrs; i++) std::cout << (double)(avg_ratio*counters[i]) << "\t"; } }; #endif //TRIPLETPARSER_STATISTICS_H
This is the output that I get from the first function when the size of A
is 111,111
L2 accesses: 24126 L2 miss/access ratio: 0.131559 L3 accesses: 3174 L3 misses: 587 L3 miss/access ratio: 0.18494 Instructions: 1022776 Branches: 178113 Branch mispredictions: 6976 Branch miss/predict ratio: 0.0391661
This is the output that I get from the second function when the size of A
is 111,111
L2 accesses: 7090 L2 miss/access ratio: 0.163752 L3 accesses: 1161 L3 misses: 507 L3 miss/access ratio: 0.436693 Instructions: 555860 Branches: 111189 Branch mispredictions: 25 Branch miss/predict ratio: 0.000224842
Why the difference in the results? The instructions reduce by half, the branch mispredictions are almost eliminated. What is happening here?
Yes, it's called function overloading. Multiple functions are able to have the same name if you like, however MUST have different parameters.
First, the compiler converts the pure C++ code, now stripped of preprocessor directives, into low-level assembly code. In this parsing step, the compiler optimizes the source code by pointing out syntax errors, overload resolution errors and any other compile-time errors.
__builtin_* functions are optimised functions provided by the compiler libraries. These might be builtin versions of standard library functions, such as memcpy, and perhaps more typically some of the maths functions.
Your second function is tail-recursive. That means the compiler can optimize it to something like:
void test2(int curIndex){ while(true) { if(curIndex == size) return; s+=A[curIndex]; curIndex = curIndex + 1; } }
That reduces the number of instructions significantly. It also reduces the number of stack frames needed to (at most) one. As a result, it uses a lot less memory, which results in the reduction of cache misses.
The compiler is not able to make this optimization for the first function.
UPDATE: Some people have asked why the compiler cannot make that optimization on the first function.
Let's start by observing the function is not tail-recursive. A function is tail recursive if the very last thing that happens is a recursive call to the same function, followed by returning the result of that recursive call (if any).
Clearly, that is not the case for the first function, s+=A[curIndex];
is executed after the recursive call.
So then people have asked why the compiler cannot turn the first function into the second one.
The question is "why does G++ not have this feature?" The answer to that question is always the same. Features are unimplemented by default; G++ does not have that feature because no one designed, implemented and shipped the feature to customers.
That should be the end of it, but of course people will want to know why nobody designed, implemented and tested this feature. Well, maybe nobody thought of doing it. But more importantly, the feature would be far from trivial.
First of all, the compiler would have to understand that
test1(curIndex+1); s+=A[curIndex];
and
s+=A[curIndex]; test1(curIndex+1);
are equivalent. That is a non trivial observation, given that from a mechanical point of view they are not equivalent! Indeed, the first one effectively loops from the end of the array to the start, whereas the second one loops from the start to the end. Is that the same? It yields the same result when A is an int* (and s in an int), but it doesn't in other cases (e.g. when A is a double* and s is a double). Do we expect a compiler to be that smart?
So here we have a potential feature with a high cost to implement. But the cost may be worth it, if the benefit is high. Is the benefit high? I would guess that this happens very little in real code, i.e. developers are likely to write the second form anyway. So there you have it: an expensive feature with little benefit. IMHO, compiler developers are wise to spend their valuable time on more useful features.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With