Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Optimization of double subtraction in C++

I have the following code that I use to compute the distance between two vectors:

double dist(vector<double> & vecA, vector<double> & vecB){
    double curDist = 0.0;
    for (size_t i = 0; i < vecA.size(); i++){
        double dif = vecA[i] - vecB[i];
        curDist += dif * dif;
    }

    return curDist;
}

This function is a major bottleneck in my application since it relies on a lot of distance calculations, consuming more than 60% of CPU time on a typical input. Additionally, the following line:

double dif = vecA[i] - vecB[i];

is responsible for more than 77% of CPU time in this function. My question is: is it possible to somehow optimize this function?

Notes:

  • To profile my application I have used Intel Amplifier XE;
  • Reducing the number of distance computations is not a feasible solution for me;
like image 866
Alceu Costa Avatar asked Dec 28 '22 06:12

Alceu Costa


2 Answers

There are two possible issues I can think of right now:

  • This computation is memory bound.
  • There is an iteration-to-iteration dependency on curDist.

This computation is memory bound.

Your dataset is larger than your CPU cache. So in this case, no amount of optimization is going to help unless you can restructure your algorithm.


There is an iteration-to-iteration dependency on curDist.

You have a dependency on curDist. This will block vectorization by the compiler. (Also, don't always trust the profiler numbers to the line. They can be inaccurate especially after compiler optimizations.)

Normally, the compiler vectorizer can split up the curDist into multiple partial sums to and unroll/vectorize the loop. But it can't do that under strict-floating-point behavior. You can try relaxing your floating-point mode if you haven't already. Or you can split the sum and unroll it yourself.

For example, this kind of optimization is something the compiler can do with integers, but not necessarily with floating-point:

double curDist0 = 0.0;
double curDist1 = 0.0;
double curDist2 = 0.0;
double curDist3 = 0.0;
for (size_t i = 0; i < vecA.size() - 3; i += 4){
    double dif0 = vecA[i + 0] - vecB[i + 0];
    double dif1 = vecA[i + 1] - vecB[i + 1];
    double dif2 = vecA[i + 2] - vecB[i + 2];
    double dif3 = vecA[i + 3] - vecB[i + 3];
    curDist0 += dif0 * dif0;
    curDist1 += dif1 * dif1;
    curDist2 += dif2 * dif2;
    curDist3 += dif3 * dif3;
}

//  Do some sort of cleanup in case (vecA.size() % 4 != 0)

double curDist = curDist0 + curDist1 + curDist2 + curDist3;
like image 195
Mysticial Avatar answered Jan 08 '23 02:01

Mysticial


You could eliminate the call to vecA.size() for each iteration of the loop, just call it once before the loop. You could also do loop unrolling to give yourself more computation per loop iteration. What compiler are you using, and what optimization settings? Compiler will often do unrolling for you, but you could manually do it.

like image 43
TJD Avatar answered Jan 08 '23 01:01

TJD