Faster computation of (approximate) variance needed

I can see with the CPU profiler that compute_variances() is the bottleneck of my project.

  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ms/call  ms/call  name    
 75.63      5.43     5.43       40   135.75   135.75  compute_variances(unsigned int, std::vector<Point, std::allocator<Point> > const&, float*, float*, unsigned int*)
 19.08      6.80     1.37                             readDivisionSpace(Division_Euclidean_space&, char*)
 ...

Here is the body of the function:

void compute_variances(size_t t, const std::vector<Point>& points, float* avg,
                       float* var, size_t* split_dims) {
  for (size_t d = 0; d < points[0].dim(); d++) {
    avg[d] = 0.0;
    var[d] = 0.0;
  }
  float delta, n;
  /* Knuth's (Welford's) online update: after the loop, avg[d] holds the mean
     of dimension d and var[d] holds the running sum of squared deviations. */
  for (size_t i = 0; i < points.size(); ++i) {
    n = 1.0 + i;
    for (size_t d = 0; d < points[0].dim(); ++d) {
      delta = (points[i][d]) - avg[d];
      avg[d] += delta / n;
      var[d] += delta * ((points[i][d]) - avg[d]);
    }
  }

  /* Find t dimensions with largest scaled variance. */
  kthLargest(var, points[0].dim(), t, split_dims);
}

where kthLargest() doesn't seem to be a problem, since I see that:

0.00 7.18 0.00 40 0.00 0.00 kthLargest(float*, int, int, unsigned int*)

compute_variances() takes a vector of vectors of floats (i.e. a vector of Points, where Point is a class I have implemented) and computes the variance in each dimension, using Knuth's online algorithm.

Here is how I call the function:

float avg[(*points)[0].dim()];
float var[(*points)[0].dim()];
size_t split_dims[t];

compute_variances(t, *points, avg, var, split_dims);

The question is: can I do better? I would be happy to trade some accuracy for speed, i.e. an approximate computation of the variances would be fine. Or maybe I could make the code more cache friendly?

I compiled like this:

g++ main_noTime.cpp -std=c++0x -p -pg -O3 -o eg

Notice that before this edit, I had compiled with -o3 (lowercase 'o'), not with a capital 'O'. Thanks to ypnos, I now compile with the optimization flag -O3. I am sure there was a difference between them, since I performed time measurements with one of these methods on my pseudo-site.

Note that compute_variances() now dominates the overall project's running time!

[EDIT]

compute_variances() is called 40 times.

For each group of 10 calls, one of the following holds:

points.size() = 1000   and points[0].dim() = 10000
points.size() = 10000  and points[0].dim() = 100
points.size() = 10000  and points[0].dim() = 10000
points.size() = 100000 and points[0].dim() = 100

Each call handles different data.

Q: How fast is access to points[i][d]?

A: points[i] is just the i-th element of the std::vector, while the second [] is implemented like this in the Point class:

const FT& operator [](const int i) const {
  if (i < (int) coords.size() && i >= 0)
     return coords.at(i);
  else {
     std::cout << "Error at Point::[]" << std::endl;
     exit(1);
  }
  return coords[0];  // Clear -Wall warning 
}

where coords is a std::vector of float values. This seems a bit heavy, but shouldn't the compiler be smart enough to predict correctly that the branch is always taken? (I mean after the cold start.) Moreover, std::vector::at() is supposed to be constant time (as stated in the reference). I changed this to call only .at() in the body of the function and the time measurements stayed pretty much the same.
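For comparison, a fully unchecked accessor would look like the sketch below (a suggestion, not code from the project); note that coords.at(i) still performs a range check on every call, while std::vector's own operator[] does not:

const FT& operator [](const int i) const {
  return coords[i];  // no bounds check at all
}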

The division in compute_variances() is for sure something heavy! However, Knuth's algorithm is numerically stable, and I was not able to find another algorithm that is both numerically stable and free of division.
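For illustration, here is a sketch of a cheap rearrangement of the same loop (not a different algorithm, just Knuth's update with the division hoisted out of the inner loop and each coordinate read only once); whether the compiler already does this on its own would have to be checked:

for (size_t i = 0; i < points.size(); ++i) {
  const float inv_n = 1.0f / (1.0f + i);        // one division per point
  for (size_t d = 0; d < points[0].dim(); ++d) {
    const float x = points[i][d];               // read the coordinate once
    const float delta = x - avg[d];
    avg[d] += delta * inv_n;                    // instead of delta / n
    var[d] += delta * (x - avg[d]);
  }
}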

Note that I am not interested in parallelism right now.

[EDIT.2]

Minimal example of the Point class (I hope I haven't left out anything relevant):

class Point {
 public:

  typedef float FT;

  ...

  /**
   * Get dimension of point.
   */
  size_t dim() const {
    return coords.size();
  }

  /**
   * Operator that returns the coordinate at the given index.
   * @param i - index of the coordinate
   * @return the coordinate at index i
   */
  FT& operator [](const int i) {
    return coords.at(i);
    //it's the same if I have the commented code below
    /*if (i < (int) coords.size() && i >= 0)
      return coords.at(i);
    else {
      std::cout << "Error at Point::[]" << std::endl;
      exit(1);
    }
    return coords[0];  // Clear -Wall warning*/
  }

  /**
   * Operator that returns the coordinate at the given index. (constant)
   * @param i - index of the coordinate
   * @return the coordinate at index i
   */
  const FT& operator [](const int i) const {
        return coords.at(i);
    /*if (i < (int) coords.size() && i >= 0)
      return coords.at(i);
    else {
      std::cout << "Error at Point::[]" << std::endl;
      exit(1);
    }
    return coords[0];  // Clear -Wall warning*/
  }

 private:
  std::vector<FT> coords;
};
asked May 29 '14 by gsamaras


1 Answer

1. SIMD

One easy speedup for this is to use vector instructions (SIMD) for the computation. On x86 that means SSE or AVX instructions. Depending on your word length and processor, you can get speedups of about 4x or even more. This code here:

for (size_t d = 0; d < points[0].dim(); ++d) {
    delta = (points[i][d]) - avg[d];
    avg[d] += delta / n;
    var[d] += delta * ((points[i][d]) - avg[d]);
}

can be sped up by doing the computation for four elements at once with SSE. As your code really only processes a single element in each loop iteration, there is no bottleneck standing in the way of this. If you go down to 16-bit shorts instead of 32-bit floats (an approximation, then), you can fit eight elements into one instruction. With AVX it would be even more, but you need a recent processor for that.
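A rough sketch of what such a vectorized update could look like with SSE intrinsics, assuming the coordinates of a point are reachable as a contiguous float array (e.g. through a hypothetical data() accessor on Point) and that the dimension is a multiple of 4:

#include <immintrin.h>

// Sketch only: Knuth's update for four dimensions at a time.
// p points to the coordinates of the current point, n = i + 1.
void update_variances_sse(const float* p, float* avg, float* var,
                          size_t dim, float n) {
  const __m128 vn = _mm_set1_ps(n);
  for (size_t d = 0; d < dim; d += 4) {
    __m128 x     = _mm_loadu_ps(p + d);
    __m128 mean  = _mm_loadu_ps(avg + d);
    __m128 m2    = _mm_loadu_ps(var + d);
    __m128 delta = _mm_sub_ps(x, mean);                            // delta = x - avg
    mean = _mm_add_ps(mean, _mm_div_ps(delta, vn));                // avg += delta / n
    m2   = _mm_add_ps(m2, _mm_mul_ps(delta, _mm_sub_ps(x, mean))); // var += delta * (x - avg)
    _mm_storeu_ps(avg + d, mean);
    _mm_storeu_ps(var + d, m2);
  }
}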

This alone is not the solution to your performance problem, but it is one technique that can be combined with the others.

2. Micro-parallelism

The second easy speedup when you have that many loop iterations is parallel processing. I typically use Intel TBB; others might suggest OpenMP instead. For this you would probably have to change the loop order: parallelize over d in the outer loop, not over i.
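A minimal sketch of that idea with OpenMP (compile with -fopenmp); each thread then owns a disjoint range of dimensions, so the running mean and variance need no synchronization. Note that this loop order gives up the row-wise access pattern of the original:

// Sketch: d is the outer, parallel loop; i stays inside.
#pragma omp parallel for schedule(static)
for (long d = 0; d < (long)points[0].dim(); ++d) {
  float mean = 0.0f, m2 = 0.0f;
  for (size_t i = 0; i < points.size(); ++i) {
    const float delta = points[i][d] - mean;
    mean += delta / (1.0f + i);
    m2   += delta * (points[i][d] - mean);
  }
  avg[d] = mean;
  var[d] = m2;
}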

You can combine both techniques, and if you do it right, on a quad-core with HT you might get a speed-up of 25-30x for the combination, without any loss in accuracy.

3. Compiler optimization

First of all, maybe it is just a typo here on SO, but it needs to be -O3, not -o3! As a general note, it might be easier for the compiler to optimize your code if you declare the variables delta and n within the scope where you actually use them. You should also try the -funroll-loops compiler option as well as -march. The argument to the latter depends on your CPU, but nowadays -march=core2 is typically fine (also for recent AMDs) and includes SSE optimizations (but I would not trust the compiler just yet to do that for your loop).
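Put together, the compile line could then look something like this (the exact -march value depends on your machine):

g++ main_noTime.cpp -std=c++0x -p -pg -O3 -funroll-loops -march=core2 -o eg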

answered Sep 23 '22 by ypnos