
Why is std::accumulate so slow? [duplicate]

I'm trying to sum array elements using a simple for loop, std::accumulate, and a manually unrolled for loop. As I expected, the manually unrolled loop is the fastest, but the more interesting result is that std::accumulate is much slower than the simple loop. This is my code; I compiled it with gcc 4.7 and the -O3 flag. Visual Studio will need a different rdtsc implementation.

#include <iostream>
#include <algorithm>
#include <numeric>
#include <stdint.h>


using namespace std;

// Read the CPU's time-stamp counter (x86 rdtsc); GCC-style inline asm.
__inline__ uint64_t rdtsc() {
  uint64_t a, d;
  __asm__ volatile ("rdtsc" : "=a" (a), "=d" (d));
  return (d<<32) | a;
}

class mytimer
{
 public:
  mytimer() { _start_time = rdtsc(); }
  void   restart() { _start_time = rdtsc(); }
  uint64_t elapsed() const
  { return  rdtsc() - _start_time; }

 private:
  uint64_t _start_time;
}; // timer

int main()
{
    const int num_samples = 1000;
    float* samples = new float[num_samples];
    mytimer timer;
    for (int i = 0; i < num_samples; i++) {
        samples[i] = 1.f;
    }
    double result = timer.elapsed();
    std::cout << "rewrite of " << (num_samples*sizeof(float)/(1024*1024)) << " Mb takes " << result << std::endl;

    timer.restart();
    float sum = 0;
    for (int i = 0; i < num_samples; i++) {
        sum += samples[i];
    }
    result = timer.elapsed();
    std::cout << "naive:\t\t" << result << ", sum = " << sum << std::endl;

    timer.restart();
    float* end = samples + num_samples;
    sum = 0;
    for(float* i = samples; i < end; i++) {
        sum += *i;
    }
    result = timer.elapsed();
    std::cout << "pointers:\t\t" << result << ", sum = " << sum << std::endl;

    timer.restart();
    sum = 0;
    sum = std::accumulate(samples, end, 0);
    result = timer.elapsed();
    std::cout << "algorithm:\t" << result << ", sum = " << sum << std::endl;

    // With ILP
    timer.restart();
    float sum0 = 0, sum1 = 0;
    sum = 0;
    for (int i = 0; i < num_samples; i+=2) {
        sum0 += samples[i];
        sum1 += samples[i+1];
    }
    sum = sum0 + sum1;
    result = timer.elapsed();
    std::cout << "ILP:\t\t" << result << ", sum = " << sum << std::endl;
    delete[] samples;
}
asked Feb 10 '14 by Evgeny Lazin

2 Answers

For starters, your use of std::accumulate is summing integers: the initial value you pass is the int literal 0, so you're probably paying the cost of converting each floating point value to an integer before adding it. Try:

sum = std::accumulate( samples, end, 0.f );

and see if that doesn't make a difference.
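
Here's a minimal standalone sketch of what's going on (this example is mine, not code from the question): std::accumulate deduces its accumulator type from the initial value, so passing the int literal 0 forces every partial sum back through an int.

#include <numeric>
#include <vector>
#include <iostream>

int main()
{
    std::vector<float> v(8, 0.5f);
    // int init: each partial sum is truncated back to int, so the result is 0
    int   as_int   = std::accumulate(v.begin(), v.end(), 0);
    // float init: the accumulation stays in float, so the result is 4
    float as_float = std::accumulate(v.begin(), v.end(), 0.f);
    std::cout << as_int << " vs " << as_float << std::endl;
}

With the question's array of 1.0f values the int version still happens to produce the right sum, but it pays for the conversion on every element, which is the extra work being pointed at here.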

answered Nov 09 '22 by James Kanze

Since you (apparently) care about doing this fast, you might also consider trying to multi-thread the computation to take advantage of all available cores. I did a pretty trivial rewrite of your naive loop to use OpenMP, giving this:

timer.restart();
sum = 0;

// only real change is adding the following line:
#pragma omp parallel for schedule(dynamic, 4096), reduction(+:sum)
for (int i = 0; i < num_samples; i++) {
    sum += samples[i];
}
result = timer.elapsed();
std::cout << "OMP:\t\t" << result << ", sum = " << sum << std::endl;

Just for grins, I also rewrote your unrolled loop a little to allow semi-arbitrary unrolling, and added OpenMP as well:

static const int unroll = 32;
timer.restart();
double total = 0.0;   // accumulate in double across the unrolled chunks
#pragma omp parallel for reduction(+:total) schedule(dynamic, 4096)
for (int i = 0; i < num_samples; i += unroll) {
    for (int j = 0; j < unroll; j++)
        total += samples[i + j];
}
result = timer.elapsed();
std::cout << "ILP+OMP:\t" << result << ", sum = " << total << std::endl;

I also increased the array size (substantially) to get somewhat more meaningful numbers. The results were as follows. First for a dual-core AMD:

rewrite of 4096 Mb takes 8269023193
naive:          3336194526, sum = 536870912
pointers:       3348790101, sum = 536870912
algorithm:      3293786903, sum = 536870912
ILP:            2713824079, sum = 536870912
OMP:            1885895124, sum = 536870912
ILP+OMP:        1618134382, sum = 536870912

Then for a quad-core (Intel i7):

rewrite of 4096 Mb takes 2415836465
naive:          1382962075, sum = 536870912
pointers:       1675826109, sum = 536870912
algorithm:      1748990122, sum = 536870912
ILP:            751649497, sum = 536870912
OMP:            575595251, sum = 536870912
ILP+OMP:        450832023, sum = 536870912

From the looks of things, the OpenMP versions are probably running into memory-bandwidth limits: they make more use of the CPU than the un-threaded versions, but still only reach around 70% or so, indicating that something other than the CPU is acting as the bottleneck.
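
As a rough back-of-the-envelope check (assuming the TSC ticks at something close to the nominal clock rate, say around 3 GHz, which is an assumption on my part rather than part of the measurements): the fastest i7 run touches 4096 MB in roughly 4.5e8 cycles, i.e. about 0.15 s, which works out to somewhere in the region of 25-30 GB/s, a figure in the ballpark of main-memory bandwidth rather than of how fast the cores can add.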

answered Nov 09 '22 by Jerry Coffin