I was testing algorithms and run into this weird behavior, when std::accumulate
is faster than a simple for
cycle.
Looking at the generated assembler I'm not much wiser :-) It seems that the for
cycle is optimized into MMX instructions, while accumulate expands into a loop.
This is the code. The behavior manifests with -O3
optimization level, gcc 4.7.1
#include <vector>
#include <chrono>
#include <iostream>
#include <random>
#include <algorithm>
using namespace std;
int main()
{
const size_t vsize = 100*1000*1000;
vector<int> x;
x.reserve(vsize);
mt19937 rng;
rng.seed(chrono::system_clock::to_time_t(chrono::system_clock::now()));
uniform_int_distribution<uint32_t> dist(0,10);
for (size_t i = 0; i < vsize; i++)
{
x.push_back(dist(rng));
}
long long tmp = 0;
for (size_t i = 0; i < vsize; i++)
{
tmp += x[i];
}
cout << "dry run " << tmp << endl;
auto start = chrono::high_resolution_clock::now();
long long suma = accumulate(x.begin(),x.end(),0);
auto end = chrono::high_resolution_clock::now();
cout << "Accumulate runtime " << chrono::duration_cast<chrono::nanoseconds>(end-start).count() << " - " << suma << endl;
start = chrono::high_resolution_clock::now();
suma = 0;
for (size_t i = 0; i < vsize; i++)
{
suma += x[i];
}
end = chrono::high_resolution_clock::now();
cout << "Manual sum runtime " << chrono::duration_cast<chrono::nanoseconds>(end-start).count() << " - " << suma << endl;
return 0;
}
As I expect, the manually unrolled loop is the fastest one, but more interesting is that std::accumulate is much slower than simple loop.
std::accumulate() is a built-in function in C++'s Standard Template Library. The function takes in a beginning iterator, an ending iterator, initial value, and (by default) computes the sum of the given initial value and the elements in the given range. The function can also be used for left folding.
When you pass the 0
to accumulate, you are making it accumulate using an int instead of a long long.
If you code your manual loop like this, it will be equivalent:
int sumb = 0;
for (size_t i = 0; i < vsize; i++)
{
sumb += x[i];
}
suma = sumb;
or you can call accumulate like this:
long long suma = accumulate(x.begin(),x.end(),0LL);
I have some different results using Visual Studio 2012
// original code
Accumulate runtime 93600 ms
Manual sum runtime 140400 ms
Note that the original std::accumulate
code isn't equivalent to the for
loop because the third parameter to std::accumulate
is an int
0 value. It performs the summation using an int
and only at the end stores the result in a long long
. Changing the third parameter to 0LL
forces the algorithm to use a long long
accumulator and results in the following times.
// change std::accumulate initial value -> 0LL
Accumulate runtime 265200 ms
Manual sum runtime 140400 ms
Since the final result fits in an int
I changed suma
and std::accumulate
back to using only int
values. After this change the MSVC 2012 compiler was able to auto-vectorize the for
loop and resulted in the following times.
// change suma from long long to int
Accumulate runtime 93600 ms
Manual sum runtime 46800 ms
After fixing the accumulate issue others noted I tested with both Visual Studio 2008 & 2010 and accumulate was indeed faster than the manual loop.
Looking at the disassembly I saw some additional iterator checking being done in the manual loop so I switched to just a raw array to eliminate it.
Here's what I ended up testing with:
#include <Windows.h>
#include <iostream>
#include <numeric>
#include <stdlib.h>
int main()
{
const size_t vsize = 100*1000*1000;
int* x = new int[vsize];
for (size_t i = 0; i < vsize; i++) x[i] = rand() % 1000;
LARGE_INTEGER start,stop;
long long suma = 0, sumb = 0, timea = 0, timeb = 0;
QueryPerformanceCounter( &start );
suma = std::accumulate(x, x + vsize, 0LL);
QueryPerformanceCounter( &stop );
timea = stop.QuadPart - start.QuadPart;
QueryPerformanceCounter( &start );
for (size_t i = 0; i < vsize; ++i) sumb += x[i];
QueryPerformanceCounter( &stop );
timeb = stop.QuadPart - start.QuadPart;
std::cout << "Accumulate: " << timea << " - " << suma << std::endl;
std::cout << " Loop: " << timeb << " - " << sumb << std::endl;
delete [] x;
return 0;
}
Accumulate: 633942 - 49678806711
Loop: 292642 - 49678806711
Using this code, the manual loop easily beats accumulate. The big difference is the compiler unrolled the manual loop 4 times, otherwise the generated code is almost identical.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With