Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why is accumulate faster than a simple for cycle?

I was testing algorithms and run into this weird behavior, when std::accumulate is faster than a simple for cycle.

Looking at the generated assembler I'm not much wiser :-) It seems that the for cycle is optimized into MMX instructions, while accumulate expands into a loop.

This is the code. The behavior manifests with -O3 optimization level, gcc 4.7.1

#include <vector>                                                                                                                                                                                                                                                              
#include <chrono>                                                                                                                                                                                                                                                              
#include <iostream>                                                                                                                                                                                                                                                            
#include <random>                                                                                                                                                                                                                                                              
#include <algorithm>                                                                                                                                                                                                                                                           
using namespace std;                                                                                                                                                                                                                                                           

int main()                                                                                                                                                                                                                                                                     
{                                                                                                                                                                                                                                                                              
    const size_t vsize = 100*1000*1000;                                                                                                                                                                                                                                        

    vector<int> x;
    x.reserve(vsize);

    mt19937 rng;
    rng.seed(chrono::system_clock::to_time_t(chrono::system_clock::now()));

    uniform_int_distribution<uint32_t> dist(0,10);

    for (size_t i = 0; i < vsize; i++)
    {
        x.push_back(dist(rng));
    }

    long long tmp = 0;
    for (size_t i = 0; i < vsize; i++)
    {
        tmp += x[i];
    }

    cout << "dry run " << tmp << endl;

    auto start = chrono::high_resolution_clock::now();
    long long suma = accumulate(x.begin(),x.end(),0);
    auto end = chrono::high_resolution_clock::now();

    cout << "Accumulate runtime " << chrono::duration_cast<chrono::nanoseconds>(end-start).count() << " - " << suma << endl;

    start = chrono::high_resolution_clock::now();
    suma = 0;
    for (size_t i = 0; i < vsize; i++)
    {
        suma += x[i];
    }
    end = chrono::high_resolution_clock::now();

    cout << "Manual sum runtime " << chrono::duration_cast<chrono::nanoseconds>(end-start).count() << " - " << suma <<  endl;

    return 0;
}
like image 849
Šimon Tóth Avatar asked Nov 06 '12 01:11

Šimon Tóth


People also ask

Is std :: accumulate slow?

As I expect, the manually unrolled loop is the fastest one, but more interesting is that std::accumulate is much slower than simple loop.

Why use accumulate in c++?

std::accumulate() is a built-in function in C++'s Standard Template Library. The function takes in a beginning iterator, an ending iterator, initial value, and (by default) computes the sum of the given initial value and the elements in the given range. The function can also​ be used for left folding.


3 Answers

When you pass the 0 to accumulate, you are making it accumulate using an int instead of a long long.

If you code your manual loop like this, it will be equivalent:

int sumb = 0;
for (size_t i = 0; i < vsize; i++)
{
    sumb += x[i];
}
suma = sumb;

or you can call accumulate like this:

long long suma = accumulate(x.begin(),x.end(),0LL);
like image 164
Vaughn Cato Avatar answered Nov 03 '22 00:11

Vaughn Cato


I have some different results using Visual Studio 2012

// original code
Accumulate runtime 93600 ms
Manual sum runtime 140400 ms

Note that the original std::accumulate code isn't equivalent to the for loop because the third parameter to std::accumulate is an int 0 value. It performs the summation using an int and only at the end stores the result in a long long. Changing the third parameter to 0LL forces the algorithm to use a long long accumulator and results in the following times.

// change std::accumulate initial value -> 0LL
Accumulate runtime 265200 ms
Manual sum runtime 140400 ms

Since the final result fits in an int I changed suma and std::accumulate back to using only int values. After this change the MSVC 2012 compiler was able to auto-vectorize the for loop and resulted in the following times.

// change suma from long long to int
Accumulate runtime 93600 ms
Manual sum runtime 46800 ms
like image 27
Blastfurnace Avatar answered Nov 03 '22 00:11

Blastfurnace


After fixing the accumulate issue others noted I tested with both Visual Studio 2008 & 2010 and accumulate was indeed faster than the manual loop.

Looking at the disassembly I saw some additional iterator checking being done in the manual loop so I switched to just a raw array to eliminate it.

Here's what I ended up testing with:

#include <Windows.h>
#include <iostream>
#include <numeric>
#include <stdlib.h>

int main() 
{
    const size_t vsize = 100*1000*1000;                                                                                                                                                                                                                                        
    int* x = new int[vsize];

    for (size_t i = 0; i < vsize; i++) x[i] = rand() % 1000;

    LARGE_INTEGER start,stop;
    long long suma = 0, sumb = 0, timea = 0, timeb = 0;

    QueryPerformanceCounter( &start );
    suma = std::accumulate(x, x + vsize, 0LL);
    QueryPerformanceCounter( &stop );
    timea = stop.QuadPart - start.QuadPart;

    QueryPerformanceCounter( &start );
    for (size_t i = 0; i < vsize; ++i) sumb += x[i];
    QueryPerformanceCounter( &stop );
    timeb = stop.QuadPart - start.QuadPart;

    std::cout << "Accumulate: " << timea << " - " << suma << std::endl;
    std::cout << "      Loop: " << timeb << " - " << sumb << std::endl;

    delete [] x;
    return 0;
}

Accumulate: 633942 - 49678806711
      Loop: 292642 - 49678806711

Using this code, the manual loop easily beats accumulate. The big difference is the compiler unrolled the manual loop 4 times, otherwise the generated code is almost identical.

like image 37
Retired Ninja Avatar answered Nov 03 '22 00:11

Retired Ninja