I was testing the speed of different ways loop on a std::vector. In the code below, I consider 5 ways to calculate the sum of all elements of a vector of N = 10000000 elements:
The code is compiled with g++ for windows, the command line used to compile is:
g++ -std=c++11 -O3 loop.cpp -o loop.exe
I ran the code 4 times, measuring the time of each method, I get the following results (time in microseconds, max and min are given):
What these experiments seem to indicate is:
Looping with iterators vs integer indices does not make much difference, at least with full optimization.
Unrolling the loop pays off
Surprisingly, the stl::accumulate gives the worse performance.
While the conclusions 1 and 2 were somewat expected, the number 3 is quite surprising. Don't all books say to use the STL algorithms instead of writing loops by myself?
Am I making any mistake in the way I am measuring the time, or in the way I interprete the results? Do you guys get a different scenario in case you try out the code given below?
#include <iostream>
#include <chrono>
#include <vector>
#include <numeric>
using namespace std;
using namespace std::chrono;
int main()
{
const int N = 10000000;
vector<int> v(N);
for (int i = 0; i<N; ++i)
v[i] = i;
//looping with iterators
{
high_resolution_clock::time_point t1 = high_resolution_clock::now();
long long int sum = 0;
for (auto it = v.begin(); it != v.end(); ++it)
sum+=*it;
high_resolution_clock::time_point t2 = high_resolution_clock::now();
auto duration = std::chrono::duration_cast<std::chrono::microseconds>( t2 - t1 ).count();
cout << duration << "microseconds output = " << sum << " (Iterators)\n";
}
//looping with integers
{
high_resolution_clock::time_point t1 = high_resolution_clock::now();
long long int sum = 0;
for (int i = 0; i<N; ++i)
sum+=v[i];
high_resolution_clock::time_point t2 = high_resolution_clock::now();
auto duration = std::chrono::duration_cast<std::chrono::microseconds>( t2 - t1 ).count();
cout << duration << "microseconds output = " << sum << " (integer index)\n";
}
//looping with integers (UNROLL 2)
{
high_resolution_clock::time_point t1 = high_resolution_clock::now();
long long int sum = 0;
for (int i = 0; i<N; i+=2)
sum+=v[i]+v[i+1];
high_resolution_clock::time_point t2 = high_resolution_clock::now();
auto duration = std::chrono::duration_cast<std::chrono::microseconds>( t2 - t1 ).count();
cout << duration << "microseconds output = " << sum << " (integer index, UNROLL 2)\n";
}
//looping with integers (UNROLL 4)
{
high_resolution_clock::time_point t1 = high_resolution_clock::now();
long long int sum = 0;
for (int i = 0; i<N; i+=4)
sum+=v[i]+v[i+1]+v[i+2]+v[i+3];
high_resolution_clock::time_point t2 = high_resolution_clock::now();
auto duration = std::chrono::duration_cast<std::chrono::microseconds>( t2 - t1 ).count();
cout << duration << "microseconds output = " << sum << " (integer index, UNROLL 4)\n";
}
//using std::accumulate
{
high_resolution_clock::time_point t1 = high_resolution_clock::now();
long long int sum = accumulate(v.begin(), v.end(), static_cast<long long int>(0));
high_resolution_clock::time_point t2 = high_resolution_clock::now();
auto duration = std::chrono::duration_cast<std::chrono::microseconds>( t2 - t1 ).count();
cout << duration << "microseconds output = " << sum << " (std::accumulate)\n";
}
return 0;
}
STL is usually faster at runtime than either C-style solutions with callback pointers or polymorphism-based solutions with virtual methods (see also this Bjarne Stroustrup's keynote).
How STL algorithms are different from the conventional algorithms? Algorithms are functions that can be used generally across a variety of containers for processing their contents. Conventional algorithms are very lengthy and complex, where STL algorithms are very easy and short that can save a lot of time and effort.
Looping Performance in C++ Today I was testing the performance of a piece of code, which is basically accessing each element in a container within a for loop. But the result is quite shocking because I found the std::for_each version is 10 times faster than the raw loop.
List comprehension with a separate transform() function is around 17% slower than the initial "for loop"-based version (224/191≈1.173). But it's much more readable, so I prefer it over the other solutions.
The reason for using the standard library algorithms is not to get better efficiency, it is to allow you to think at a higher level of abstraction.
While there might be some cases where the algorithm will be faster than your own hand-rolled code, that's not what they're there for. One of the great advantages of C++ is that it allows you to bypass the built-in libraries when you have a specific need. If your benchmarking has shown that the standard library is causing a critical slowdown, you are free to explore classic alternatives such as loop unrolling. For most purposes that will never be necessary.
With that said, a well written standard library algorithm will never be horribly slower than your own straight-forward implementation, unless you're taking advantage of knowledge of the specifics of your data.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With