Are STL algorithms optimized for speed?

Tags: c++, algorithm, stl

I was testing the speed of different ways to loop over a std::vector. In the code below, I consider 5 ways to calculate the sum of all elements of a vector of N = 10000000 elements:

1. using iterators
2. using integer indices
3. using integer indices, unrolling by a factor of 2
4. using integer indices, unrolling by a factor of 4
5. using std::accumulate

The code is compiled with g++ on Windows. The command line used to compile is:

g++ -std=c++11 -O3 loop.cpp -o loop.exe

I ran the code 4 times and measured the time of each method. I got the following results (time in microseconds; the minimum and maximum over the runs are given):

Iterators: 8002 - 8007
Int indices: 8004 - 9003
Unroll 2: 6004 - 7005
Unroll 4: 4001 - 5004
accumulate: 8005 - 9007

What these experiments seem to indicate is:

1. Looping with iterators vs. integer indices does not make much difference, at least with full optimization.
2. Unrolling the loop pays off.
3. Surprisingly, std::accumulate gives the worst performance.

While conclusions 1 and 2 were somewhat expected, conclusion 3 is quite surprising. Don't all the books say to use the STL algorithms instead of writing loops by hand?

Am I making a mistake in the way I measure the time, or in the way I interpret the results? Do you get different results if you try the code given below?

#include <iostream>
#include <chrono>
#include <vector>
#include <numeric>

using namespace std;
using namespace std::chrono;



int main()
{
    const int N = 10000000;
    vector<int> v(N);
    for (int i = 0; i<N; ++i)
        v[i] = i;

    //looping with iterators
    {
        high_resolution_clock::time_point t1 = high_resolution_clock::now();

        long long int sum = 0;
        for (auto it = v.begin(); it != v.end(); ++it)
            sum+=*it;

        high_resolution_clock::time_point t2 = high_resolution_clock::now();

        auto duration = std::chrono::duration_cast<std::chrono::microseconds>( t2 - t1 ).count();

        cout << duration << " microseconds  output = " << sum << " (Iterators)\n";
    }

    //looping with integers
    {
        high_resolution_clock::time_point t1 = high_resolution_clock::now();

        long long int sum = 0;
        for (int i = 0; i<N; ++i)
            sum+=v[i];

        high_resolution_clock::time_point t2 = high_resolution_clock::now();

        auto duration = std::chrono::duration_cast<std::chrono::microseconds>( t2 - t1 ).count();

        cout << duration << " microseconds  output = " << sum << " (integer index)\n";
    }

    //looping with integers (UNROLL 2); reads v[i+1], so this assumes N is a multiple of 2
    {
        high_resolution_clock::time_point t1 = high_resolution_clock::now();

        long long int sum = 0;
        for (int i = 0; i<N; i+=2)
            sum+=v[i]+v[i+1];

        high_resolution_clock::time_point t2 = high_resolution_clock::now();

        auto duration = std::chrono::duration_cast<std::chrono::microseconds>( t2 - t1 ).count();

        cout << duration << " microseconds  output = " << sum << " (integer index, UNROLL 2)\n";
    }

    //looping with integers (UNROLL 4); reads up to v[i+3], so this assumes N is a multiple of 4
    {
        high_resolution_clock::time_point t1 = high_resolution_clock::now();

        long long int sum = 0;
        for (int i = 0; i<N; i+=4)
            sum+=v[i]+v[i+1]+v[i+2]+v[i+3];

        high_resolution_clock::time_point t2 = high_resolution_clock::now();

        auto duration = std::chrono::duration_cast<std::chrono::microseconds>( t2 - t1 ).count();

        cout << duration << " microseconds  output = " << sum << " (integer index, UNROLL 4)\n";
    }

    //using std::accumulate
    {
        high_resolution_clock::time_point t1 = high_resolution_clock::now();

        long long int sum = accumulate(v.begin(), v.end(), static_cast<long long int>(0));

        high_resolution_clock::time_point t2 = high_resolution_clock::now();

        auto duration = std::chrono::duration_cast<std::chrono::microseconds>( t2 - t1 ).count();

        cout << duration << " microseconds  output = " << sum << " (std::accumulate)\n";
    }
    return 0;
}

asked Mar 17 '15 23:03

Giuseppe

1 Answer

The reason for using the standard library algorithms is not to get better efficiency, it is to allow you to think at a higher level of abstraction.
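
As a small illustration of that point, here is a minimal sketch. The helper names total, product and largest are made up for this example. Each body is a single named algorithm call, so the intent can be read off without following any loop mechanics.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <numeric>
#include <vector>

// Hypothetical helper names, chosen only for this illustration.

// Sum of all elements; the 0LL initial value makes the accumulator a long long.
long long total(const std::vector<int>& v) {
    return std::accumulate(v.begin(), v.end(), 0LL);
}

// Product of all elements, using the overload that takes a custom operation.
long long product(const std::vector<int>& v) {
    return std::accumulate(v.begin(), v.end(), 1LL, std::multiplies<long long>());
}

// Largest element; the caller must pass a non-empty vector.
int largest(const std::vector<int>& v) {
    assert(!v.empty());
    return *std::max_element(v.begin(), v.end());
}
```

Whether any of these is faster or slower than a hand-written loop is a separate question that only a measurement on your own compiler and data can answer.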

While there might be some cases where the algorithm will be faster than your own hand-rolled code, that's not what they're there for. One of the great advantages of C++ is that it allows you to bypass the built-in libraries when you have a specific need. If your benchmarking has shown that the standard library is causing a critical slowdown, you are free to explore classic alternatives such as loop unrolling. For most purposes that will never be necessary.
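
If you do go down the unrolling route, a sketch along these lines may be a reasonable starting point. The names sum_unrolled4 and check_unrolled are hypothetical. Two things differ from the loops in the question and are my own choices: a tail loop handles sizes that are not a multiple of 4, and four independent accumulators are used instead of one. Whether this beats the plain loop depends on the compiler, flags and hardware, so treat it as something to benchmark rather than a guaranteed win.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper: sums v four elements per iteration.
long long sum_unrolled4(const std::vector<int>& v) {
    const std::size_t n = v.size();
    const std::size_t limit = n - n % 4;  // largest multiple of 4 not exceeding n

    // Four independent partial sums (an assumption of this sketch,
    // not something the question's code does).
    long long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (std::size_t i = 0; i < limit; i += 4) {
        s0 += v[i];
        s1 += v[i + 1];
        s2 += v[i + 2];
        s3 += v[i + 3];
    }

    // Tail loop: picks up the 0 to 3 elements left over.
    long long sum = s0 + s1 + s2 + s3;
    for (std::size_t i = limit; i < n; ++i)
        sum += v[i];
    return sum;
}

// Sanity check: the hand-rolled version must agree with std::accumulate.
void check_unrolled(const std::vector<int>& v) {
    assert(sum_unrolled4(v) == std::accumulate(v.begin(), v.end(), 0LL));
}
```

Checking the result against std::accumulate before timing anything is cheap insurance: an unrolled loop that reads past the end or drops the tail is easy to write by accident.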

With that said, a well-written standard library algorithm will never be horribly slower than your own straightforward implementation, unless you're taking advantage of knowledge of the specifics of your data.
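
To see why, it helps to look at what accumulate has to do. The following is a simplified sketch, not copied from any real library (accumulate_sketch, sum_by_hand and check_same are names invented here). It follows the behaviour the standard describes, which is to apply acc = acc + *i to each element in order. Real implementations may differ in detail.

```cpp
#include <cassert>
#include <vector>

// Simplified sketch of a typical accumulate; an illustration only,
// not the source of any actual standard library.
template <class InputIt, class T>
T accumulate_sketch(InputIt first, InputIt last, T init) {
    for (; first != last; ++first)
        init = init + *first;
    return init;
}

// The straightforward iterator loop from the question, wrapped in a function.
long long sum_by_hand(const std::vector<int>& v) {
    long long sum = 0;
    for (auto it = v.begin(); it != v.end(); ++it)
        sum += *it;
    return sum;
}

// Both versions walk the range once and add each element to a long long.
void check_same(const std::vector<int>& v) {
    assert(accumulate_sketch(v.begin(), v.end(), 0LL) == sum_by_hand(v));
}
```

Once the template is instantiated and inlined, the optimizer sees essentially the same loop as the hand-written iterator version, which is consistent with the near-identical timings reported in the question for iterators, indices and std::accumulate.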

answered Oct 15 '22 01:10

Mark Ransom
