Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why is std::none_of faster than a hand rolled loop?

I benchmarked the performance of std::none_of against a three different manual implementations using i) a for loop, ii) a range-based for loop and iii) iterators. To my surprise, I found that while all three manual implementations take roughly the same time, std::none_of is significantly faster. My question is - why is this the case?

I used the Google benchmark library and compiled with -std=c++14 -O3. When running the test, I restricted the affinity of the process to a single processor. I get the following result using GCC 6.2:

Benchmark                  Time           CPU Iterations
--------------------------------------------------------
benchmarkSTL           28813 ns      28780 ns      24283
benchmarkManual        46203 ns      46191 ns      15063
benchmarkRange         48368 ns      48243 ns      16245
benchmarkIterator      44732 ns      44710 ns      15698

On Clang 3.9, std::none_of is also faster than the manual for loop though the speed difference is smaller. Here is the test code (only including the manual for loop for brevity):

#include <algorithm>
#include <array>
#include <benchmark/benchmark.h>
#include <functional>
#include <random>

const size_t N = 100000;
const unsigned value = 31415926;

template<size_t N>
std::array<unsigned, N> generateData() {
    std::mt19937 randomEngine(0);
    std::array<unsigned, N> data;
    std::generate(data.begin(), data.end(), randomEngine);
    return data;
}

void benchmarkSTL(benchmark::State & state) {
    auto data = generateData<N>();
    while (state.KeepRunning()) {
        bool result = std::none_of(
            data.begin(),
            data.end(),
            std::bind(std::equal_to<unsigned>(), std::placeholders::_1, value));
        assert(result);
    }
}

void benchmarkManual(benchmark::State & state) {
    auto data = generateData<N>();
    while (state.KeepRunning()) {
        bool result = true;
        for (size_t i = 0; i < N; i++) {
            if (data[i] == value) {
                result = false;
                break;
            }
        }
        assert(result);
    }
}

BENCHMARK(benchmarkSTL);
BENCHMARK(benchmarkManual);

BENCHMARK_MAIN();

Note that generating the data using a random number generator is irrelevant. I get the same result when just setting the i-th element to i and checking if the value N + 1 is contained.

like image 827
LocalVolatility Avatar asked Oct 02 '16 17:10

LocalVolatility


People also ask

Is std :: Find faster than a for loop?

Looping Performance in C++ Today I was testing the performance of a piece of code, which is basically accessing each element in a container within a for loop. But the result is quite shocking because I found the std::for_each version is 10 times faster than the raw loop.

Is transform faster than for loop?

List comprehension with a separate transform() function is around 17% slower than the initial “for loop”-based version (224/191≈1.173).

Why use std algorithm?

STL algorithms are a fantastic set of tool to improve expressiveness and correctness of your code.


1 Answers

After some more investigation, I will try to answer my own question. As suggested by Kerrek SB, I looked at the generated assembly code. The bottom line seems to be that GCC 6.2 does a much better job at unrolling the loop implicit in std::none_of compared to the other three versions.

GCC 6.2:

  • std::none_of is unrolled 4 times -> ~30µs
  • manual for, range for and iterator are not being unrolled at all -> ~45µs

As suggested by Corristo, the result is compiler dependend - which makes perfect sense. Clang 3.9 unrolls all but the range for loop, though to varying degrees.

Clang 3.9

  • `std::none_of' is unrolled 8 times -> ~30µs
  • manual for is unrolled 5 times -> ~35µs
  • range for is not being unrolled at all -> ~60µs
  • iterator is unrolled 8 times -> ~28µs

All code was compiled with -std=c++14 -O3.

like image 89
LocalVolatility Avatar answered Sep 29 '22 18:09

LocalVolatility