Why is std::none_of faster than a hand rolled loop?

Tags:

c++

stl-algorithm

I benchmarked the performance of std::none_of against three different manual implementations using i) a for loop, ii) a range-based for loop and iii) iterators. To my surprise, I found that while all three manual implementations take roughly the same time, std::none_of is significantly faster. My question is: why is this the case?

I used the Google benchmark library and compiled with -std=c++14 -O3. When running the test, I restricted the affinity of the process to a single processor. I get the following result using GCC 6.2:

Benchmark                  Time           CPU Iterations
--------------------------------------------------------
benchmarkSTL           28813 ns      28780 ns      24283
benchmarkManual        46203 ns      46191 ns      15063
benchmarkRange         48368 ns      48243 ns      16245
benchmarkIterator      44732 ns      44710 ns      15698

On Clang 3.9, std::none_of is also faster than the manual for loop, though the speed difference is smaller. Here is the test code (for brevity, only the manual for loop is included):

#include <algorithm>
#include <array>
#include <benchmark/benchmark.h>
#include <cassert>   // assert
#include <cstddef>   // size_t
#include <functional>
#include <random>

const size_t N = 100000;
const unsigned value = 31415926;

template<size_t N>
std::array<unsigned, N> generateData() {
    std::mt19937 randomEngine(0);
    std::array<unsigned, N> data;
    std::generate(data.begin(), data.end(), randomEngine);
    return data;
}

void benchmarkSTL(benchmark::State & state) {
    auto data = generateData<N>();
    while (state.KeepRunning()) {
        bool result = std::none_of(
            data.begin(),
            data.end(),
            std::bind(std::equal_to<unsigned>(), std::placeholders::_1, value));
        assert(result);
    }
}

void benchmarkManual(benchmark::State & state) {
    auto data = generateData<N>();
    while (state.KeepRunning()) {
        bool result = true;
        for (size_t i = 0; i < N; i++) {
            if (data[i] == value) {
                result = false;
                break;
            }
        }
        assert(result);
    }
}

BENCHMARK(benchmarkSTL);
BENCHMARK(benchmarkManual);

BENCHMARK_MAIN();

Note that generating the data using a random number generator is irrelevant. I get the same result when just setting the i-th element to i and checking if the value N + 1 is contained.
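
For reference, the deterministic data set and the two loop variants left out above might look roughly like the following. This is a sketch, not the code that was benchmarked: makeSequentialData, noneEqualRange and noneEqualIterator are hypothetical names, and the loops are written as plain functions (returning early instead of setting result and breaking) so that they can be called from inside a while (state.KeepRunning()) body.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <numeric>

// Hypothetical helper: the i-th element is set to i, as described above.
template <std::size_t Size>
std::array<unsigned, Size> makeSequentialData() {
    std::array<unsigned, Size> data;
    std::iota(data.begin(), data.end(), 0u);
    return data;
}

// Assumed shape of the range-based for variant (not the original code).
template <std::size_t Size>
bool noneEqualRange(const std::array<unsigned, Size>& data, unsigned value) {
    for (unsigned element : data) {
        if (element == value) {
            return false;
        }
    }
    return true;
}

// Assumed shape of the iterator variant (not the original code).
template <std::size_t Size>
bool noneEqualIterator(const std::array<unsigned, Size>& data, unsigned value) {
    for (auto it = data.begin(); it != data.end(); ++it) {
        if (*it == value) {
            return false;
        }
    }
    return true;
}
```

With data = makeSequentialData<N>(), searching for N + 1 forces every variant to scan the whole array, which matches the worst case measured in the benchmark.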


asked Oct 02 '16 17:10

LocalVolatility

1 Answer

After some more investigation, I will try to answer my own question. As suggested by Kerrek SB, I looked at the generated assembly code. The bottom line seems to be that GCC 6.2 does a much better job at unrolling the loop implicit in std::none_of compared to the other three versions.

GCC 6.2:

std::none_of is unrolled 4 times -> ~30µs
manual for, range for and iterator are not being unrolled at all -> ~45µs

As suggested by Corristo, the result is compiler-dependent, which makes perfect sense. Clang 3.9 unrolls all but the range for loop, though to varying degrees.

Clang 3.9:

std::none_of is unrolled 8 times -> ~30µs
manual for is unrolled 5 times -> ~35µs
range for is not being unrolled at all -> ~60µs
iterator is unrolled 8 times -> ~28µs

All code was compiled with -std=c++14 -O3.
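
To make "unrolled 4 times" concrete, here is a hand-written sketch of what such a loop looks like at the source level. It is an illustration only, not the code that either compiler generates, and noneEqualUnrolled4 is a hypothetical name. The main loop tests four elements per iteration, so the loop-bound check and the index increment are paid once per four elements instead of once per element; a short tail loop handles the remaining 0 to 3 elements.

```cpp
#include <cassert>
#include <cstddef>

// Illustrative 4-way unrolled linear search (hypothetical helper).
// Returns true if no element in data[0..size) equals value.
bool noneEqualUnrolled4(const unsigned* data, std::size_t size, unsigned value) {
    std::size_t i = 0;

    // Main loop: four comparisons per loop-condition check.
    for (; i + 4 <= size; i += 4) {
        if (data[i] == value) return false;
        if (data[i + 1] == value) return false;
        if (data[i + 2] == value) return false;
        if (data[i + 3] == value) return false;
    }

    // Tail loop: the leftover elements when size is not a multiple of 4.
    for (; i < size; ++i) {
        if (data[i] == value) return false;
    }
    return true;
}
```

Whether a compiler applies this transformation on its own depends on its heuristics and on how the loop is written, which is consistent with the differences between GCC 6.2 and Clang 3.9 listed above.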

answered Sep 29 '22 18:09

LocalVolatility
