I benchmarked the performance of std::none_of
against a three different manual implementations using i) a for
loop, ii) a range-based for
loop and iii) iterators. To my surprise, I found that while all three manual implementations take roughly the same time, std::none_of
is significantly faster. My question is - why is this the case?
I used the Google benchmark library and compiled with -std=c++14 -O3
. When running the test, I restricted the affinity of the process to a single processor. I get the following result using GCC 6.2:
Benchmark Time CPU Iterations
--------------------------------------------------------
benchmarkSTL 28813 ns 28780 ns 24283
benchmarkManual 46203 ns 46191 ns 15063
benchmarkRange 48368 ns 48243 ns 16245
benchmarkIterator 44732 ns 44710 ns 15698
On Clang 3.9, std::none_of
is also faster than the manual for
loop though the speed difference is smaller. Here is the test code (only including the manual for loop for brevity):
#include <algorithm>
#include <array>
#include <benchmark/benchmark.h>
#include <functional>
#include <random>
const size_t N = 100000;
const unsigned value = 31415926;
template<size_t N>
std::array<unsigned, N> generateData() {
std::mt19937 randomEngine(0);
std::array<unsigned, N> data;
std::generate(data.begin(), data.end(), randomEngine);
return data;
}
void benchmarkSTL(benchmark::State & state) {
auto data = generateData<N>();
while (state.KeepRunning()) {
bool result = std::none_of(
data.begin(),
data.end(),
std::bind(std::equal_to<unsigned>(), std::placeholders::_1, value));
assert(result);
}
}
void benchmarkManual(benchmark::State & state) {
auto data = generateData<N>();
while (state.KeepRunning()) {
bool result = true;
for (size_t i = 0; i < N; i++) {
if (data[i] == value) {
result = false;
break;
}
}
assert(result);
}
}
BENCHMARK(benchmarkSTL);
BENCHMARK(benchmarkManual);
BENCHMARK_MAIN();
Note that generating the data using a random number generator is irrelevant. I get the same result when just setting the i
-th element to i
and checking if the value N + 1
is contained.
Looping Performance in C++ Today I was testing the performance of a piece of code, which is basically accessing each element in a container within a for loop. But the result is quite shocking because I found the std::for_each version is 10 times faster than the raw loop.
List comprehension with a separate transform() function is around 17% slower than the initial “for loop”-based version (224/191≈1.173).
STL algorithms are a fantastic set of tool to improve expressiveness and correctness of your code.
After some more investigation, I will try to answer my own question. As suggested by Kerrek SB, I looked at the generated assembly code. The bottom line seems to be that GCC 6.2 does a much better job at unrolling the loop implicit in std::none_of
compared to the other three versions.
GCC 6.2:
std::none_of
is unrolled 4 times -> ~30µsfor
, range for
and iterator are not being unrolled at all -> ~45µsAs suggested by Corristo, the result is compiler dependend - which makes perfect sense. Clang 3.9 unrolls all but the range for
loop, though to varying degrees.
Clang 3.9
for
is unrolled 5 times -> ~35µsfor
is not being unrolled at all -> ~60µsAll code was compiled with -std=c++14 -O3
.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With