I've been inspired the 2015 code::dive conference "Writing Fast Code" talk by Andrei Alexandrescu: https://www.youtube.com/watch?v=vrfYLlR8X8k
I took the llvm std::all_of code and benchmarked it against a direct call to std::all_of(). I used google benchmark, and a range of 0 - 65535 elements (that all satisfy the predicate):
#include <benchmark/benchmark.h>
#include <algorithm>
// LLVM std::all_of copied from https://github.com/llvm-mirror/libcxx/blob/master/include/algorithm. Benchmarking this against a direct call to std::all_of().
template <class _InputIterator, class _Predicate>
inline bool
all_of(_InputIterator __first, _InputIterator __last, _Predicate __pred)
{
for (; __first != __last; ++__first)
if (!__pred(*__first))
return false;
return true;
}
static bool equals_42(int i)
{
return i == 42;
}
// Benchmark for ::all_of (above)
static void BM_all_of(benchmark::State &state)
{
std::vector<int> v(state.range(0), 42);
for (auto _ : state)
benchmark::DoNotOptimize(::all_of(v.begin(), v.end(), equals_42));
}
// Register the function as a benchmark
BENCHMARK(BM_all_of)->Range(0, 65535);
// Benchmark for std::all_of
static void BM_std_all_of(benchmark::State &state)
{
std::vector<int> v(state.range(0), 42);
for (auto _ : state)
// Note had to use DoNotOptimize because clang++ will optimize std::all_of away otherwise. (I could recreate this with above code by making all_of static or adding __attribute__((internal_linkage))).
// Note #2: For g++ the effect from DoNotOptimize was marginal, but I left it in to compare apples to apples.
benchmark::DoNotOptimize(std::all_of(v.begin(), v.end(), equals_42));
}
BENCHMARK(BM_std_all_of)->Range(0, 65535);
BENCHMARK_MAIN();
Benchmarking with clang++:
$ clang++ -Ofast benchmark/bm_algorithm.cpp -isystem ../../benchmark/include -L ../../benchmark/build/src -lbenchmark -lpthread -o bm; ./bm
2021-07-07T15:55:08-07:00
Running ./bm
Run on (16 X 3194.05 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x8)
L1 Instruction 64 KiB (x8)
L2 Unified 512 KiB (x8)
L3 Unified 8192 KiB (x2)
Load Average: 0.16, 0.15, 0.12
--------------------------------------------------------------
Benchmark Time CPU Iterations
--------------------------------------------------------------
BM_all_of/0 0.248 ns 0.248 ns 1000000000
BM_all_of/1 0.510 ns 0.510 ns 1000000000
BM_all_of/8 2.76 ns 2.76 ns 250915933
BM_all_of/64 28.1 ns 28.1 ns 25130057
BM_all_of/512 136 ns 136 ns 5115542
BM_all_of/4096 1084 ns 1084 ns 678951
BM_all_of/32768 8257 ns 8257 ns 84044
BM_all_of/65535 16655 ns 16655 ns 41223
BM_std_all_of/0 0.520 ns 0.520 ns 1000000000
BM_std_all_of/1 0.516 ns 0.516 ns 1000000000
BM_std_all_of/8 2.37 ns 2.37 ns 290607282
BM_std_all_of/64 13.2 ns 13.2 ns 48996825
BM_std_all_of/512 104 ns 104 ns 6693344
BM_std_all_of/4096 779 ns 779 ns 899729
BM_std_all_of/32768 6205 ns 6205 ns 112868
BM_std_all_of/65535 12387 ns 12387 ns 56428
Benchmarking with g++:
$ g++ -Ofast benchmark/bm_algorithm.cpp -isystem ../../benchmark/include -L ../../ben
chmark/build_gcc/src -lbenchmark -lpthread -o bm; ./bm
2021-07-07T16:05:38-07:00
Running ./bm
Run on (16 X 3194.05 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x8)
L1 Instruction 64 KiB (x8)
L2 Unified 512 KiB (x8)
L3 Unified 8192 KiB (x2)
Load Average: 0.00, 0.02, 0.06
--------------------------------------------------------------
Benchmark Time CPU Iterations
--------------------------------------------------------------
BM_all_of/0 0.749 ns 0.749 ns 927807313
BM_all_of/1 1.25 ns 1.25 ns 562992870
BM_all_of/8 4.09 ns 4.09 ns 172825424
BM_all_of/64 28.9 ns 28.9 ns 24147840
BM_all_of/512 139 ns 139 ns 5133907
BM_all_of/4096 1074 ns 1074 ns 677743
BM_all_of/32768 8457 ns 8457 ns 84469
BM_all_of/65535 16650 ns 16650 ns 36781
BM_std_all_of/0 2.32 ns 2.32 ns 304514515
BM_std_all_of/1 3.67 ns 3.67 ns 196181853
BM_std_all_of/8 11.0 ns 11.0 ns 63325378
BM_std_all_of/64 75.1 ns 75.1 ns 9310900
BM_std_all_of/512 550 ns 550 ns 1266670
BM_std_all_of/4096 4351 ns 4351 ns 159969
BM_std_all_of/32768 34867 ns 34867 ns 20133
BM_std_all_of/65535 69588 ns 69589 ns 10025
I expected the code to run the same speed on each compiler. Instead:
Anyone know why these execution times might be different? I'm new to benchmarking this kind of thing.
As the comments above say, you're comparing the libstdc++ version of all_of
with the libc++ version which you copied into your code, and the libstdc++ version is faster.
As for why it's faster: std::all_of
calls std::find_if_not
, which calls __find_if_not
. This function negates the predicate and calls __find_if
, passing it the iterator category as an additional parameter.
__find_if
has two overloads: One for input iterators, which is similar to the implementation you showed above. The other is for random access iterators. This overload can calculate the length of the range, which allows it to unroll the loop and perform four predicate calls in each iteration. This is the reason it can work faster on random-access containers such as std::vector
.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With