Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why does coded version of std::all_of() benchmark differently than call to std::all_of()?

Tags:

I've been inspired the 2015 code::dive conference "Writing Fast Code" talk by Andrei Alexandrescu: https://www.youtube.com/watch?v=vrfYLlR8X8k

I took the llvm std::all_of code and benchmarked it against a direct call to std::all_of(). I used google benchmark, and a range of 0 - 65535 elements (that all satisfy the predicate):

#include <benchmark/benchmark.h>

#include <algorithm>

// LLVM std::all_of copied from https://github.com/llvm-mirror/libcxx/blob/master/include/algorithm. Benchmarking this against a direct call to std::all_of().
template <class _InputIterator, class _Predicate>
inline bool
all_of(_InputIterator __first, _InputIterator __last, _Predicate __pred)
{
  for (; __first != __last; ++__first)
    if (!__pred(*__first))
      return false;
  return true;
}

static bool equals_42(int i)
{
    return i == 42;
}

// Benchmark for ::all_of (above)
static void BM_all_of(benchmark::State &state)
{
  std::vector<int> v(state.range(0), 42);
  for (auto _ : state)
    benchmark::DoNotOptimize(::all_of(v.begin(), v.end(), equals_42));
}
// Register the function as a benchmark
BENCHMARK(BM_all_of)->Range(0, 65535);

// Benchmark for std::all_of
static void BM_std_all_of(benchmark::State &state)
{
  std::vector<int> v(state.range(0), 42);
  for (auto _ : state)
    // Note had to use DoNotOptimize because clang++ will optimize std::all_of away otherwise. (I could recreate this with above code by making all_of static or adding __attribute__((internal_linkage))).
    // Note #2: For g++ the effect from DoNotOptimize was marginal, but I left it in to compare apples to apples.
    benchmark::DoNotOptimize(std::all_of(v.begin(), v.end(), equals_42));
}
BENCHMARK(BM_std_all_of)->Range(0, 65535);

BENCHMARK_MAIN();

Benchmarking with clang++:

$ clang++ -Ofast benchmark/bm_algorithm.cpp -isystem ../../benchmark/include -L ../../benchmark/build/src -lbenchmark -lpthread -o bm; ./bm
2021-07-07T15:55:08-07:00
Running ./bm
Run on (16 X 3194.05 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x8)
  L1 Instruction 64 KiB (x8)
  L2 Unified 512 KiB (x8)
  L3 Unified 8192 KiB (x2)
Load Average: 0.16, 0.15, 0.12
--------------------------------------------------------------
Benchmark                    Time             CPU   Iterations
--------------------------------------------------------------
BM_all_of/0              0.248 ns        0.248 ns   1000000000
BM_all_of/1              0.510 ns        0.510 ns   1000000000
BM_all_of/8               2.76 ns         2.76 ns    250915933
BM_all_of/64              28.1 ns         28.1 ns     25130057
BM_all_of/512              136 ns          136 ns      5115542
BM_all_of/4096            1084 ns         1084 ns       678951
BM_all_of/32768           8257 ns         8257 ns        84044
BM_all_of/65535          16655 ns        16655 ns        41223
BM_std_all_of/0          0.520 ns        0.520 ns   1000000000
BM_std_all_of/1          0.516 ns        0.516 ns   1000000000
BM_std_all_of/8           2.37 ns         2.37 ns    290607282
BM_std_all_of/64          13.2 ns         13.2 ns     48996825
BM_std_all_of/512          104 ns          104 ns      6693344
BM_std_all_of/4096         779 ns          779 ns       899729
BM_std_all_of/32768       6205 ns         6205 ns       112868
BM_std_all_of/65535      12387 ns        12387 ns        56428

Benchmarking with g++:

$ g++ -Ofast benchmark/bm_algorithm.cpp -isystem ../../benchmark/include -L ../../ben
chmark/build_gcc/src -lbenchmark -lpthread -o bm; ./bm
2021-07-07T16:05:38-07:00
Running ./bm
Run on (16 X 3194.05 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x8)
  L1 Instruction 64 KiB (x8)
  L2 Unified 512 KiB (x8)
  L3 Unified 8192 KiB (x2)
Load Average: 0.00, 0.02, 0.06
--------------------------------------------------------------
Benchmark                    Time             CPU   Iterations
--------------------------------------------------------------
BM_all_of/0              0.749 ns        0.749 ns    927807313
BM_all_of/1               1.25 ns         1.25 ns    562992870
BM_all_of/8               4.09 ns         4.09 ns    172825424
BM_all_of/64              28.9 ns         28.9 ns     24147840
BM_all_of/512              139 ns          139 ns      5133907
BM_all_of/4096            1074 ns         1074 ns       677743
BM_all_of/32768           8457 ns         8457 ns        84469
BM_all_of/65535          16650 ns        16650 ns        36781
BM_std_all_of/0           2.32 ns         2.32 ns    304514515
BM_std_all_of/1           3.67 ns         3.67 ns    196181853
BM_std_all_of/8           11.0 ns         11.0 ns     63325378
BM_std_all_of/64          75.1 ns         75.1 ns      9310900
BM_std_all_of/512          550 ns          550 ns      1266670
BM_std_all_of/4096        4351 ns         4351 ns       159969
BM_std_all_of/32768      34867 ns        34867 ns        20133
BM_std_all_of/65535      69588 ns        69589 ns        10025

I expected the code to run the same speed on each compiler. Instead:

  • clang++ (v10.0.0): ::all_of() was ~20% slower than std::all_of() (with larger number of elements).
  • g++ (v9.3.0): ::all_of() was ~4x faster than std::all_of().

Anyone know why these execution times might be different? I'm new to benchmarking this kind of thing.

like image 531
user79878 Avatar asked Jul 07 '21 23:07

user79878


1 Answers

As the comments above say, you're comparing the libstdc++ version of all_of with the libc++ version which you copied into your code, and the libstdc++ version is faster.

As for why it's faster: std::all_of calls std::find_if_not, which calls __find_if_not. This function negates the predicate and calls __find_if, passing it the iterator category as an additional parameter.

__find_if has two overloads: One for input iterators, which is similar to the implementation you showed above. The other is for random access iterators. This overload can calculate the length of the range, which allows it to unroll the loop and perform four predicate calls in each iteration. This is the reason it can work faster on random-access containers such as std::vector.

like image 132
interjay Avatar answered Sep 30 '22 19:09

interjay