I'm playing around with the Range-v3 library to perform a glorified find_if, and was curious why google-benchmark consistently ranks my Range-v3 code worse than my std::find_if approach. Both g++ and clang produce the same pattern with -O3 and #define NDEBUG.
The specific example I have in mind is the following using the STL:
std::vector<int> lengths(large_number, random_number);
auto const to_find = std::accumulate(lengths.begin(), lengths.end(), 0l) / 2;
auto accumulated_length = 0l;
auto found = std::find_if(lengths.begin(), lengths.end(), [&](auto const &val) {
accumulated_length += val;
return to_find < accumulated_length;
});
auto found_index = std::distance(lengths.begin(), found);
This is somewhat contrived for the purpose of illustration; in practice there would usually be a random generator for the to_find variable and random values in the lengths vector.
Using the Range-v3 library, I get the following code:
using namespace ranges;
std::vector<int> lengths(large_number, random_number);
auto const to_find = accumulate(lengths, 0l) / 2;
auto found_index = distance(lengths | view::partial_sum()
| view::take_while([=](auto const i) {
return i < to_find;
}));
My question is: why is the Range-v3 version slower than the STL implementation? I understand this is still an experimental library, but is there something wrong with the code example, or am I misusing the range concepts?
Here is an example google-benchmark driver (not sure if it is entirely correct):
#define NDEBUG
#include <numeric>
#include <vector>
#include <benchmark/benchmark.h>
#include <range/v3/all.hpp>
static void stl_search(benchmark::State &state) {
using namespace ranges;
std::vector<long> lengths(state.range(0), 1l);
auto const to_find = std::accumulate(lengths.begin(), lengths.end(), 0l) / 2;
while (state.KeepRunning()) {
auto accumulated_length = 0l;
auto const found = std::find_if(lengths.begin(), lengths.end(), [&](auto const& val) {
accumulated_length += val;
return to_find < accumulated_length;
});
volatile long val = std::distance(lengths.begin(), found);
}
state.SetBytesProcessed(int64_t(state.iterations()) *
int64_t(state.range(0)) * sizeof(long));
}
static void ranges_search(benchmark::State &state) {
using namespace ranges;
std::vector<long> lengths(state.range(0), 1l);
auto const to_find = accumulate(lengths, 0l) / 2;
while (state.KeepRunning())
{
volatile long val = distance(lengths | view::partial_sum()
| view::take_while([=](auto const& i) {
return i <= to_find;
}));
}
state.SetBytesProcessed(int64_t(state.iterations()) *
int64_t(state.range(0)) * sizeof(long));
}
BENCHMARK(ranges_search)->Range(8 << 8, 8 << 16);
BENCHMARK(stl_search)->Range(8 << 8, 8 << 16);
BENCHMARK_MAIN();
This gives
ranges_search/2048 756 ns 756 ns 902091 20.1892GB/s
ranges_search/4096 1495 ns 1494 ns 466681 20.4285GB/s
ranges_search/32768 11872 ns 11863 ns 58902 20.5801GB/s
ranges_search/262144 94982 ns 94892 ns 7364 20.5825GB/s
ranges_search/524288 189870 ns 189691 ns 3688 20.5927GB/s
stl_search/2048 348 ns 348 ns 2000964 43.8336GB/s
stl_search/4096 690 ns 689 ns 1008295 44.2751GB/s
stl_search/32768 5497 ns 5492 ns 126097 44.452GB/s
stl_search/262144 44725 ns 44681 ns 15882 43.7122GB/s
stl_search/524288 91027 ns 90936 ns 7616 42.9563GB/s
with clang 4.0.1 and
ranges_search/2048 2309 ns 2307 ns 298507 6.61496GB/s
ranges_search/4096 4558 ns 4554 ns 154520 6.70161GB/s
ranges_search/32768 36482 ns 36454 ns 19191 6.69726GB/s
ranges_search/262144 287072 ns 286801 ns 2438 6.81004GB/s
ranges_search/524288 574230 ns 573665 ns 1209 6.80928GB/s
stl_search/2048 299 ns 298 ns 2340691 51.1437GB/s
stl_search/4096 592 ns 591 ns 1176783 51.6363GB/s
stl_search/32768 4692 ns 4689 ns 149460 52.0711GB/s
stl_search/262144 37718 ns 37679 ns 18611 51.8358GB/s
stl_search/524288 75247 ns 75173 ns 9244 51.9633GB/s
with gcc 6.3.1. My machine has a Haswell generation processor. Both were compiled and executed with
g++ -Wall -O3 -std=c++14 Ranges.cpp -lbenchmark -lpthread && ./a.out
clang++ -Wall -O3 -std=c++14 Ranges.cpp -lbenchmark -lpthread && ./a.out
view::partial_sum over a range of int produces a range of int. If to_find > INT_MAX, the internal accumulator will overflow, resulting in undefined behavior. In practice, the algorithm most likely walks the entire input and returns the end iterator.
Conversely, the accumulated_length in the non-range-v3 approach is a long. It does not overflow, so it has defined behavior and returns before processing the entire input.
The range-v3 approach will have correct behavior if you transform the input range into a range of long before passing it through partial_sum:
auto found_index = distance(lengths
| view::transform(convert_to<long>{}) | view::partial_sum()
| view::take_while([=](auto const i) { return i < to_find; }));
Even with this correctness fix, it is still markedly slower than using the standard algorithms in my testing. Compiling this test program:
#include <chrono>
#include <iostream>
#include <random>
#include <vector>
#ifdef USE_RV3
#include <range/v3/core.hpp>
#include <range/v3/algorithm.hpp>
#include <range/v3/numeric.hpp>
#include <range/v3/view.hpp>
#else
#include <algorithm>
#include <numeric>
#endif
int main() {
constexpr size_t large_number = 1UL << 30;
int random_number = 42;
std::vector<int> const lengths(large_number, random_number);
using clock_t = std::chrono::steady_clock;
auto const start = clock_t::now();
#ifdef USE_RV3
auto const to_find = ranges::accumulate(lengths, 0l) / 2;
#else
auto const to_find = std::accumulate(lengths.begin(), lengths.end(), 0l) / 2;
#endif
auto const elapsed1 = clock_t::now() - start;
#ifdef USE_RV3
auto const found_index = ranges::distance(lengths
| ranges::view::transform(ranges::convert_to<long>{})
| ranges::view::partial_sum()
| ranges::view::take_while([=](auto const i) { return !(to_find < i); }));
#else
auto accumulated_length = 0l;
auto found = std::find_if(lengths.begin(), lengths.end(), [&](auto const &val) {
accumulated_length += val;
return to_find < accumulated_length;
});
auto const found_index = std::distance(lengths.begin(), found);
#endif
auto const elapsed2 = clock_t::now() - start;
std::cout << "elapsed1: "
<< std::chrono::duration<double, std::milli>(elapsed1).count()
<< " ms, to_find: " << to_find << "\n"
"elapsed2: "
<< std::chrono::duration<double, std::milli>(elapsed2).count()
<< " ms, result: " << found_index << '\n';
}
with
g++-6 -std=c++14 -Ofast -march=native -DNDEBUG rv3.cpp -I ~/dev/range-v3/include -S -o -
both without and with -DUSE_RV3, and diffing the assembly output, is interesting. The code generated up through the initialization of elapsed1 is identical in both cases. There are notable differences in the intermediate section between the initializations of elapsed1 and elapsed2: gcc does a much better job of optimizing the std version, where the hot loop is all together in a single run of code with branches for the terminating conditions. The range-v3 version is uglier and jumps around quite a bit; I speculate that we need to tweak the details of the implementation of partial_sum to make it more transparent to optimizers.