I'm playing around with the Range-v3 library to perform a glorified find_if, and was curious why google-benchmark consistently ranks my Range-v3 code worse than my std::find_if approach. Both g++ and clang produce the same pattern with -O3 and #define NDEBUG.
The specific example I have in mind is the following using the STL:
std::vector<int> lengths(large_number, random_number);
auto const to_find = std::accumulate(lengths.begin(), lengths.end(), 0l) / 2;
auto accumulated_length = 0l;
auto found = std::find_if(lengths.begin(), lengths.end(), [&](auto const &val) {
accumulated_length += val;
return to_find < accumulated_length;
});
auto found_index = std::distance(lengths.begin(), found);
This is somewhat contrived for the purpose of illustration; in practice there would usually be a random generator for the to_find variable and random values in the lengths vector.
Using the Range-v3 library, I get the following code:
using namespace ranges;
std::vector<int> lengths(large_number, random_number);
auto const to_find = accumulate(lengths, 0l) / 2;
auto found_index = distance(lengths | view::partial_sum()
| view::take_while([=](auto const i) {
return i < to_find;
}));
My question is: why is the Range-v3 version slower than the STL implementation? I understand this is still an experimental library, but is there something wrong with the code example, or am I misusing the range concepts?
Here is an example google-benchmark driver (not sure if it is entirely correct):
#define NDEBUG
#include <numeric>
#include <vector>
#include <benchmark/benchmark.h>
#include <range/v3/all.hpp>
static void stl_search(benchmark::State &state) {
using namespace ranges;
std::vector<long> lengths(state.range(0), 1l);
auto const to_find = std::accumulate(lengths.begin(), lengths.end(), 0l) / 2;
while (state.KeepRunning()) {
auto accumulated_length = 0l;
auto const found = std::find_if(lengths.begin(), lengths.end(), [&](auto const& val) {
accumulated_length += val;
return to_find < accumulated_length;
});
volatile long val = std::distance(lengths.begin(), found);
}
state.SetBytesProcessed(int64_t(state.iterations()) *
int64_t(state.range(0)) * sizeof(long));
}
static void ranges_search(benchmark::State &state) {
using namespace ranges;
std::vector<long> lengths(state.range(0), 1l);
auto const to_find = accumulate(lengths, 0l) / 2;
while (state.KeepRunning())
{
volatile long val = distance(lengths | view::partial_sum()
| view::take_while([=](auto const& i) {
return i <= to_find;
}));
}
state.SetBytesProcessed(int64_t(state.iterations()) *
int64_t(state.range(0)) * sizeof(long));
}
BENCHMARK(ranges_search)->Range(8 << 8, 8 << 16);
BENCHMARK(stl_search)->Range(8 << 8, 8 << 16);
BENCHMARK_MAIN();
This gives
ranges_search/2048 756 ns 756 ns 902091 20.1892GB/s
ranges_search/4096 1495 ns 1494 ns 466681 20.4285GB/s
ranges_search/32768 11872 ns 11863 ns 58902 20.5801GB/s
ranges_search/262144 94982 ns 94892 ns 7364 20.5825GB/s
ranges_search/524288 189870 ns 189691 ns 3688 20.5927GB/s
stl_search/2048 348 ns 348 ns 2000964 43.8336GB/s
stl_search/4096 690 ns 689 ns 1008295 44.2751GB/s
stl_search/32768 5497 ns 5492 ns 126097 44.452GB/s
stl_search/262144 44725 ns 44681 ns 15882 43.7122GB/s
stl_search/524288 91027 ns 90936 ns 7616 42.9563GB/s
with clang 4.0.1 and
ranges_search/2048 2309 ns 2307 ns 298507 6.61496GB/s
ranges_search/4096 4558 ns 4554 ns 154520 6.70161GB/s
ranges_search/32768 36482 ns 36454 ns 19191 6.69726GB/s
ranges_search/262144 287072 ns 286801 ns 2438 6.81004GB/s
ranges_search/524288 574230 ns 573665 ns 1209 6.80928GB/s
stl_search/2048 299 ns 298 ns 2340691 51.1437GB/s
stl_search/4096 592 ns 591 ns 1176783 51.6363GB/s
stl_search/32768 4692 ns 4689 ns 149460 52.0711GB/s
stl_search/262144 37718 ns 37679 ns 18611 51.8358GB/s
stl_search/524288 75247 ns 75173 ns 9244 51.9633GB/s
with gcc 6.3.1. My machine has a Haswell generation processor. Both were compiled and executed with
g++ -Wall -O3 -std=c++14 Ranges.cpp -lbenchmark -lpthread && ./a.out
clang++ -Wall -O3 -std=c++14 Ranges.cpp -lbenchmark -lpthread && ./a.out
view::partial_sum over a range of int produces a range of int. If to_find > INT_MAX, the internal accumulator will overflow, resulting in undefined behavior. In practice, the algorithm most likely walks the entire input and returns the end iterator.
Conversely, the accumulated_length in the non-range-v3 approach is a long. It does not overflow, so it has defined behavior and returns before processing the entire input.
The range-v3 approach will have correct behavior if you transform the input range into a range of long before passing it through partial_sum:
auto found_index = distance(lengths
| view::transform(convert_to<long>{}) | view::partial_sum()
| view::take_while([=](auto const i) { return i < to_find; }));
Even with this correctness fix, it is still markedly slower than using the standard algorithms in my testing. Compiling this test program:
#include <chrono>
#include <iostream>
#include <random>
#include <vector>
#ifdef USE_RV3
#include <range/v3/core.hpp>
#include <range/v3/algorithm.hpp>
#include <range/v3/numeric.hpp>
#include <range/v3/view.hpp>
#else
#include <algorithm>
#include <numeric>
#endif
int main() {
constexpr size_t large_number = 1UL << 30;
int random_number = 42;
std::vector<int> const lengths(large_number, random_number);
using clock_t = std::chrono::steady_clock;
auto const start = clock_t::now();
#ifdef USE_RV3
auto const to_find = ranges::accumulate(lengths, 0l) / 2;
#else
auto const to_find = std::accumulate(lengths.begin(), lengths.end(), 0l) / 2;
#endif
auto const elapsed1 = clock_t::now() - start;
#ifdef USE_RV3
auto const found_index = ranges::distance(lengths
| ranges::view::transform(ranges::convert_to<long>{})
| ranges::view::partial_sum()
| ranges::view::take_while([=](auto const i) { return !(to_find < i); }));
#else
auto accumulated_length = 0l;
auto found = std::find_if(lengths.begin(), lengths.end(), [&](auto const &val) {
accumulated_length += val;
return to_find < accumulated_length;
});
auto const found_index = std::distance(lengths.begin(), found);
#endif
auto const elapsed2 = clock_t::now() - start;
std::cout << "elapsed1: "
<< std::chrono::duration<double, std::milli>(elapsed1).count()
<< " ms, to_find: " << to_find << "\n"
"elapsed2: "
<< std::chrono::duration<double, std::milli>(elapsed2).count()
<< " ms, result: " << found_index << '\n';
}
with
g++-6 -std=c++14 -Ofast -march=native -DNDEBUG rv3.cpp -I ~/dev/range-v3/include -S -o -
both without and with -DUSE_RV3, and diffing the assembly output, is interesting. The code generated up through the initialization of elapsed1 is identical in both cases. There are notable differences in the intermediate section between the initializations of elapsed1 and elapsed2: gcc does a much better job of optimizing the std version, where the hot loop is all together in a single run of code with branches for the terminating conditions. The range-v3 version is uglier and jumps around quite a bit; I speculate that we need to tweak the details of the implementation of partial_sum to make it more transparent to optimizers.