
Avoiding recursive template instantiation overflow in parallel recursive asynchronous algorithms

This issue is easier to explain through a simplified example (since my real situation is far from "minimal"): given a...

template <typename T>
void post_in_thread_pool(T&& f) 

...function template, I want to create a parallel asynchronous algorithm that has a tree-like recursive structure. I am going to write an example of the structure below using std::count_if as a placeholder. The strategy I am going to use is as follows:

  • If the length of the range I am checking is smaller than 64, I will fall back to the sequential std::count_if function. (0)

  • If it's greater than or equal to 64, I will spawn a job in the thread pool that recurses on the left half of the range, and compute the right half on the current thread. (1)

    • I will use an atomic shared int to "wait" for the two halves to be computed. (2)

    • I will use an atomic shared int to accumulate the partial results. (3)

Simplified code:

auto async_count_if(auto begin, auto end, auto predicate, auto continuation)
{
    // (0) Base case:  
    if(end - begin < 64)
    {
        continuation(std::count_if(begin, end, predicate));
        return;
    }

    // (1) Recursive case:
    auto counter = make_shared<atomic<int>>(2); // (2)
    auto cleanup = [=, accumulator = make_shared<atomic<int>>(0) /*(3)*/]
                   (int partial_result)
    {
        *accumulator += partial_result; 

        if(--*counter == 0)
        {
            continuation(*accumulator);
        }
    };

    const auto mid = std::next(begin, (end - begin) / 2);

    post_in_thread_pool([=]
    {
        async_count_if(begin, mid, predicate, cleanup);
    });

    async_count_if(mid, end, predicate, cleanup);
}

The code can then be used as follows:

std::vector<int> v(512);
std::iota(std::begin(v), std::end(v), 0);

async_count_if(std::begin(v), std::end(v), 
/*    predicate */ [](auto x){ return x < 256; }, 
/* continuation */ [](auto res){ std::cout << res << std::endl; });

The problem in the code above is auto cleanup. Since auto is deduced to a unique type for every instantiation of the cleanup lambda, and since cleanup captures continuation by value, an infinitely nested lambda type has to be computed at compile time due to the recursion, resulting in the following error:

fatal error: recursive template instantiation exceeded maximum depth of 1024

wandbox example

Conceptually, you can think of the type being built up roughly like this:

cont                                // user-provided continuation
cleanup0<cont>                      // recursive step 0
cleanup1<cleanup0<cont>>            // recursive step 1
cleanup2<cleanup1<cleanup0<cont>>>  // recursive step 2
// ...
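
To see the same mechanism in isolation, here is a hypothetical minimal repro (not the code above): every call wraps its continuation in a new closure type, so instantiating the function template for one continuation type requires instantiating it for the wrapper type, and so on without bound.

template <typename F>
void recurse(int depth, F continuation)
{
    if(depth == 0) { continuation(0); return; }

    // Wrapping `continuation` creates a brand-new closure type, so
    // recurse<F> instantiates recurse<wrapper<F>>, which instantiates
    // recurse<wrapper<wrapper<F>>>, and so on until the compiler gives up.
    recurse(depth - 1, [continuation](int x){ continuation(x + 1); });
}

int main() { recurse(10, [](int){}); } // fails to compile: the closure type
                                       // grows with each instantiation, even
                                       // though the run-time depth is only 10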

(!): keep in mind that async_count_if is just an example, to show the "tree-like" recursive structure of my real situation. I am aware that an asynchronous count_if could be trivially implemented with a single atomic counter and sz / 64 tasks.


I would like to avoid the error, minimizing any possible run-time or memory overhead.

  • One possible solution is using std::function<void(int)> cleanup, which allows the code to compile and run correctly, but produces sub-optimal assembly and introduces extra dynamic allocations (a sketch of this follows the list). wandbox example

  • Another possible solution is using a std::size_t template parameter plus specialization to artificially limit async_count_if's recursion depth - this unfortunately can bloat the binary size and is quite inelegant.
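
For reference, a rough sketch of the std::function approach from the first bullet, reusing the post_in_thread_pool primitive from above. Because the continuation parameter is always std::function<void(int)>, the instantiation chain terminates, at the cost of type erasure and possible allocations:

#include <algorithm>
#include <atomic>
#include <functional>
#include <iterator>
#include <memory>

void async_count_if(auto begin, auto end, auto predicate,
                    std::function<void(int)> continuation)
{
    if(end - begin < 64)
    {
        continuation(std::count_if(begin, end, predicate));
        return;
    }

    auto counter     = std::make_shared<std::atomic<int>>(2);
    auto accumulator = std::make_shared<std::atomic<int>>(0);

    // `cleanup` is still a unique closure type, but it is converted to
    // std::function<void(int)> at each recursive call, so the continuation
    // type no longer grows with the recursion depth.
    std::function<void(int)> cleanup = [=](int partial_result)
    {
        *accumulator += partial_result;
        if(--*counter == 0) { continuation(*accumulator); }
    };

    const auto mid = std::next(begin, (end - begin) / 2);

    post_in_thread_pool([=]{ async_count_if(begin, mid, predicate, cleanup); });
    async_count_if(mid, end, predicate, cleanup);
}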

What bothers me is that I know the size of the range when I call async_count_if: it is std::distance(begin, end). If I know the size of the range, I can also deduce the number of required counters and continuations: (2^k - 1), where k is the depth of the recursion tree.

Therefore, I think that there should be a way of pre-computing a "control structure" in the first invocation of async_count_if and passing it down to the recursive calls by reference. This "control structure" could contain enough space for (2^k - 1) atomic counters and (2^k - 1) cleanup/continuation functions.
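
A rough sketch of what such a control structure could look like (the names node_state and control_block are made up for illustration, and the slot-indexing scheme would still have to be threaded through the recursive calls):

#include <atomic>
#include <functional>
#include <vector>

// Hypothetical pre-allocated control structure: one slot per internal node
// of the recursion tree, sized up front to 2^k - 1 entries.
struct node_state
{
    std::atomic<int> pending{2}; // children not yet finished
    std::atomic<int> partial{0}; // accumulated partial result
    int parent = -1;             // index of the parent slot, -1 for the root
};

struct control_block
{
    std::vector<node_state> nodes;
    std::function<void(int)> done; // user continuation, called once at the root

    control_block(std::size_t node_count, std::function<void(int)> continuation)
        : nodes(node_count), done(std::move(continuation)) {}

    // Propagate a finished child's result towards the root. Nodes are
    // addressed by index, so no nested closure types are ever created.
    void complete(std::size_t node, int result)
    {
        auto& n = nodes[node];
        n.partial += result;

        if(--n.pending == 0)
        {
            if(n.parent == -1) done(n.partial);
            else               complete(static_cast<std::size_t>(n.parent), n.partial);
        }
    }
};

The recursive calls would then take a reference to the control_block plus the index of their own slot instead of a freshly captured cleanup lambda, so the callable types involved stay the same at every recursion depth.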

I unfortunately could not find a clean way to implement this, and decided to post a question here, as this problem seems like it should be common when developing asynchronous parallel recursive algorithms.

What's an elegant way of dealing with this issue without introducing unnecessary overhead?

Asked Jan 10 '17 by Vittorio Romeo


2 Answers

I must be missing something very obvious, but why do you need multiple counters and structures? You can precompute the total count of iterations (if you know your base case) and share it, along with the accumulator, throughout all iterations, a la (I had to modify your simplified code a bit):

#include <algorithm>
#include <atomic>
#include <memory>
#include <vector>
#include <iostream>
#include <numeric>
#include <future>

using namespace std;

template <class T>
auto post_in_thread_pool(T&& work)
{
    std::async(std::launch::async, work);
}

template <class It, class Pred, class Cont>
auto async_count_if(It begin, It end, Pred predicate, Cont continuation)
{
    // (0) Base case:  
    if(end - begin <= 64)
    {
        continuation(std::count_if(begin, end, predicate));
        return;
    }

    const auto sz = std::distance(begin, end);
    const auto mid = std::next(begin, sz / 2);                

    post_in_thread_pool([=]
    {
         async_count_if(begin, mid, predicate, continuation);
    });

    async_count_if(mid, end, predicate, continuation);
}

template <class It, class Pred, class Cont>
auto async_count_if_facade(It begin, It end, Pred predicate, Cont continuation)
{
    // (1) Recursive case:
    const auto sz = std::distance(begin, end);
    auto counter = make_shared<atomic<int>>(sz / 64); // (fix this for mod 64 !=0 cases)
    auto cleanup = [=, accumulator = make_shared<atomic<int>>(0) /*(3)*/]
                   (int partial_result)
    {
        *accumulator += partial_result; 

        if(--*counter == 0)
        {
            continuation(*accumulator);
        }
    };

    return async_count_if(begin, end, predicate, cleanup);
}

int main ()
{
    std::vector<int> v(1024);
    std::iota(std::begin(v), std::end(v), 0);

    async_count_if_facade(std::begin(v), std::end(v), 
    /*    predicate */ [](auto x){ return x > 1000; }, 
    /* continuation */ [](const auto& res){ std::cout << res << std::endl; });
}

Some demo
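
As the "(fix this for mod 64 != 0 cases)" comment hints, initialising the counter to sz / 64 only matches the actual number of base-case invocations for sizes where the halving works out exactly (1024 does); in general the leaves outnumber sz / 64, the counter hits zero early, and the continuation fires with a partial result. A small helper that mirrors the splitting logic computes the exact count (a sketch, not part of the original answer):

// Count how many base-case invocations async_count_if will perform for a
// range of length sz, by mirroring its splitting logic exactly.
constexpr int count_leaves(long long sz)
{
    return sz <= 64 ? 1
                    : count_leaves(sz / 2) + count_leaves(sz - sz / 2);
}

// In async_count_if_facade, the counter would then be initialized as:
//   auto counter = make_shared<atomic<int>>(count_leaves(sz));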

Answered by Oleg Bogdanov


Your use of atomic integers to synchronize is shared mutable state. Shared mutable state kills performance in parallel algorithms, and here that state is shared across every thread.

Don't do that.

template<class T>
auto sink_into_pointer( T* target ) {
  return [target](T x){*target=x;};
}
template<class T>
auto sink_into_promise( std::promise<T>& p ) {
  return [&p](T x){p.set_value(x);};
}
void async_count_if(auto begin, auto end, auto predicate, auto continuation) {
  // (0) Base case:  
  if(end - begin < 64)
  {
    continuation(std::count_if(begin, end, std::move(predicate)));
    return;
  }

  std::promise< int > sub_count;
  auto sub_count_value = sub_count.get_future();

  auto sub_count_task = sink_into_promise(sub_count);
  // (1) Recursive case:
  const auto mid = std::next(begin, (end - begin) / 2);

  post_in_thread_pool(
    [=]() mutable // needs begin, mid and predicate as well, not just the sink
    {
      async_count_if(begin, mid, predicate, sub_count_task);
    }
  );

  int second_half = 0;
  auto second_sub_count = sink_into_pointer(&second_half);

  async_count_if(mid, end, predicate, second_sub_count);

  continuation( second_half + sub_count_value.get() );
}

In this case, the only shared state between threads is the values returned via the promise/future pairs, and the thread pool manager.

When writing parallel code, your goal should be to maximize parallelism more than to maximize speed within a given thread. Contention over shared resources and the like causes worse scaling problems than following a function pointer once per thread.
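
A minimal way to exercise this version, pairing the sink_into_* helpers and async_count_if above with a stand-in post_in_thread_pool (the detached-std::thread version below is an assumption for the demo, since the answer reuses the question's primitive):

#include <algorithm>
#include <future>
#include <iostream>
#include <numeric>
#include <thread>
#include <utility>
#include <vector>

template <typename T>
void post_in_thread_pool(T&& f)
{
    // Stand-in only: run the job on a detached thread. A real pool would
    // reuse worker threads instead of spawning one per job.
    std::thread(std::forward<T>(f)).detach();
}

// ... sink_into_pointer, sink_into_promise and async_count_if as above ...

int main()
{
    std::vector<int> v(512);
    std::iota(std::begin(v), std::end(v), 0);

    async_count_if(std::begin(v), std::end(v),
        /*    predicate */ [](auto x){ return x < 256; },
        /* continuation */ [](auto res){ std::cout << res << '\n'; }); // prints 256
}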

Answered by Yakk - Adam Nevraumont