c++ why std::async slower than sequential execution

Question

#include <future>
#include <iostream>
#include <vector>
#include <cstdint>
#include <algorithm>
#include <random>
#include <chrono>
#include <utility>
#include <type_traits>

template <class Clock = std::chrono::high_resolution_clock, class Task>
double timing(Task&& t, typename std::result_of<Task()>::type* r = nullptr)
{
  using namespace std::chrono;
  auto begin = Clock::now();
  if (r != nullptr) *r = std::forward<Task>(t)();
  auto end = Clock::now();
  return duration_cast<duration<double>>(end - begin).count();
}

template <typename Num>
double sum(const std::vector<Num>& v, const std::size_t l, const std::size_t h)
{
  double s;
  for (auto i = l; i <= h; i++) s += v[i];
  return s;
}

template <typename Num>
double asum(const std::vector<Num>& v, const std::size_t l, const std::size_t h)
{
  auto m = (l + h) / 2;
  auto s1 = std::async(std::launch::async, sum<Num>, v, l, m);
  auto s2 = std::async(std::launch::async, sum<Num>, v, m+1, h);
  return s1.get() + s2.get();
}

int main()
{
  std::vector<uint> v(1000);
  auto s = std::chrono::system_clock::now().time_since_epoch().count();
  std::generate(v.begin(), v.end(), std::minstd_rand0(s));

  double r;
  std::cout << 1000 * timing([&]() -> double { return asum(v, 0, v.size() - 1); }, &r) << " msec | rst " << r << std::endl;
  std::cout << 1000 * timing([&]() -> double { return sum(v, 0, v.size() - 1); }, &r) << " msec | rst " << r << std::endl;
}

Hi,

So above are two functions for summing up a vector of random numbers.

I did several runs, but it seems that I did not benefit from std::async. Below are some results I got.

0.130582 msec | rst 1.09015e+12
0.001402 msec | rst 1.09015e+12

0.23185 msec | rst 1.07046e+12
0.002308 msec | rst 1.07046e+12

0.18052 msec | rst 1.07449e+12
0.00244 msec | rst 1.07449e+12

0.190455 msec | rst 1.08319e+12
0.002315 msec | rst 1.08319e+12

All four cases the async version spent more time. But ideally I should have been two times faster right?

Did I miss anything in my code?

By the way I am running on OS X 10.10.4 Macbook Air with 1.4 GHz Intel Core i5.

Thanks,

Edits:

compiler flags: g++ -o asum asum.cpp -std=c++11
I changed the flag to include -O3 and vector size to be 10000000, but the results are still weired.

72.1743 msec | rst 1.07349e+16
14.3739 msec | rst 1.07349e+16

58.3542 msec | rst 1.07372e+16
12.1143 msec | rst 1.07372e+16

57.1576 msec | rst 1.07371e+16
11.9332 msec | rst 1.07371e+16

59.9104 msec | rst 1.07395e+16
11.9923 msec | rst 1.07395e+16

64.032 msec | rst 1.07371e+16
12.0929 msec | rst 1.07371e+16

Deepak Kr Gupta · Accepted Answer

Well, this is the simplest possible example and the results should not be binding because of following reasons.

when you create a thread, it takes some extra CPU cycles to create thread context and stack. Those cycles are added to the sum function.
When main thread runs this code, the main thread was empty and not doing anything else other than doing the sum

we go for multithreaded solution only when we can't accomplish something in a single thread or we need to synchronously wait for some input.

Just by creating threads, you don't increase performance. You increase performance by carefully designing multithreaded applications where you can use empty CPU's when some other things are waiting for example IO

Massimiliano Janes · Answer

here

auto s1 = std::async(std::launch::async, sum<Num>, v, l, m);
auto s2 = std::async(std::launch::async, sum<Num>, v, m+1, h);

async will store its own vector copy, twice. You should use std::cref and make sure the futures are retrieved before the vector dies ( as it is in your current code ) and that accesses get properly synchronized ( as it is in your current code ).

As mentioned in comments, thread creation overhead may further slow down the code.

hungptit · Answer

First, the performance of your original async function is bad compared with the sequential because it make one more copy of your test data as mentioned in other answers. Second, you might not be able to see the improvement after fixing copying issue because creating threads is not cheap and it can kill your performance gain.

From the benchmark results, I can see async version is 1.88 times faster than that of the sequential version for N = 1000000. However, if I use N = 10000 then async version is 3.55 times slower. Both non-iterator and iterator solutions produce similar results.

Beside that you should use iterator when writing your code because this approach is more flexible for example you can try different container types, will give you similar performance compared to C style version, and it also is more elegant IMHO :).

Benchmark results:

-----------------------------------------------------------------------------------------------------------------------------------------------
     Group      |   Experiment    |   Prob. Space   |     Samples     |   Iterations    |    Baseline     |  us/Iteration   | Iterations/sec  |

-----------------------------------------------------------------------------------------------------------------------------------------------
sum             | orginal         |            Null |              50 |              10 |         1.00000 |      1366.30000 |          731.90 |
sum             | orginal_async   |            Null |              50 |              10 |         0.53246 |       727.50000 |         1374.57 |
sum             | iter            |            Null |              50 |              10 |         1.00022 |      1366.60000 |          731.74 |
sum             | iter_async      |            Null |              50 |              10 |         0.53261 |       727.70000 |         1374.19 |
Complete.

Celero
Timer resolution: 0.001000 us
-----------------------------------------------------------------------------------------------------------------------------------------------
     Group      |   Experiment    |   Prob. Space   |     Samples     |   Iterations    |    Baseline     |  us/Iteration   | Iterations/sec  |

-----------------------------------------------------------------------------------------------------------------------------------------------
sum             | orginal         |            Null |              50 |              10 |         1.00000 |        13.60000 |        73529.41 |
sum             | orginal_async   |            Null |              50 |              10 |         3.55882 |        48.40000 |        20661.16 |
sum             | iter            |            Null |              50 |              10 |         1.00000 |        13.60000 |        73529.41 |
sum             | iter_async      |            Null |              50 |              10 |         3.53676 |        48.10000 |        20790.02 |
Complete.

Complete code sample

#include <algorithm>
#include <chrono>
#include <future>
#include <iostream>
#include <random>
#include <type_traits>
#include <vector>
#include <iostream>

#include "celero/Celero.h"

constexpr int NumberOfSamples = 50;
constexpr int NumberOfIterations = 10;

template <typename Container>
double sum(const Container &v, const std::size_t begin, const std::size_t end) {
    double s = 0;
    for (auto idx = begin; idx < end; ++idx) {
        s += v[idx];
    }
    return s;
}

template <typename Container>
double sum_async(const Container &v, const std::size_t begin, const std::size_t end) {
    auto middle = (begin + end) / 2;
        // Removing std::cref will slow down this function because it makes two copy of v..
    auto s1 = std::async(std::launch::async, sum<Container>, std::cref(v), begin, middle);
    auto s2 = std::async(std::launch::async, sum<Container>, std::cref(v), middle, end);s,
    return s1.get() + s2.get();
}

template <typename Iter>
typename std::iterator_traits<Iter>::value_type sum_iter(Iter begin, Iter end) {
    typename std::iterator_traits<Iter>::value_type results = 0.0;
    std::for_each(begin, end, [&results](auto const item) { results += item; });
    return results;
}

template <typename Iter>
typename std::iterator_traits<Iter>::value_type sum_iter_async(Iter begin, Iter end) {
    Iter middle = begin + std::distance(begin, end) / 2;
    auto s1 = std::async(std::launch::async, sum_iter<Iter>, begin, middle);
    auto s2 = std::async(std::launch::async, sum_iter<Iter>, middle, end);
    return s1.get() + s2.get();
}

template <typename T> auto create_test_data(const size_t N) {
    auto s = std::chrono::system_clock::now().time_since_epoch().count();
    std::vector<T> v(N);
    std::generate(v.begin(), v.end(), std::minstd_rand0(s));
    return v;
}

// Create test data
constexpr size_t N = 10000;
using value_type = double;
auto data = create_test_data<value_type>(N);
using container_type = decltype(data);

CELERO_MAIN

BASELINE(sum, orginal, NumberOfSamples, NumberOfIterations) {
    celero::DoNotOptimizeAway(sum<container_type>(data, 0, N));
}

BENCHMARK(sum, orginal_async, NumberOfSamples, NumberOfIterations) {
    celero::DoNotOptimizeAway(sum_async<container_type>(data, 0, N));
}

BENCHMARK(sum, iter, NumberOfSamples, NumberOfIterations) {
    celero::DoNotOptimizeAway(sum_iter(data.cbegin(), data.cend()));
}

BENCHMARK(sum, iter_async, NumberOfSamples, NumberOfIterations) {
    celero::DoNotOptimizeAway(sum_iter_async(data.cbegin(), data.cend()));

c++ why std::async slower than sequential execution

Tags:

c++

asynchronous

multithreading

c++11

Lin

3 Answers

Deepak Kr Gupta

Massimiliano Janes

hungptit

Recent Activity

Donate For Us

c++ why std::async slower than sequential execution

Tags:

c++

asynchronous

multithreading

c++11

Lin

3 Answers

Deepak Kr Gupta

Massimiliano Janes

hungptit

Related questions

Recent Activity

Donate For Us