
Why would concurrency using std::async be faster than using std::thread?

I was reading Chapter 8 of the "Modern C++ Programming Cookbook, 2nd edition" on concurrency and stumbled upon something that puzzles me.

The author implements different versions of parallel map and reduce functions using std::thread and std::async. The implementations are very close; for example, the hearts of the two parallel_map functions are:

// parallel_map using std::async
...
tasks.emplace_back(std::async(
  std::launch::async,
  [=, &f] {std::transform(begin, last, begin, std::forward<F>(f)); }));
...

// parallel_map using std::thread
...
threads.emplace_back([=, &f] {std::transform(begin, last, begin, std::forward<F>(f)); });
...

The complete code can be found here for std::thread and there for std::async.
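
Since those links may not render here, the following is a minimal self-contained sketch of the std::async variant, reconstructed around the snippets above (the chunking by std::thread::hardware_concurrency is an assumption on my part, not necessarily the book's exact code):

#include <algorithm>
#include <future>
#include <iterator>
#include <thread>
#include <vector>

// Split [first, last) into one chunk per hardware thread and transform
// each chunk in its own std::async task. f must be safe to call
// concurrently from several threads.
template <typename Iter, typename F>
void parallel_map(Iter first, Iter last, F f) {
    auto const size = std::distance(first, last);
    if (size == 0)
        return;

    auto const task_count =
        std::max(1u, std::thread::hardware_concurrency());
    auto const chunk = size / task_count;

    std::vector<std::future<void>> tasks;
    auto begin = first;
    for (unsigned i = 0; i < task_count; ++i) {
        // The last task takes whatever remains.
        auto end = (i == task_count - 1) ? last : std::next(begin, chunk);
        tasks.emplace_back(std::async(
            std::launch::async,
            [begin, end, &f] { std::transform(begin, end, begin, f); }));
        begin = end;
    }
    for (auto& t : tasks)
        t.wait();
}

The std::thread variant has the same shape, except the lambdas go into a std::vector<std::thread> that is joined at the end instead of waited on.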

What puzzles me is that the computation times reported in the book give a significant and consistent advantage to the std::async implementation. Moreover, the author treats this fact as obvious, without providing any hint of justification:

If we compare this [result with async] with the results from the parallel version using threads, we will find that these are faster execution times and that the speedup is significant, especially for the fold function.

I ran the code above on my computer, and even though the differences are not as compelling as in the book, I find that the std::async implementation is indeed faster than the std::thread one. (The author also later brings in standard implementations of these algorithms, which are even faster). On my computer, the code runs with four threads, which corresponds to the number of physical cores of my CPU.

Maybe I missed something, but why is it obvious that std::async should run faster than std::thread on this example? My intuition was that, since std::async is a higher-level abstraction over threads, it should take at least as much time as raw threads, if not more -- obviously I was wrong. Are those findings consistent, as the book suggests, and what is the explanation?

asked Apr 10 '21 by user209974



2 Answers

My original interpretation was incorrect. Refer to @OznOg's answer below.

Modified Answer:

I created a simple benchmark that uses std::async and std::thread to do some tiny tasks:

#include <thread>
#include <chrono>
#include <vector>
#include <future>
#include <iostream>
#include <optional>

__thread volatile int you_shall_not_optimize_this;

void work() {
    // This is the simplest way I can think of to prevent the compiler and
    // operating system from doing naughty things
    you_shall_not_optimize_this = 42;
}

// Create and join `count` bare std::threads, timing the whole batch.
[[gnu::noinline]]
std::chrono::nanoseconds benchmark_threads(size_t count) {
    std::vector<std::optional<std::thread>> threads;
    threads.resize(count);

    auto before = std::chrono::high_resolution_clock::now();

    for (size_t i = 0; i < count; ++i)
        threads[i] = std::thread { work };

    for (size_t i = 0; i < count; ++i)
        threads[i]->join();

    threads.clear();

    auto after = std::chrono::high_resolution_clock::now();

    return after - before;
}

// Launch `count` tasks via std::async with the given policy and wait on
// the returned futures, timing the whole batch.
[[gnu::noinline]]
std::chrono::nanoseconds benchmark_async(size_t count, std::launch policy) {
    std::vector<std::optional<std::future<void>>> results;
    results.resize(count);

    auto before = std::chrono::high_resolution_clock::now();

    for (size_t i = 0; i < count; ++i)
        results[i] = std::async(policy, work);

    for (size_t i = 0; i < count; ++i)
        results[i]->wait();

    results.clear();

    auto after = std::chrono::high_resolution_clock::now();

    return after - before;
}

std::ostream& operator<<(std::ostream& stream, std::launch value)
{
    if (value == std::launch::async)
        return stream << "std::launch::async";
    else if (value == std::launch::deferred)
        return stream << "std::launch::deferred";
    else
        return stream << "std::launch::unknown";
}

// #define CONFIG_THREADS true
// #define CONFIG_ITERATIONS 10000
// #define CONFIG_POLICY std::launch::async

int main() {
    std::cout << "Running benchmark:\n"
              << "  threads?     " << std::boolalpha << CONFIG_THREADS << '\n'
              << "  iterations   " << CONFIG_ITERATIONS << '\n'
              << "  async policy " << CONFIG_POLICY << std::endl;

    std::chrono::nanoseconds duration;
    if (CONFIG_THREADS) {
        duration = benchmark_threads(CONFIG_ITERATIONS);
    } else {
        duration = benchmark_async(CONFIG_ITERATIONS, CONFIG_POLICY);
    }

    std::cout << "Completed in " << duration.count() << "ns (" << std::chrono::duration_cast<std::chrono::milliseconds>(duration).count() << "ms)\n";
}

I've run the benchmark as follows:

$ g++ -Wall -Wextra -std=c++20 -pthread -O3 -DCONFIG_THREADS=false -DCONFIG_ITERATIONS=10000 -DCONFIG_POLICY=std::launch::deferred main.cpp -o main && ./main
Running benchmark:
  threads?     false
  iterations   10000
  async policy std::launch::deferred
Completed in 4783327ns (4ms)
$ g++ -Wall -Wextra -std=c++20 -pthread -O3 -DCONFIG_THREADS=false -DCONFIG_ITERATIONS=10000 -DCONFIG_POLICY=std::launch::async main.cpp -o main && ./main
Running benchmark:
  threads?     false
  iterations   10000
  async policy std::launch::async
Completed in 301756775ns (301ms)
$ g++ -Wall -Wextra -std=c++20 -pthread -O3 -DCONFIG_THREADS=true -DCONFIG_ITERATIONS=10000 -DCONFIG_POLICY=std::launch::deferred main.cpp -o main && ./main
Running benchmark:
  threads?     true
  iterations   10000
  async policy std::launch::deferred
Completed in 291284997ns (291ms)
$ g++ -Wall -Wextra -std=c++20 -pthread -O3 -DCONFIG_THREADS=true -DCONFIG_ITERATIONS=10000 -DCONFIG_POLICY=std::launch::async main.cpp -o main && ./main
Running benchmark:
  threads?     true
  iterations   10000
  async policy std::launch::async
Completed in 293539858ns (293ms)

I re-ran all the benchmarks with strace attached and accumulated the system calls made:

# std::async with std::launch::async
      1 access
      2 arch_prctl
     36 brk
  10000 clone
      6 close
      1 execve
      1 exit_group
  10002 futex
  10028 mmap
  10009 mprotect
   9998 munmap
      7 newfstatat
      6 openat
      7 pread64
      1 prlimit64
      5 read
      2 rt_sigaction
  20001 rt_sigprocmask
      1 set_robust_list
      1 set_tid_address
      5 write

# std::async with std::launch::deferred
      1 access
      2 arch_prctl
     11 brk
      6 close
      1 execve
      1 exit_group
  10002 futex
     28 mmap
      9 mprotect
      2 munmap
      7 newfstatat
      6 openat
      7 pread64
      1 prlimit64
      5 read
      2 rt_sigaction
      1 rt_sigprocmask
      1 set_robust_list
      1 set_tid_address
      5 write

# std::thread with std::launch::async
      1 access
      2 arch_prctl
     27 brk
  10000 clone
      6 close
      1 execve
      1 exit_group
      2 futex
  10028 mmap
  10009 mprotect
   9998 munmap
      7 newfstatat
      6 openat
      7 pread64
      1 prlimit64
      5 read
      2 rt_sigaction
  20001 rt_sigprocmask
      1 set_robust_list
      1 set_tid_address
      5 write

# std::thread with std::launch::deferred
      1 access
      2 arch_prctl
     27 brk
  10000 clone
      6 close
      1 execve
      1 exit_group
      2 futex
  10028 mmap
  10009 mprotect
   9998 munmap
      7 newfstatat
      6 openat
      7 pread64
      1 prlimit64
      5 read
      2 rt_sigaction
  20001 rt_sigprocmask
      1 set_robust_list
      1 set_tid_address
      5 write

We observe that std::async with std::launch::deferred is dramatically faster (4ms versus roughly 300ms), while the other three configurations perform about the same. (The launch policy is meaningless for the std::thread runs; it only labels the output.)
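
This is consistent with what std::launch::deferred means: the callable does not run until wait() or get() is called, and it then runs on the calling thread itself, so no thread is ever created, which matches the absence of clone calls in the strace output above. A minimal illustration, separate from the benchmark:

#include <future>
#include <iostream>
#include <thread>

int main() {
    std::cout << "caller thread:   " << std::this_thread::get_id() << '\n';

    // Deferred: nothing runs until wait()/get(), and the callable then
    // executes on the waiting thread itself (same id as the caller).
    auto deferred = std::async(std::launch::deferred, [] {
        std::cout << "deferred runs on " << std::this_thread::get_id() << '\n';
    });

    // Async: runs as if on a new thread (a different id).
    auto async = std::async(std::launch::async, [] {
        std::cout << "async runs on    " << std::this_thread::get_id() << '\n';
    });

    deferred.wait();
    async.wait();
}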

My conclusions are:

  • The current libstdc++ implementation does not take advantage of the fact that std::async doesn't need a new thread for each task: it issues one clone call per task, exactly like std::thread.

  • The current libstdc++ implementation does some sort of locking in std::async that std::thread doesn't do: both std::async runs make roughly 10000 futex calls (presumably synchronizing on the futures' shared state), versus 2 for the std::thread runs.

  • std::async with std::launch::deferred skips thread setup and teardown entirely (no clone calls at all) and is much faster for this case.

My machine is configured as follows:

$ uname -a
Linux linux-2 5.12.1-arch1-1 #1 SMP PREEMPT Sun, 02 May 2021 12:43:58 +0000 x86_64 GNU/Linux
$ g++ --version
g++ (GCC) 10.2.0
Copyright (C) 2020 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
$ lscpu # truncated
Architecture:                    x86_64
Byte Order:                      Little Endian
CPU(s):                          8
Model name:                      Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz

Original Answer:

std::thread is a wrapper around a thread object provided by the operating system; OS threads are extremely expensive to create and destroy.

std::async is similar, but there isn't a 1-to-1 mapping between tasks and operating system threads. This could be implemented with thread pools, where threads are reused for multiple tasks.
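
To illustrate the idea (and only the idea; as the modified answer above shows, current libstdc++ does not actually do this), a hypothetical pool-backed scheme might look like this minimal fixed-size thread pool:

#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Minimal fixed-size thread pool: tasks are queued and executed by a
// small set of long-lived worker threads instead of one thread per task.
class ThreadPool {
public:
    explicit ThreadPool(std::size_t n) {
        for (std::size_t i = 0; i < n; ++i)
            workers_.emplace_back([this] {
                for (;;) {
                    std::function<void()> task;
                    {
                        std::unique_lock<std::mutex> lock(mutex_);
                        cv_.wait(lock, [this] { return stop_ || !tasks_.empty(); });
                        if (stop_ && tasks_.empty())
                            return; // drained and shutting down
                        task = std::move(tasks_.front());
                        tasks_.pop();
                    }
                    task(); // run outside the lock
                }
            });
    }

    void submit(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            tasks_.push(std::move(task));
        }
        cv_.notify_one();
    }

    ~ThreadPool() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stop_ = true;
        }
        cv_.notify_all();
        for (auto& w : workers_)
            w.join();
    }

private:
    std::vector<std::thread> workers_;
    std::queue<std::function<void()>> tasks_;
    std::mutex mutex_;
    std::condition_variable cv_;
    bool stop_ = false;
};

With something like this behind std::async, submitting 10000 tiny tasks would cost 10000 queue operations on a handful of long-lived threads instead of 10000 clone system calls.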

So std::async is better if you have many small tasks, and std::thread is better if you have a few tasks that are running for long periods of time.

Also if you have things that truly need to happen in parallel, then std::async might not fit very well. (std::thread also can't make such guarantees, but that's the closest you can get.)
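
For example, the default launch policy allows the implementation to defer execution, so concurrency is only guaranteed when std::launch::async is requested explicitly. A small sketch:

#include <future>

int main() {
    // Default policy (std::launch::async | std::launch::deferred): the
    // implementation may choose to defer, so this task is not guaranteed
    // to run concurrently with the caller.
    auto maybe_deferred = std::async([] { return 1; });

    // Explicit std::launch::async: guaranteed to run as if on a new thread.
    auto forced = std::async(std::launch::async, [] { return 2; });

    int sum = maybe_deferred.get() + forced.get();
    return sum == 3 ? 0 : 1;
}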


Maybe to clarify, in your case std::async saves the overhead from creating and destroying threads.

(Depending on the operating system, you could also lose performance simply by having a lot of threads running. An operating system might have a scheduling strategy where it tries to guarantee that every thread gets executed every so often; the scheduler could then decide to give individual threads smaller slices of processing time, creating more overhead for switching between threads.)

answered Oct 17 '22 by asynts


Looks like something is not happening as expected. I compiled the whole thing on my Fedora machine, and the first results were surprising. I commented out all the tests to keep only the two comparing threads and async. The output seems to confirm the behaviour:

 Thread version results:

    size   s map   p map  s fold  p fold
   10000     642     628     751     770
  100000    6533    3444    7985    3338
  500000   14885    5760   13854    6304
 1000000   23428   11398   27795   12129
 2000000   47136   22468   55518   24154
 5000000  118690   55752  139489   60142
10000000  236496  112467  277413  121002
25000000  589277  276750  694742  297832
50000000 1.17839e+06  555318 1.39065e+06  594102

Async version:
    size   s map  p1 map  p2 map  s fold p1 fold p2 fold
   10000     248     232     231     273     282     273
  100000    2323    1562    2827    2757    1536    1766
  500000   12312    5615   12044   14014    6272    7431
 1000000   23585   11701   24060   27851   12376   14109
 2000000   47147   22796   48035   55433   25565   30095
 5000000  118465   59980  119698  140775   62960   68382
10000000  241727  110883  239554  277958  121205  136041

Looks like async is actually two times faster than the threaded version (for small values). Then I used strace to count the number of clone system calls made (i.e., the number of threads created):

 64 clone with threads
 92 clone with async

So it looks like the explanation based on the time spent creating threads is contradicted: the async version actually creates as many threads as the thread-based one (the difference comes from the fact that there are two versions of fold in the async code).

Then I swapped the order of execution of the two tests (running async before thread), and here are the results:

    size   s map  p1 map  p2 map  s fold p1 fold p2 fold
   10000     653     694     624     718     748     718
  100000    6731    3931    2978    8533    3116    1724
  500000   12406    5839   14589   13895    8427    7072
 1000000   23813   11578   24099   27853   13091   14108
 2000000   47357   22402   48197   55469   24572   33543
 5000000  117923   55869  120303  139061   61801   68281
10000000  234861  111055  239124  277153  121270  136953
    size   s map   p map  s fold  p fold
   10000     232     232     273     328
  100000    6424    3271    8297    4487
  500000   21329    5547   13913    6263
 1000000   23654   11419   27827   12083
 2000000   47230   22763   55653   24135
 5000000  117448   56785  139286   61679
10000000  235394  111021  278177  119805
25000000  589329  279637  696392  301485
50000000 1.1824e+06  556443 1.38722e+06  606279

So now the "thread" version is two times faster than async for small values.

Looking at the clone calls did not show any difference:

 92 clone
 64 clone

I didn't have much time to dig further, but at least on Linux, we can consider that there is no difference between the two versions (the async one could even be seen as less efficient, as it requires more threads).

We can see that the gap has nothing to do with the async/thread question itself: whichever version runs first comes out slower.

Moreover, if we look at sizes that really need computation time, the time difference is small and not relevant: 55752us vs 56785us for 5'000'000 elements, and the numbers keep matching for bigger values.

This looks like the usual problem with microbenchmarks: we end up measuring system latencies more than the computation time itself.

Note: the figures shown are without optimization (original code); adding -O3 obviously speeds up the computations, but the results show the same thing: there is no real difference in computation time for large values.

answered Oct 17 '22 by OznOg