I was reading Chapter 8 of the "Modern C++ Programming Cookbook, 2nd edition" on concurrency and stumbled upon something that puzzles me.
The author implements different versions of parallel map and reduce functions using std::thread
and std::async
. The implementations are really close; for example, the heart of the parallel_map
functions are
// parallel_map using std::async
...
tasks.emplace_back(std::async(
std::launch::async,
[=, &f] {std::transform(begin, last, begin, std::forward<F>(f)); }));
...
// parallel_map using std::thread
...
threads.emplace_back([=, &f] {std::transform(begin, last, begin, std::forward<F>(f)); });
...
The complete code can be found here for std::thread
and there for std::async
.
What puzzles me is that the computation times reported in the book give a significant and consistent advantage to the std::async
implementation. Moreover, the author acknowledge this fact as being obvious, without providing any hint of justification:
If we compare this [result with async] with the results from the parallel version using threads, we will find that these are faster execution times and that the speedup is significant, especially for the
fold
function.
I ran the code above on my computer, and even though the differences are not as compelling as in the book, I find that the std::async
implementation is indeed faster than the std::thread
one. (The author also later brings in standard implementations of these algorithms, which are even faster). On my computer, the code runs with four threads, which corresponds to the number of physical cores of my CPU.
Maybe I missed something, but why is it obvious that std::async
should run faster than std::thread
on this example? My intuition was that std::async
being a higher-level implementation of threads, it should take at least the same amount of time, if not more, than threads -- obviously I was wrong. Are those findings consistent, as suggested by the book, and what is the explanation?
The function template async runs the function f asynchronously (potentially in a separate thread which might be a part of a thread pool) and returns a std::future that will eventually hold the result of that function call. 1) Behaves as if (2) is called with policy being std::launch::async | std::launch::deferred.
For now, we know that if no policy is specified, then std::async launches a callable function in a separate thread. However, the C++ standard does not specify whether the thread is a new one or reused from a thread pool. Let us see how each of the three implementations launches a callable function.
std::async. Calls fn (with args as arguments) at some point, returning without waiting for the execution of fn to complete. The value returned by fn can be accessed through the future object returned (by calling its member future::get ).
As the name indicates, C++ async is a function template fn, which takes functions or function objects as arguments (basically called callbacks) and runs them asynchronously. It returns the std:: the future object which is used to keep the result of the above function. The result is stored in the shared state.
Concurrent approaches should never be used to speed up small computation problems. Reason for this is the time required to initialize and finalize libraries, mutexes, semaphores etc has a big impact on the output. On the other hand for a large computation problem, even though there is an extra overhead, the overall performance increases.
Is Concurrency Really Increase the Performance? If you want to increase the performance of your program one possible solution is to add concurrent programming techniques. Basically, in concurrent execution, multiple threads of the same program executes at the same time. It is similar to adding more workers to complete a job.
std::async is an easy way to do multiple things concurrently, without the hurdle of manual thread management in C++. Like batch converting images, database calls, http requests, you name it.
Since there are multiple cores, it is possible to execute more than one thread at a time. On a single core processor, actual parallelism is not possible. However, it tries to archive parallelism by switching between threads quickly.
My original interpretation was incorrect. Refer to @OznOg's answer below.
Modified Answer:
I created a simple benchmark that uses std::async
and std::thread
to do some tiny tasks:
#include <thread>
#include <chrono>
#include <vector>
#include <future>
#include <iostream>
__thread volatile int you_shall_not_optimize_this;
void work() {
// This is the simplest way I can think of to prevent the compiler and
// operating system from doing naughty things
you_shall_not_optimize_this = 42;
}
[[gnu::noinline]]
std::chrono::nanoseconds benchmark_threads(size_t count) {
std::vector<std::optional<std::thread>> threads;
threads.resize(count);
auto before = std::chrono::high_resolution_clock::now();
for (size_t i = 0; i < count; ++i)
threads[i] = std::thread { work };
for (size_t i = 0; i < count; ++i)
threads[i]->join();
threads.clear();
auto after = std::chrono::high_resolution_clock::now();
return after - before;
}
[[gnu::noinline]]
std::chrono::nanoseconds benchmark_async(size_t count, std::launch policy) {
std::vector<std::optional<std::future<void>>> results;
results.resize(count);
auto before = std::chrono::high_resolution_clock::now();
for (size_t i = 0; i < count; ++i)
results[i] = std::async(policy, work);
for (size_t i = 0; i < count; ++i)
results[i]->wait();
results.clear();
auto after = std::chrono::high_resolution_clock::now();
return after - before;
}
std::ostream& operator<<(std::ostream& stream, std::launch value)
{
if (value == std::launch::async)
return stream << "std::launch::async";
else if (value == std::launch::deferred)
return stream << "std::launch::deferred";
else
return stream << "std::launch::unknown";
}
// #define CONFIG_THREADS true
// #define CONFIG_ITERATIONS 10000
// #define CONFIG_POLICY std::launch::async
int main() {
std::cout << "Running benchmark:\n"
<< " threads? " << std::boolalpha << CONFIG_THREADS << '\n'
<< " iterations " << CONFIG_ITERATIONS << '\n'
<< " async policy " << CONFIG_POLICY << std::endl;
std::chrono::nanoseconds duration;
if (CONFIG_THREADS) {
duration = benchmark_threads(CONFIG_ITERATIONS);
} else {
duration = benchmark_async(CONFIG_ITERATIONS, CONFIG_POLICY);
}
std::cout << "Completed in " << duration.count() << "ns (" << std::chrono::duration_cast<std::chrono::milliseconds>(duration).count() << "ms)\n";
}
I've run the benchmark as follows:
$ g++ -Wall -Wextra -std=c++20 -pthread -O3 -DCONFIG_THREADS=false -DCONFIG_ITERATIONS=10000 -DCONFIG_POLICY=std::launch::deferred main.cpp -o main && ./main
Running benchmark:
threads? false
iterations 10000
async policy std::launch::deferred
Completed in 4783327ns (4ms)
$ g++ -Wall -Wextra -std=c++20 -pthread -O3 -DCONFIG_THREADS=false -DCONFIG_ITERATIONS=10000 -DCONFIG_POLICY=std::launch::async main.cpp -o main && ./main
Running benchmark:
threads? false
iterations 10000
async policy std::launch::async
Completed in 301756775ns (301ms)
$ g++ -Wall -Wextra -std=c++20 -pthread -O3 -DCONFIG_THREADS=true -DCONFIG_ITERATIONS=10000 -DCONFIG_POLICY=std::launch::deferred main.cpp -o main && ./main
Running benchmark:
threads? true
iterations 10000
async policy std::launch::deferred
Completed in 291284997ns (291ms)
$ g++ -Wall -Wextra -std=c++20 -pthread -O3 -DCONFIG_THREADS=true -DCONFIG_ITERATIONS=10000 -DCONFIG_POLICY=std::launch::async main.cpp -o main && ./main
Running benchmark:
threads? true
iterations 10000
async policy std::launch::async
Completed in 293539858ns (293ms)
I re-ran all the benchmarks with strace
attached and accumulated the system calls made:
# std::async with std::launch::async
1 access
2 arch_prctl
36 brk
10000 clone
6 close
1 execve
1 exit_group
10002 futex
10028 mmap
10009 mprotect
9998 munmap
7 newfstatat
6 openat
7 pread64
1 prlimit64
5 read
2 rt_sigaction
20001 rt_sigprocmask
1 set_robust_list
1 set_tid_address
5 write
# std::async with std::launch::deferred
1 access
2 arch_prctl
11 brk
6 close
1 execve
1 exit_group
10002 futex
28 mmap
9 mprotect
2 munmap
7 newfstatat
6 openat
7 pread64
1 prlimit64
5 read
2 rt_sigaction
1 rt_sigprocmask
1 set_robust_list
1 set_tid_address
5 write
# std::thread with std::launch::async
1 access
2 arch_prctl
27 brk
10000 clone
6 close
1 execve
1 exit_group
2 futex
10028 mmap
10009 mprotect
9998 munmap
7 newfstatat
6 openat
7 pread64
1 prlimit64
5 read
2 rt_sigaction
20001 rt_sigprocmask
1 set_robust_list
1 set_tid_address
5 write
# std::thread with std::launch::deferred
1 access
2 arch_prctl
27 brk
10000 clone
6 close
1 execve
1 exit_group
2 futex
10028 mmap
10009 mprotect
9998 munmap
7 newfstatat
6 openat
7 pread64
1 prlimit64
5 read
2 rt_sigaction
20001 rt_sigprocmask
1 set_robust_list
1 set_tid_address
5 write
We observe that std::async
is significantly faster with std::launch::deferred
but that everything else doesn't seem to matter as much.
My conclusions are:
The current libstdc++ implementation does not take advantage of the fact that std::async
doesn't need a new thread for each task.
The current libstdc++ implementation does some sort of locking in std::async
that std::thread
doesn't do.
std::async
with std::launch::deferred
saves setup and destroy costs and is much faster for this case.
My machine is configured as follows:
$ uname -a
Linux linux-2 5.12.1-arch1-1 #1 SMP PREEMPT Sun, 02 May 2021 12:43:58 +0000 x86_64 GNU/Linux
$ g++ --version
g++ (GCC) 10.2.0
Copyright (C) 2020 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
$ lscpu # truncated
Architecture: x86_64
Byte Order: Little Endian
CPU(s): 8
Model name: Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz
Original Answer:
std::thread
is a wrapper for thread objects which are provided by the operating system, they are extremely expensive to create and destroy.
std::async
is similar, but there isn't a 1-to-1 mapping between tasks and operating system threads. This could be implemented with thread pools, where threads are reused for multiple tasks.
So std::async
is better if you have many small tasks, and std::thread
is better if you have a few tasks that are running for long periods of time.
Also if you have things that truly need to happen in parallel, then std::async
might not fit very well. (std::thread
also can't make such guarantees, but that's the closest you can get.)
Maybe to clarify, in your case std::async
saves the overhead from creating and destroying threads.
(Depending on the operating system, you could also lose performance simply by having a lot of threads running. An operating system might have a scheduling strategy where it tries to guarantee that every thread gets executed every so often, thus the scheduler could decide go give the individual threads smaller slices of processing time, thus creating more overhead for switching between threads.)
Looks like something is not happening like expected. I compiled the whole thing on my fedora and the first results where surprising. I commented out all tests to keep only the 2 ones comparing threads and async. The output looks like confirming the behaviour:
Thead version result
size s map p map s fold p fold
10000 642 628 751 770
100000 6533 3444 7985 3338
500000 14885 5760 13854 6304
1000000 23428 11398 27795 12129
2000000 47136 22468 55518 24154
5000000 118690 55752 139489 60142
10000000 236496 112467 277413 121002
25000000 589277 276750 694742 297832
500000001.17839e+06 5553181.39065e+06 594102
Async version:
size s map p1 map p2 map s fold p1 fold p2 fold
10000 248 232 231 273 282 273
100000 2323 1562 2827 2757 1536 1766
500000 12312 5615 12044 14014 6272 7431
1000000 23585 11701 24060 27851 12376 14109
2000000 47147 22796 48035 55433 25565 30095
5000000 118465 59980 119698 140775 62960 68382
10000000 241727 110883 239554 277958 121205 136041
Looks like the async is actually 2 times faster (for small vallues) than the threaded one.
Then I used strace to count the number of clone
system call done (number of thread created):
64 clone with threads
92 clone with async
So looks like the explaination on the time spend to create thread is somehow in contradiction as he async version actually creates as many threads than the thread based one (the difference comes from the fact that there are two version of fold in the async code).
Then I tried to swap both test order of execution (put async before thread), and here are the results:
size s map p1 map p2 map s fold p1 fold p2 fold
10000 653 694 624 718 748 718
100000 6731 3931 2978 8533 3116 1724
500000 12406 5839 14589 13895 8427 7072
1000000 23813 11578 24099 27853 13091 14108
2000000 47357 22402 48197 55469 24572 33543
5000000 117923 55869 120303 139061 61801 68281
10000000 234861 111055 239124 277153 121270 136953
size s map p map s fold p fold
10000 232 232 273 328
100000 6424 3271 8297 4487
500000 21329 5547 13913 6263
1000000 23654 11419 27827 12083
2000000 47230 22763 55653 24135
5000000 117448 56785 139286 61679
10000000 235394 111021 278177 119805
25000000 589329 279637 696392 301485
500000001.1824e+06 5564431.38722e+06 606279
So now the "thread" version is 2 time faster than async for small values.
Looking at clone calls di not show any differences:
92 clone
64 clone
I didn't have much time to go further, but at least on linux, we can consider that there are no differences between the two version (the async could even be seen as less efficient as it requires more threads).
We can see that it has nothing to do with the async/thread problem.
Moreover, if we look at values that realy need computation time, the difference of time is really small and not relevent: 55752us vs 56785us for 5'000'000, and keeps matching for bigger values.
This looks like usual problem with microbenches, we somehow measure latencies of the system more than the computation time itself.
Note: the figures shown are without optimization (original code) ; adding -O3
obviously speeds up the computations, but results show the same: there is not real difference in computation time on big values.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With