In OpenMP, I can create a bunch of tasks as follows and run them asynchronously using some fixed number of threads:
#pragma omp parallel
{
#pragma omp single
{
for (int i = 0; i < 1000; i++) {
#pragma omp task
f(i);
} } }
In C++11, I can do something not quite the same with std::async:
std::vector<std::future<int>> futures; // assuming f returns int
for (int i = 0; i < 1000; i++) {
auto fut = std::async(f, i);
futures.push_back(std::move(fut));
}
...
for (auto & fut : futures) {
auto res = fut.get();
// do something with res
}
What I worry about is efficiency. If I am correct, in OpenMP, tasks are stored in some task pool and then distributed to threads (automatically by the OpenMP runtime).
In C++, at the moment of invoking std::async, the runtime decides whether to run f(i) asynchronously in a new thread or to defer its run to the point of invoking std::future::get.
Consequently, either the runtime creates a new thread for each f(i), or all invocations of f(i) will be performed sequentially in the main thread (within the final loop). Both these options seem to be generally less efficient than what OpenMP does (create many tasks and run them concurrently on a fixed number of threads).
Is there any way to get the same behavior as what OpenMP tasks provide with C++ threading?
UPDATE
I did some measurements with the following code: https://wandbox.org/permlink/gLCFPr1IjTofxwQh on a 12-core Xeon E5 CPU, compiled with GCC 7.2 and -O2 (averages from several runs). The times seem to be practically the same.
However, I also tried the same with 500,000 tasks (n) and 1,000 iterations within them (m), and the times then differed significantly:
UPDATE 2
I measured how many times a new thread was created (following this answer to interpose pthread_create calls: https://stackoverflow.com/a/3709027/580083):
First experiment (20,000 tasks, 20,000 iterations within):
Second experiment (500,000 tasks, 1,000 iterations within):
Your analysis is quite good, but I think there is a loophole for thread pools in std::async.
OpenMP does use a fixed, user-controlled number of threads that execute tasks quite flexibly. untied tasks can even move between threads, although that doesn't seem to be well-supported in practice.
Yes, as per the C++11 standard, the implementation must choose either std::launch::async or std::launch::deferred. The former must create a std::thread object, while the latter will execute the task's code in the thread calling wait. However, the standard leaves a Note (emphasis mine):
If this policy is specified together with other policies, such as when using a policy value of launch::async | launch::deferred, implementations should defer invocation or the selection of the policy when no more concurrency can be effectively exploited.
To be honest, I fail to see how the standard wording besides that note would allow an implementation to defer the decision - but the standard seems to actually encourage thread pools! If the decision to choose launch::async is deferred, it means that the required new std::thread could reuse an existing thread of execution - at least I don't see why not.
Originally I thought that std::thread could also be implemented as green threads, which would sort of also mean a thread pool. However, the standard remarks that threads [managed by <thread>] are intended to map one-to-one with operating system threads.
At the end of the day, measure to verify your performance. There may be a very bad OpenMP implementation or a very clever standard library implementation.
This answer to a similar question presents some measurement results that indicate high overhead of std::async and also shares the measurement code.