
C++ async vs OpenMP tasks

In OpenMP, I can create a bunch of tasks as follows and run them asynchronously using some fixed number of threads:

#pragma omp parallel
{
   #pragma omp single 
   {
      for (int i = 0; i < 1000; i++) {
         #pragma omp task
         f(i);
      }
   }
}

In C++11, I can do something not-quite-the-same with std::async:

std::vector<std::future<decltype(f(0))>> futures;
for (int i = 0; i < 1000; i++) {
   auto fut = std::async(f, i);
   futures.push_back(std::move(fut));
}
...
for (auto & fut : futures) {
  auto res = fut.get();
  // do something with res
}

What I worry about is efficiency. If I understand correctly, in OpenMP, tasks are stored in some task pool and then distributed to threads automatically by the OpenMP runtime.

In C++, at the moment of invoking std::async, the runtime decides whether to run f(i) asynchronously in a new thread or defer its run to the point of invoking std::future::get.

Consequently, the runtime either

  1. creates 1000 threads and runs them all concurrently,
  2. or creates fewer threads, but then some invocations of f(i) will be performed sequentially in the main thread (within the final loop).

Both of these options seem generally less efficient than what OpenMP does (creating many tasks and running them concurrently on a fixed number of threads).

Is there any way to get the same behavior as what OpenMP tasks provide with C++ threading?

UPDATE

I did some measurements with the following code: https://wandbox.org/permlink/gLCFPr1IjTofxwQh on a 12-core Xeon E5 CPU, compiled with GCC 7.2 and -O2:

  1. OpenMP runtime with 12 threads: 12.2 [s]
  2. C++ threading runtime: 12.4 [s]

(averages from several runs). They seem to be practically the same.

However, I also tried the same with 500,000 tasks (n) and 1,000 iterations within each (m), and the times then differed significantly:

  1. OpenMP runtime with 12 threads: 15.1 [s]
  2. C++ threading runtime: 175.6 [s]
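
The linked benchmark is not reproduced here; structurally, the workload is n independent tasks that each perform m iterations of work, i.e. roughly this hypothetical shape (the body of f below is a placeholder of mine, not the measured code):

#include <cmath>

// Hypothetical stand-in for the per-task work: m iterations of cheap arithmetic.
double f(int i, int m) {
   double acc = 0.0;
   for (int j = 0; j < m; j++)
      acc += std::sin(i + j);   // placeholder computation
   return acc;
}

With n = 500,000 and m = 1,000, each individual task is cheap, so per-task dispatch overhead presumably dominates the C++ threading variant.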

UPDATE 2

I measured how many times a new thread was created (interposing pthread_create calls as described in this answer: https://stackoverflow.com/a/3709027/580083):

First experiment (20,000 tasks, 20,000 iterations within):

  1. OpenMP runtime with 12 threads: 11
  2. C++ threading runtime: 20,000

Second experiment (500,000 tasks, 1,000 iterations within):

  1. OpenMP runtime with 12 threads: 11
  2. C++ threading runtime: 32,744
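
For reference, the interposition approach from that answer boils down to an LD_PRELOAD shim around pthread_create. A minimal sketch of such a counter (the file name, the atexit report, and other details are my choices, not the linked code verbatim):

#ifndef _GNU_SOURCE
#define _GNU_SOURCE   // for RTLD_NEXT
#endif
#include <atomic>
#include <cstdio>
#include <cstdlib>
#include <dlfcn.h>
#include <pthread.h>

// Build as a shared object and inject it into the benchmark, e.g.:
//   g++ -shared -fPIC count_threads.cpp -o count_threads.so -ldl
//   LD_PRELOAD=./count_threads.so ./benchmark

static std::atomic<long> thread_count{0};

static void report() {
   std::fprintf(stderr, "pthread_create called %ld times\n", thread_count.load());
}

extern "C" int pthread_create(pthread_t * thread, const pthread_attr_t * attr,
                              void * (*start_routine)(void *), void * arg) {
   using create_fn = int (*)(pthread_t *, const pthread_attr_t *,
                             void * (*)(void *), void *);
   // Look up the real pthread_create on first use, then forward to it.
   static create_fn real_create = nullptr;
   if (!real_create) {
      real_create = reinterpret_cast<create_fn>(dlsym(RTLD_NEXT, "pthread_create"));
      std::atexit(report);
   }
   ++thread_count;
   return real_create(thread, attr, start_routine, arg);
}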
asked Dec 19 '17 by Daniel Langr




1 Answer

Your analysis is quite good, but I think there is a loophole for threadpools in std::async.

OpenMP does use a fixed, user-controlled number of threads that execute tasks quite flexibly. Tasks declared untied can even move between threads, although that doesn't seem to be well supported in practice.

Yes, as per the C++11 standard, the implementation must choose either std::launch::async or std::launch::deferred. The former must create a std::thread object, while the latter will execute the task's code in the thread calling wait. However, the standard leaves a Note (emphasis mine):

If this policy is specified together with other policies, such as when using a policy value of launch::async | launch::deferred, implementations should defer invocation or the selection of the policy when no more concurrency can be effectively exploited.

To be honest, I fail to see how the standard wording, apart from that note, would allow an implementation to defer the decision - but the standard seems to actually encourage thread pools! If the decision to choose launch::async is deferred, it means that the required new std::thread could reuse an existing thread of execution - at least I don't see why not.
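
To illustrate the difference described above, forcing each policy explicitly shows where the callable actually runs (this toy program only demonstrates the standard semantics; it is not taken from the question's benchmark):

#include <future>
#include <iostream>
#include <thread>

int work(int i) {
   std::cout << "task " << i << " runs on thread "
             << std::this_thread::get_id() << '\n';
   return 2 * i;
}

int main() {
   std::cout << "main thread:  " << std::this_thread::get_id() << '\n';

   // launch::async must behave as if a new std::thread executed the task.
   auto a = std::async(std::launch::async, work, 1);

   // launch::deferred stores the task and runs it lazily in the thread
   // that calls get() (or wait()).
   auto d = std::async(std::launch::deferred, work, 2);

   a.get();   // already running (or finished) on another thread
   d.get();   // executes work(2) right here, on the main thread

   // The default policy is launch::async | launch::deferred, which is the
   // wiggle room the quoted note refers to: the implementation picks.
   auto either = std::async(work, 3);
   either.get();
}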

Originally I thought that std::thread could also be implemented as green threads, which would sort of amount to a thread pool as well. However, the standard remarks that threads [managed by <thread>] are intended to map one-to-one with operating system threads.
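
If you need OpenMP-like task scheduling with standard C++ as it stands, the usual workaround is a hand-rolled thread pool: a fixed number of std::thread workers consuming a shared task queue. A minimal sketch, restricted to void-returning tasks (the class name and interface are mine, and a production pool would need more care):

#include <condition_variable>
#include <functional>
#include <future>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Fixed-size pool: worker threads repeatedly pop tasks from a shared queue.
class ThreadPool {
public:
   explicit ThreadPool(unsigned n) {
      for (unsigned i = 0; i < n; ++i)
         workers_.emplace_back([this] { worker_loop(); });
   }

   ~ThreadPool() {
      {
         std::lock_guard<std::mutex> lock(mtx_);
         done_ = true;
      }
      cv_.notify_all();
      for (auto & w : workers_) w.join();
   }

   // Enqueue a task; the returned future becomes ready once it has run.
   template <class F>
   std::future<void> submit(F fn) {
      auto task = std::make_shared<std::packaged_task<void()>>(std::move(fn));
      std::future<void> fut = task->get_future();
      {
         std::lock_guard<std::mutex> lock(mtx_);
         queue_.push([task] { (*task)(); });
      }
      cv_.notify_one();
      return fut;
   }

private:
   void worker_loop() {
      for (;;) {
         std::function<void()> job;
         {
            std::unique_lock<std::mutex> lock(mtx_);
            cv_.wait(lock, [this] { return done_ || !queue_.empty(); });
            if (done_ && queue_.empty()) return;
            job = std::move(queue_.front());
            queue_.pop();
         }
         job();   // run the task outside the lock
      }
   }

   std::vector<std::thread> workers_;
   std::queue<std::function<void()>> queue_;
   std::mutex mtx_;
   std::condition_variable cv_;
   bool done_ = false;
};

Mirroring the loop from the question, usage would look like:

ThreadPool pool(std::thread::hardware_concurrency());
std::vector<std::future<void>> futures;
for (int i = 0; i < 1000; i++)
   futures.push_back(pool.submit([i] { f(i); }));
for (auto & fut : futures)
   fut.get();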

At the end of the day, measure to verify your performance. There may be a very bad OpenMP implementation or a very clever standard library implementation.

This answer to a similar question presents some measurement results that indicate a high overhead of std::async, and also shares the measurement code.

answered by Zulan