Consider the following program, which calculates Fibonacci numbers using OpenMP tasks for parallelization.
    #include <iostream>
    #include <omp.h>

    using namespace std;

    int fib(int n)
    {
        if (n == 0 || n == 1)
            return n;

        int res, a, b;
        #pragma omp parallel
        {
            #pragma omp single
            {
                #pragma omp task shared(a)
                a = fib(n - 1);
                #pragma omp task shared(b)
                b = fib(n - 2);
                #pragma omp taskwait
                res = a + b;
            }
        }
        return res;
    }

    int main()
    {
        cout << fib(40);
    }
int main()
{
cout << fib(40);
}
I am using gcc 4.8.2 on Fedora 20. When I compile the program with g++ -fopenmp name_of_program.cpp -Wall and run it, htop shows that only two (sometimes three) threads are running. The machine I am running this program on has 8 logical CPUs. My question is: what do I need to do to spread the work across 8 threads?
I tried export OMP_NESTED=TRUE, but that leads to the following error at run time:

    libgomp: Thread creation failed: Resource temporarily unavailable

The point of my program is not to compute Fibonacci numbers efficiently, but to exercise tasks or something similar in OpenMP.
With OMP_NESTED=FALSE (the default), a team of threads is assigned to the top-level parallel region, and each nested parallel region runs with just the single thread that encounters it, so at most two threads will be doing useful work at a time.
With OMP_NESTED=TRUE, a team of threads is assigned at each level. Your system has 8 logical CPUs, so the team size is likely 8. A team reuses the thread that entered the region, so each nested region launches only 7 new threads. The recursion tree for fib(n) has about fib(n) nodes. (A nice self-referential property of fib!) Thus the code might create about 7*fib(n) threads; since fib(40) = 102,334,155, that exhausts resources very quickly.
The fix is to use a single parallel region around the entire task tree: move the omp parallel and omp single logic into main, outside of fib. That way one team of threads works on the whole task tree.
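A minimal sketch of that restructuring, assuming the same fib interface as the question (illustrative, not the only way to arrange it):

    #include <iostream>
    #include <omp.h>

    // fib now only creates tasks; it no longer opens parallel regions.
    int fib(int n)
    {
        if (n == 0 || n == 1)
            return n;

        int a, b;
        #pragma omp task shared(a)
        a = fib(n - 1);
        #pragma omp task shared(b)
        b = fib(n - 2);
        #pragma omp taskwait     // wait for both child tasks to finish
        return a + b;
    }

    int main()
    {
        int result;
        #pragma omp parallel     // one team of threads for the whole computation
        {
            #pragma omp single   // a single thread seeds the task tree
            result = fib(40);
        }
        std::cout << result << '\n';
    }

Now every thread in the one team can pick up tasks from anywhere in the recursion, so all 8 logical CPUs can be kept busy without any nested parallel regions.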
The general point is to distinguish potential parallelism from actual parallelism. The task directives specify potential parallelism, which might or might not actually be exploited during an execution. An omp parallel region (for all practical purposes) specifies actual parallelism. Usually you want the actual parallelism to match the available hardware, so as not to swamp the machine, but you want the potential parallelism to be much larger, so that the run-time can balance load.
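For example, a common way to keep the potential parallelism large but not unbounded is a serial cutoff below which no tasks are spawned (the CUTOFF value of 20 here is an illustrative guess, not something from the question):

    // Below CUTOFF, plain recursion is cheaper than task bookkeeping.
    const int CUTOFF = 20;

    int fib(int n)
    {
        if (n == 0 || n == 1)
            return n;
        if (n < CUTOFF)                      // small subproblem: stay serial
            return fib(n - 1) + fib(n - 2);

        int a, b;
        #pragma omp task shared(a)
        a = fib(n - 1);
        #pragma omp task shared(b)
        b = fib(n - 2);
        #pragma omp taskwait
        return a + b;
    }

With a cutoff there are still far more tasks than threads for large n, which is exactly what the run-time needs for load balancing, but each task does enough work to amortize its creation cost.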