My computer has four cores. I'm running Ubuntu 15.10, and compiling using g++ -fopenmp ...
I have two different types of jobs, Work1 and Work2, which are mutually independent. Work1 should run on a single processor, while Work2 should be parallelized. I tried using omp_set_num_threads():
#pragma omp parallel sections
{
    #pragma omp section
    {
        // Should run on one processor.
        omp_set_num_threads(1);
        Work1();
    }
    #pragma omp section
    {
        // Should run on as many processors as possible.
        omp_set_num_threads(3);
        Work2();
    }
}
Say Work2 is something like this:
void Work2(...) {
    #pragma omp parallel for
    for (...) ...
    return;
}
When the program is run, only two processors are used. Obviously omp_set_num_threads() is not working as I expected. Is there anything that can be done using OpenMP to remedy this situation?
Thanks to all,
Rodrigo
First of all, the OpenMP standard gives no guarantee that the two sections will be executed by different threads (Section 2.7.2, "sections Construct"):

The method of scheduling the structured blocks among the threads in the team is implementation defined.
The only reliable way to have the two work routines execute concurrently is by using explicit flow control based on the thread ID:
#pragma omp parallel num_threads(2)
{
    if (omp_get_thread_num() == 0)
    {
        omp_set_num_threads(1);
        Work1();
    }
    else
    {
        omp_set_num_threads(3);
        Work2();
    }
}
Further, whether the nested parallel region in Work2() will use more than one thread depends on a combination of factors, among them the values of several internal control variables (ICVs):

- nest-var, controlled by the OMP_NESTED environment variable and set at runtime by calling omp_set_nested();
- thread-limit-var, controlled by the OMP_THREAD_LIMIT environment variable and set at runtime by the application of the thread_limit clause;
- max-active-levels-var, controlled by the OMP_MAX_ACTIVE_LEVELS environment variable and set by calling omp_set_max_active_levels().

If nest-var is false, the values of the other ICVs do not matter: nested parallelism is disabled. This is the default value mandated by the standard, therefore nested parallelism must be enabled explicitly.

If nested parallelism is enabled, it only works at levels up to max-active-levels, with the outermost parallel region at level 1, the first nested parallel region at level 2, and so on. The default value of that ICV is the number of levels of nested parallelism supported by the implementation. Parallel regions nested at deeper levels are disabled, i.e. they execute serially with their master threads only.

If nested parallelism is enabled and a particular parallel region is nested no deeper than max-active-levels, then whether it executes in parallel is determined by the value of thread-limit-var. In your case, any value less than 4 will prevent Work2() from executing with three threads.
The following test program could be used to examine the interplay between those ICVs:
#include <stdio.h>
#include <omp.h>

void Work1(void)
{
    printf("Work1 started by tid %d/%d\n",
           omp_get_thread_num(), omp_get_num_threads());
}

void Work2(void)
{
    printf("Work2 started by tid %d/%d\n",
           omp_get_thread_num(), omp_get_num_threads());

    #pragma omp parallel for schedule(static)
    for (int i = 0; i < 3; i++)
    {
        printf("Work2 nested loop: %d by tid %d/%d\n", i,
               omp_get_thread_num(), omp_get_num_threads());
    }
}

int main(void)
{
    #pragma omp parallel num_threads(2)
    {
        if (omp_get_thread_num() == 0)
        {
            omp_set_num_threads(1);
            Work1();
        }
        else
        {
            omp_set_num_threads(3);
            Work2();
        }
    }
    return 0;
}
Sample outputs:
$ ./nested
Work1 started by tid 0/2
Work2 started by tid 1/2
Work2 nested loop: 0 by tid 0/1
Work2 nested loop: 1 by tid 0/1
Work2 nested loop: 2 by tid 0/1
The outermost parallel region is active. The nested one in Work2() is inactive because nested parallelism is disabled by default.
$ OMP_NESTED=TRUE ./nested
Work1 started by tid 0/2
Work2 started by tid 1/2
Work2 nested loop: 0 by tid 0/3
Work2 nested loop: 1 by tid 1/3
Work2 nested loop: 2 by tid 2/3
All parallel regions are active and execute in parallel.
$ OMP_NESTED=TRUE OMP_MAX_ACTIVE_LEVELS=1 ./nested
Work1 started by tid 0/2
Work2 started by tid 1/2
Work2 nested loop: 0 by tid 0/1
Work2 nested loop: 1 by tid 0/1
Work2 nested loop: 2 by tid 0/1
Despite nested parallelism being enabled, only one level of parallelism can be active, therefore the nested region executes serially. With pre-OpenMP 3.0 compilers, e.g. GCC 4.4, setting OMP_MAX_ACTIVE_LEVELS has no effect.
$ OMP_NESTED=TRUE OMP_THREAD_LIMIT=3 ./nested
Work1 started by tid 0/2
Work2 started by tid 1/2
Work2 nested loop: 0 by tid 0/2
Work2 nested loop: 2 by tid 1/2
Work2 nested loop: 1 by tid 0/2
The nested region is active, but executes with two threads only because of the global thread limit imposed by setting OMP_THREAD_LIMIT.
If you have enabled nested parallelism, there is no limit on the number of active levels, and the thread limit is sufficiently high, there should be no reason for your program not to use four CPU cores at the same time...
... unless process and/or thread binding is in effect. Binding controls the affinity of the different OpenMP threads to the available CPUs. With most OpenMP runtimes thread binding is disabled by default and the OS scheduler is free to move the threads between the available cores as it deems fit. Nevertheless, the runtimes usually respect the affinity mask that applies to the process as a whole. If you use something like taskset to pin/bind the process to, say, two logical CPUs, then no matter how many threads are spawned, they will all run on those two logical CPUs and timeshare. With GCC, thread binding is controlled by setting GOMP_CPU_AFFINITY and/or OMP_PROC_BIND, and, with recent versions that support OpenMP 4.0, by setting OMP_PLACES.
If you are not binding the executable (verify by checking the value of Cpus_allowed in /proc/$PID/status, where $PID is the PID of the running OpenMP process), neither GOMP_CPU_AFFINITY/OMP_PROC_BIND nor OMP_PLACES is set, nested parallelism is enabled, no limits on active parallelism levels or thread numbers are imposed, and programs like top or htop still show that only two logical CPUs are being used, then there is something wrong with your program's logic and not with the OpenMP environment.