I have two versions of code that produce equivalent results where I am trying to parallelize only the inner loop of a nested for
loop. I am not getting much speedup but I didn't expect a 1-to-1 since I'm trying only to parallelize the inner loop.
My main question is why these two versions have similar runtimes? Doesn't the second version fork threads only once and avoid the overhead of starting new threads on every iteration over i
as in the first version?
The first version of code starts up threads on every iteration of the outer loop like this:
for(i=0; i<2000000; i++){
sum = 0;
#pragma omp parallel for private(j) reduction(+:sum)
for(j=0; j<1000; j++){
sum += 1;
}
final += sum;
}
printf("final=%d\n",final/2000000);
With this output and runtime:
OMP_NUM_THREADS=1
final=1000
real 0m5.847s
user 0m5.628s
sys 0m0.212s
OMP_NUM_THREADS=4
final=1000
real 0m4.017s
user 0m15.612s
sys 0m0.336s
The second version of code starts threads once(?) before the outer loop and parallelizes the inner loop like this:
#pragma omp parallel private(i,j)
for(i=0; i<2000000; i++){
sum = 0;
#pragma omp barrier
#pragma omp for reduction(+:sum)
for(j=0; j<1000; j++){
sum += 1;
}
#pragma omp single
final += sum;
}
printf("final=%d\n",final/2000000);
With this output and runtime:
OMP_NUM_THREADS=1
final=1000
real 0m5.476s
user 0m4.964s
sys 0m0.504s
OMP_NUM_THREADS=4
final=1000
real 0m4.347s
user 0m15.984s
sys 0m1.204s
Why isn't the second version much faster than the first? Doesn't it avoid the overhead of starting threads on every loop iteration or am I doing something wrong?
In a nested loop, a break statement only stops the loop it is placed in. Therefore, if a break is placed in the inner loop, the outer loop still continues. However, if the break is placed in the outer loop, all of the looping stops.
OpenMP parallel regions can be nested inside each other. If nested parallelism is disabled, then the new team created by a thread encountering a parallel construct inside a parallel region consists only of the encountering thread. If nested parallelism is enabled, then the new team may consist of more than one thread.
When a loop is nested inside another loop, the inner loop runs many times inside the outer loop. In each iteration of the outer loop, the inner loop will be re-started. The inner loop must finish all of its iterations before the outer loop can continue to its next iteration.
An OpenMP implementation may use thread pooling to eliminate the overhead of starting threads on encountering a parallel construct. A pool of OMP_NUM_THREADS
threads is started for the first parallel construct, and after the construct is completed the slave threads are returned to the pool. These idle threads can be reallocated when a later parallel construct is encountered.
See for example this explanation of thread pooling in the Sun Studio OpenMP implementation.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With