I'm reading the book An Introduction to Parallel Programming by Peter S. Pacheco. In Section 5.6.2, it gives an interesting discussion of reducing fork/join overhead. Consider the odd-even transposition sort algorithm:
for (phase = 0; phase < n; phase++) {
    if (phase % 2 == 0) {
#       pragma omp parallel for default(none) shared(n) private(i)
        for (i = 1; i < n; i += 2) { /* meat: compare-exchange a[i-1] and a[i] */ }
    } else {
#       pragma omp parallel for default(none) shared(n) private(i)
        for (i = 1; i < n - 1; i += 2) { /* meat: compare-exchange a[i] and a[i+1] */ }
    }
}
The author argues that the above code has somewhat high fork/join overhead, because the threads are forked and joined in every iteration of the outer loop. Hence, he proposes the following version:
# pragma omp parallel default(none) shared(n) private(i, phase)
for (phase = 0; phase < n; phase++) {
    if (phase % 2 == 0) {
#       pragma omp for
        for (i = 1; i < n; i += 2) { /* meat: compare-exchange a[i-1] and a[i] */ }
    } else {
#       pragma omp for
        for (i = 1; i < n - 1; i += 2) { /* meat: compare-exchange a[i] and a[i+1] */ }
    }
}
According to the author, the second version forks the threads before the outer loop starts and reuses them in each iteration, yielding better performance.
However, I'm suspicious of the correctness of the second version. In my understanding, a #pragma omp parallel directive spawns a team of threads and lets each thread execute the following structured block in parallel. In this case, the structured block is the whole outer for-loop, for(phase=0 ...). Then, shouldn't the whole outer loop be executed four times if 4 threads are used? That is, if n=10, wouldn't 40 iterations be executed across the 4 threads? What is wrong with my understanding? And how does omp parallel (without for) interact with a for-loop that follows it, as above?
The second version is correct.
Per the OpenMP specification, a #pragma omp parallel for directive is just a shortcut for #pragma omp parallel immediately followed by #pragma omp for, as in
#pragma omp parallel
{
    #pragma omp for
    for (int i = 0; i < N; ++i) { /* loop body */ }
}
If there is some code in the parallel region before or after the loop construct, it will be executed independently by each thread in the region (unless limited by other OpenMP directives). But #pragma omp for is a work-sharing construct: the loop that follows the directive is shared by all threads in the region, i.e. it is executed as a single loop whose iterations are split across the threads. Thus, even if the parallel region above is executed by 4 threads, the loop is still executed just once, not 4 times.
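To make the distinction concrete, here is a small self-contained sketch (my own example, not from the book or the question): the printf before the work-sharing loop runs once per thread, while the 8 iterations of the loop under #pragma omp for are divided among the threads, so each iteration is executed exactly once in total.

#include <stdio.h>
#include <omp.h>

int main(void) {
    #pragma omp parallel num_threads(4)
    {
        /* Replicated code: every thread in the region executes this line. */
        printf("hello from thread %d\n", omp_get_thread_num());

        /* Work-sharing loop: its 8 iterations are divided among the 4 threads,
           so each iteration runs exactly once in total. */
        #pragma omp for
        for (int i = 0; i < 8; ++i)
            printf("iteration %d done by thread %d\n", i, omp_get_thread_num());
    }   /* implicit barrier; the thread team disbands at the end of the region */
    return 0;
}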
Back to the example in your question: the phase loop is executed by each thread individually, but the #pragma omp for at each phase iteration marks the start of a shared loop. For n=10, each thread will enter a shared loop 10 times and execute a part of it; so there won't be 40 executions of the inner loops, but just 10.
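If you want to verify this, a quick instrumented sketch like the one below can help. This is my own code, not the book's: the compare-exchange body is replaced by an atomic counter purely for counting. For n=10 it prints 45 (5 even phases of 5 iterations plus 5 odd phases of 4 iterations), the same count as a serial run, no matter how many threads execute the region.

#include <stdio.h>
#include <omp.h>

int main(void) {
    int n = 10;
    int total_iters = 0;   /* counts inner-loop iterations actually executed */
    int i, phase;

    #pragma omp parallel num_threads(4) default(none) shared(n, total_iters) private(i, phase)
    for (phase = 0; phase < n; phase++) {
        if (phase % 2 == 0) {
            #pragma omp for
            for (i = 1; i < n; i += 2) {
                #pragma omp atomic
                total_iters++;          /* stands in for the compare-exchange */
            }
        } else {
            #pragma omp for
            for (i = 1; i < n - 1; i += 2) {
                #pragma omp atomic
                total_iters++;
            }
        }
    }

    /* Prints 45 for n = 10: the work is shared, not repeated per thread. */
    printf("total inner-loop iterations = %d\n", total_iters);
    return 0;
}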
Note there is an implicit barrier at the end of #pragma omp for; it means that a thread that has completed its portion of a shared loop will not proceed until all the other threads have completed their portions too. So the execution is synchronized across the threads. This is necessary for correctness in most cases; e.g. in your example it guarantees that the threads always work on the same phase. However, if consecutive shared loops within a region are safe to execute simultaneously, a nowait clause can be used to eliminate the implicit barrier and allow threads to proceed immediately to the rest of the parallel region.
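As an illustration (my own example, unrelated to the sorting code, where the barrier is required): the two work-sharing loops below only read from a and write to different arrays, so the first one can safely carry a nowait and threads that finish their share of it may start on the second loop right away.

#define N 1000

/* b[] and c[] are computed from a[] only, so the loops are independent. */
void independent_loops(const double *a, double *b, double *c) {
    #pragma omp parallel
    {
        #pragma omp for nowait      /* no barrier: finished threads move on */
        for (int i = 0; i < N; ++i)
            b[i] = 2.0 * a[i];

        #pragma omp for             /* the region's closing barrier still follows */
        for (int i = 0; i < N; ++i)
            c[i] = a[i] + 1.0;
    }
}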
Note also that this treatment of work-sharing directives is quite specific to OpenMP. With other parallel programming frameworks, the logic you used in the question might be correct.
And last, smart OpenMP implementations do not tear down the thread team after a parallel region completes; instead, the threads may busy-wait for some time and then sleep until another parallel region is started. This is done precisely to reduce the overhead at the start and end of parallel regions. So, while the optimization suggested in the book may still remove some overhead, for some algorithms its impact on execution time can be negligible. The algorithm in the question is quite likely one of them: in the first implementation, the parallel regions follow one another quickly within the serial loop, so the OpenMP worker threads will most likely still be active at the beginning of each region and will start it quickly, avoiding the fork/join overhead. So don't be surprised if in practice you see no performance difference from the described optimization.
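If you want to check this on your own machine, a rough timing harness along the following lines will do. This is my own sketch, not the book's program: the //meat placeholders are filled in with the standard compare-exchange of adjacent elements, the array size and the use of rand() are arbitrary, and timing uses omp_get_wtime().

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

/* Version 1: one parallel region (fork/join) per phase. */
static void sort_v1(int a[], int n) {
    int i, phase, tmp;
    for (phase = 0; phase < n; phase++) {
        if (phase % 2 == 0) {
            #pragma omp parallel for default(none) shared(a, n) private(i, tmp)
            for (i = 1; i < n; i += 2)
                if (a[i-1] > a[i]) { tmp = a[i-1]; a[i-1] = a[i]; a[i] = tmp; }
        } else {
            #pragma omp parallel for default(none) shared(a, n) private(i, tmp)
            for (i = 1; i < n - 1; i += 2)
                if (a[i] > a[i+1]) { tmp = a[i]; a[i] = a[i+1]; a[i+1] = tmp; }
        }
    }
}

/* Version 2: a single parallel region with work-sharing loops inside. */
static void sort_v2(int a[], int n) {
    int i, phase, tmp;
    #pragma omp parallel default(none) shared(a, n) private(i, phase, tmp)
    for (phase = 0; phase < n; phase++) {
        if (phase % 2 == 0) {
            #pragma omp for
            for (i = 1; i < n; i += 2)
                if (a[i-1] > a[i]) { tmp = a[i-1]; a[i-1] = a[i]; a[i] = tmp; }
        } else {
            #pragma omp for
            for (i = 1; i < n - 1; i += 2)
                if (a[i] > a[i+1]) { tmp = a[i]; a[i] = a[i+1]; a[i+1] = tmp; }
        }
    }
}

int main(void) {
    int n = 20000;   /* adjust to taste */
    int *a = malloc(n * sizeof *a), *b = malloc(n * sizeof *b);
    for (int i = 0; i < n; i++) a[i] = b[i] = rand();

    double t = omp_get_wtime();
    sort_v1(a, n);
    printf("per-phase regions: %.3f s\n", omp_get_wtime() - t);

    t = omp_get_wtime();
    sort_v2(b, n);
    printf("single region:     %.3f s\n", omp_get_wtime() - t);

    free(a); free(b);
    return 0;
}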