In my Fortran 95 code I have a series of nested DO loops, the whole of which takes a significant amount of time to compute, so I wanted to add parallel functionality with OpenMP (using gfortran -fopenmp
to compile/build).
There is one main DO loop, running 1000 times.
There is a sub DO loop within this, running 100 times.
Several other DO loops are nested within this, the number of iterations increases with each iteration of the DO loop (Once the first time, up to 1000 for the last time).
Example:
DO a = 1, 1000
DO b = 1, 100
DO c = 1, d
some calculations
END DO
DO c = 1, d
some calculations
END DO
DO c = 1, d
some calculations
END DO
END DO
d = d + 1
END DO
Some of the nested DO loops have to be run in serial as they contain dependencies within themselves (that is to say, each iteration of the loop has a calculation that includes a value from the previous iteration), and cannot easily be parallelised in this instance.
I can easily make the loops without any dependencies run in parallel, as below:
d = 1
DO a = 1, 1000
DO b = 1, 100
DO c = 1, d
some calculations with dependencies
END DO
!$OMP PARALLEL
!$OMP DO
DO c = 1, d
some calculations without dependencies
END DO
!$OMP END DO
!$OMP END PARALLEL
DO c = 1, d
some calculations with dependencies
END DO
END DO
d = d + 1
END DO
However I understand that there is significant overhead with opening and closing the parallel threads, given that this occurs so many times within the loops. The code runs significantly slower than it previously did when run sequentially.
Following this, I figured that it made sense to open and close the parallel code either side of the main loop (therefore only applying the overhead once), and setting the number of threads to either 1 or 8 to control whether sections are run sequentially or in parallel, as below:
d = 1
CALL omp_set_num_threads(1)
!$OMP PARALLEL
DO a = 1, 1000
DO b = 1, 100
DO c = 1, d
some calculations with dependencies
END DO
CALL omp_set_num_threads(4)
!$OMP DO
DO c = 1, d
some calculations without dependencies
END DO
!$OMP END DO
CALL omp_set_num_threads(1)
DO c = 1, d
some calculations with dependencies
END DO
END DO
d = d + 1
END DO
!$OMP END PARALLEL
However, when I set this to run I don't get the speedup that I was expecting from running parallel code. I expect the first few to be slower to account for the overhead, but after a while I expect the parallel code to run faster than the sequential code, which it doesn't. I compared how fast each iteration of the main DO loop ran, for DO a = 1, 50
, results below:
Iteration Serial Parallel
1 3.8125 4.0781
2 5.5781 5.9843
3 7.4375 7.9218
4 9.2656 9.7500
...
48 89.0625 94.9531
49 91.0937 97.3281
50 92.6406 99.6093
My first thought is that I'm somehow not setting the number of threads correctly.
Questions:
There is indeed something that is obviously wrong: you have removed any parallelism out of your code. Before creating the outermost parallel region, you defined its size to be of one thread. Therefore, only one single thread will be created to handle whatever code is inside this region. Subsequently using omp_set_num_threads(4)
won't change that. This call merely says that whichever next parallel
directive will create 4 threads (unless explicitly requested otherwise). But there's no such new parallel
directive, which would have been here nested within the current one. You only have a work-sharing do
directive which applied on the current enclosing parallel
region of one unique thread.
There are two ways of addressing your issue:
Keeping your code as it was: although formally, you will fork and join your threads upon entry and exit of the parallel
region, the OpenMP standard doesn't request that the threads are created and destroyed. Actually, it even encourages that the threads are kept alive to reduce the overhead of the parallel
directive, which is done by most OpenMP run-time libraries. Therefore, the payload of such a simple approach of the problem isn't too big.
Using your second approach of pushing the parallel
directive outside of the outermost loop, but creating as many threads as you'll need for the work-sharing (4 here I believe). Then, you enclose whatever has to be sequential within your parallel
region with a single
directive. This will ensure that no unwanted interaction with the extra threads will happen (implicit barrier and flushing of shared variable upon exit) while avoiding the parallelism where you don't want it.
This last version would look like this:
d = 1
!$omp parallel num_threads( 4 ) private( a, b, c ) firstprivate( d )
do a = 1, 1000
do b = 1, 100
!$omp single
do c = 1, d
some calculations with dependencies
end do
!$omp end single
!$omp do
do c = 1, d
some calculations without dependencies
end do
!$omp end do
!$omp single
do c = 1, d
some calculations with dependencies
end do
!$omp end single
end do
d = d + 1
end do
!$omp end parallel
Now whether this version would be actually faster compared to the naive one, it is up to you to test.
A last remark though: since there are quite a lot of sequential parts in your code, don't expect too much speed-up anyway. Amdahl's law is forever.
!$omp master
- !$omp end master
directives to reduce the execution to a single thread. Add a !$omp barrier
if you can run this block only once all other threads are done.If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With