
Controlling Number of Threads in Parallel Loops & Reducing Overhead

In my Fortran 95 code I have a series of nested DO loops, the whole of which takes a significant amount of time to compute, so I wanted to add parallel functionality with OpenMP (using gfortran -fopenmp to compile/build).

There is one main DO loop, running 1000 times.

There is a sub DO loop within this, running 100 times.

Several other DO loops are nested within this; the number of iterations increases with each iteration of the main DO loop (one iteration the first time, up to 1000 for the last).

Example:

d = 1
DO a = 1, 1000

    DO b = 1, 100

        DO c = 1, d
            some calculations
        END DO

        DO c = 1, d
            some calculations
        END DO

        DO c = 1, d
            some calculations
        END DO
    END DO
    d = d + 1
END DO

Some of the nested DO loops have to run in serial because they carry dependencies within themselves (that is to say, each iteration of the loop performs a calculation that uses a value from the previous iteration) and cannot easily be parallelised in this instance, as illustrated below.
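
For example, a loop of this shape (a hypothetical illustration, not code from my program) carries such a dependency:

DO c = 2, d
    x(c) = x(c-1) + y(c)  ! each iteration reads the previous iteration's result
END DO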

I can easily make the loops without any dependencies run in parallel, as below:

d = 1
DO a = 1, 1000

    DO b = 1, 100

        DO c = 1, d
            some calculations with dependencies
        END DO
!$OMP PARALLEL
!$OMP DO
        DO c = 1, d
            some calculations without dependencies
        END DO
!$OMP END DO
!$OMP END PARALLEL
        DO c = 1, d
            some calculations with dependencies
        END DO
    END DO
    d = d + 1
END DO

However, I understand that there is significant overhead in opening and closing the parallel threads, given that this occurs so many times within the loops. Indeed, the code now runs significantly slower than it previously did sequentially.

Following this, I figured it made sense to open and close the parallel region on either side of the main loop (so the overhead is only incurred once), setting the number of threads to either 1 or 4 to control whether sections run sequentially or in parallel, as below:

d = 1
CALL omp_set_num_threads(1)
!$OMP PARALLEL
DO a = 1, 1000

    DO b = 1, 100

        DO c = 1, d
            some calculations with dependencies
        END DO
        CALL omp_set_num_threads(4)
!$OMP DO
        DO c = 1, d
            some calculations without dependencies
        END DO
!$OMP END DO
        CALL omp_set_num_threads(1)

        DO c = 1, d
            some calculations with dependencies
        END DO
    END DO
    d = d + 1
END DO
!$OMP END PARALLEL

However, when I run this I don't get the speed-up I was expecting. I expect the first few iterations to be slower, to account for the overhead, but after a while I expect the parallel code to run faster than the sequential code, which it doesn't. I compared how fast each iteration of the main DO loop ran, for DO a = 1, 50; results below:

Iteration    Serial    Parallel
1            3.8125    4.0781              
2            5.5781    5.9843              
3            7.4375    7.9218              
4            9.2656    9.7500              
...                              
48           89.0625   94.9531                
49           91.0937   97.3281                
50           92.6406   99.6093

My first thought is that I'm somehow not setting the number of threads correctly.

Questions:

  1. Is there something obviously wrong with how I've structured the parallel code?
  2. Is there a better way to implement what I've done / want to do?
asked Oct 18 '22 by localghost

2 Answers

There is indeed something obviously wrong: you have removed all parallelism from your code. Before creating the outermost parallel region, you set its size to one thread. Therefore, only one single thread is created to handle whatever code is inside this region. Subsequently calling omp_set_num_threads(4) won't change that. This call merely says that the next parallel directive will create 4 threads (unless explicitly requested otherwise). But there is no such new parallel directive, which would have had to be nested within the current one; you only have a work-sharing do directive, which applies to the current enclosing parallel region of one single thread.
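
To illustrate the mechanism (a standalone sketch, not a fix):

call omp_set_num_threads(1)
!$omp parallel
! the team has one single thread: everything inside runs on that thread
call omp_set_num_threads(4)
! no effect on the current team; it only sizes the next parallel directive
! the loop below is work-shared over the current team of 1 thread,
! hence no actual parallelism
!$omp do
do c = 1, d
    some calculations
end do
!$omp end do
!$omp end parallel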

There are two ways of addressing your issue:

  1. Keeping your code as it was: although formally you fork and join the threads upon entry to and exit from the parallel region, the OpenMP standard doesn't require that the threads actually be created and destroyed each time. It even encourages keeping them alive, to reduce the overhead of the parallel directive, which is what most OpenMP run-time libraries do. The cost of this simple approach therefore isn't too big (see the sketch below).

  2. Using your second approach of pushing the parallel directive outside of the outermost loop, but creating as many threads as you'll need for the work-sharing (4 here, I believe). Then, you enclose whatever has to be sequential within your parallel region in a single directive. This ensures that no unwanted interaction with the extra threads happens (there is an implicit barrier and a flush of shared variables upon exit) while avoiding parallelism where you don't want it.
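
For the first option, the code looks essentially like your original parallel version. Here is a minimal sketch using the combined parallel do directive (as noted above, most run-time libraries keep the worker threads alive between regions, so the fork/join cost is mostly paid on the first one):

d = 1
do a = 1, 1000
    do b = 1, 100
        do c = 1, d
            some calculations with dependencies
        end do
!$omp parallel do
        do c = 1, d
            some calculations without dependencies
        end do
!$omp end parallel do
        do c = 1, d
            some calculations with dependencies
        end do
    end do
    d = d + 1
end do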

The second version would look like this:

d = 1
! a, b and c are private loop indices; d is firstprivate so each thread
! keeps its own copy, all incremented identically at the end of each pass,
! which keeps the loop bounds consistent across threads
!$omp parallel num_threads( 4 ) private( a, b, c ) firstprivate( d )
do a = 1, 1000
    do b = 1, 100
! only one thread executes the sequential part; the implicit barrier at
! end single keeps the other threads in step
!$omp single
        do c = 1, d
            some calculations with dependencies
        end do
!$omp end single
! iterations of the independent loop are shared among the 4 threads
!$omp do
        do c = 1, d
            some calculations without dependencies
        end do
!$omp end do
!$omp single
        do c = 1, d
            some calculations with dependencies
        end do
!$omp end single
    end do
    d = d + 1
end do
!$omp end parallel

Whether this version is actually faster than the naive one is up to you to test.

A last remark though: since there are quite a lot of sequential parts in your code, don't expect too much speed-up anyway. Amdahl's law is forever.
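
For instance, if a fraction p of the run time can be parallelised over N threads, Amdahl's law bounds the speed-up by 1 / ((1 - p) + p/N): with p = 0.5 and N = 4 that is only 1.6, and never more than 2 however many threads you add.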

answered Oct 21 '22 by Gilles


  1. Nothing obviously wrong, but if the serial loops take long, your speed-up will be limited. Parallel computing may require redesigning your algorithms.
  2. Instead of setting the number of threads inside the loop, use the !$omp master / !$omp end master directives to restrict execution to a single thread. Add a !$omp barrier if the block may only run once all other threads are done: unlike single, master has no implied barrier. A sketch follows below.
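
A minimal sketch of that structure, adapted to the loop above (my adaptation, under the same assumptions as the first answer, not code from the original question):

d = 1
!$omp parallel num_threads( 4 ) private( a, b, c ) firstprivate( d )
do a = 1, 1000
    do b = 1, 100
!$omp master
        do c = 1, d
            some calculations with dependencies
        end do
!$omp end master
! master has no implied barrier: wait before the dependent work-sharing loop
!$omp barrier
!$omp do
        do c = 1, d
            some calculations without dependencies
        end do
!$omp end do
! the implicit barrier of end do has already synchronised everyone here
!$omp master
        do c = 1, d
            some calculations with dependencies
        end do
!$omp end master
! make sure the serial block is finished before the next iteration
!$omp barrier
    end do
    d = d + 1
end do
!$omp end parallel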
answered Oct 21 '22 by Pierre de Buyl