I am aware of this and this, but I ask again as the first link is pretty old now, and the second link did not seem to reach a conclusive answer. Has any consensus developed?
My problem is simple:
I have a DO loop whose iterations may be run concurrently. Which method do I use?
Below is code to generate particles on a simple cubic lattice.
Note the difference: x, y, and z have to be arrays in the DO CONCURRENT case, but not in the OpenMP case, where they can be declared PRIVATE.
So do I use DO CONCURRENT (which, as I understand from the links above, uses SIMD):
DO CONCURRENT (i = 1:npart)
  x(i) = MODULO(i-1, npart_edge)
  Rx(i) = space*x(i)
  y(i) = MODULO((i-1)/npart_edge, npart_edge)
  Ry(i) = space*y(i)
  z(i) = (i-1)/npart_face
  Rz(i) = space*z(i)
END DO
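(If the compiler supports the Fortran 2018 LOCAL locality specifier, x, y, and z could stay scalars in the DO CONCURRENT version too; a sketch, assuming F2018 support:)

DO CONCURRENT (i = 1:npart) LOCAL(x, y, z)
  x = MODULO(i-1, npart_edge)
  Rx(i) = space*x
  y = MODULO((i-1)/npart_edge, npart_edge)
  Ry(i) = space*y
  z = (i-1)/npart_face
  Rz(i) = space*z
END DO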
Or do I use OpenMP?
!$OMP PARALLEL DEFAULT(SHARED) PRIVATE(x, y, z)
!$OMP DO
DO i = 1, npart
  x = MODULO(i-1, npart_edge)
  Rx(i) = space*x
  y = MODULO((i-1)/npart_edge, npart_edge)
  Ry(i) = space*y
  z = (i-1)/npart_face
  Rz(i) = space*z
END DO
!$OMP END DO
!$OMP END PARALLEL
My tests:
Placing 64 particles in a box of side 10:
$ ifort -qopenmp -real-size 64 omp.f90
$ ./a.out
CPU time = 6.870000000000001E-003
Real time = 3.600000000000000E-003
$ ifort -real-size 64 concurrent.f90
$ ./a.out
CPU time = 6.699999999999979E-005
Real time = 0.000000000000000E+000
Placing 100000 particles in a box of side 100:
$ ifort -qopenmp -real-size 64 omp.f90
$ ./a.out
CPU time = 8.213300000000000E-002
Real time = 1.280000000000000E-002
$ ifort -real-size 64 concurrent.f90
$ ./a.out
CPU time = 2.385000000000000E-003
Real time = 2.400000000000000E-003
Using the DO CONCURRENT construct seems to give me at least an order of magnitude better performance. This was done on an i7-4790K. Also, the advantage of DO CONCURRENT seems to decrease with increasing problem size.
OpenMP lets a programmer separate a program into serial regions and parallel regions, rather than think in terms of a set of concurrently executing threads.
OpenMP is typically used for loop-level parallelism, but it also supports function-level parallelism. This mechanism is called OpenMP sections. The structure of sections is straightforward and can be useful in many instances. Consider one of the most important algorithms in computer science, the quicksort.
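A minimal sketch of sections applied to quicksort, written in Fortran to match the rest of this post (the quicksort and Lomuto partition below are an illustration of mine, not code from the question). Note that each recursion level opens a new parallel region here, which only nests if nested parallelism is enabled and is wasteful for small subarrays; real codes fall back to a serial sort below some threshold, and OpenMP tasks are usually a better fit for this pattern today:

MODULE qsort_mod
  IMPLICIT NONE
CONTAINS

  ! Lomuto partition: last element as pivot; returns the pivot's final index.
  SUBROUTINE partition(a, lo, hi, p)
    REAL, INTENT(INOUT) :: a(:)
    INTEGER, INTENT(IN)  :: lo, hi
    INTEGER, INTENT(OUT) :: p
    REAL :: pivot, tmp
    INTEGER :: i, j
    pivot = a(hi)
    i = lo - 1
    DO j = lo, hi - 1
      IF (a(j) <= pivot) THEN
        i = i + 1
        tmp = a(i); a(i) = a(j); a(j) = tmp
      END IF
    END DO
    tmp = a(i+1); a(i+1) = a(hi); a(hi) = tmp
    p = i + 1
  END SUBROUTINE partition

  ! The two recursive calls touch disjoint halves of the array,
  ! so each can run in its own section.
  RECURSIVE SUBROUTINE quicksort(a, lo, hi)
    REAL, INTENT(INOUT) :: a(:)
    INTEGER, INTENT(IN) :: lo, hi
    INTEGER :: p
    IF (lo < hi) THEN
      CALL partition(a, lo, hi, p)
      !$OMP PARALLEL SECTIONS
      !$OMP SECTION
      CALL quicksort(a, lo, p-1)
      !$OMP SECTION
      CALL quicksort(a, p+1, hi)
      !$OMP END PARALLEL SECTIONS
    END IF
  END SUBROUTINE quicksort

END MODULE qsort_mod

Usage: CALL quicksort(a, 1, SIZE(a)).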
#pragma omp parallel spawns a group of threads, while #pragma omp for divides loop iterations between the spawned threads. You can do both things at once with the fused #pragma omp parallel for directive.
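The Fortran equivalent of the fused form, as a trivial sketch (the array names are placeholders):

!$OMP PARALLEL DO
DO i = 1, n
  c(i) = a(i) + b(i)
END DO
!$OMP END PARALLEL DO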
When run, an OpenMP program will use one thread (in the sequential sections), and several threads (in the parallel sections). There is one thread that runs from the beginning to the end, and it's called the master thread.
DO CONCURRENT does not do any parallelization per se. The compiler may decide to parallelize it using threads, use SIMD instructions, or even offload to a GPU. For threads you often have to instruct it to do so. For GPU offloading you need a particular compiler with particular options. Or (often!) the compiler just treats DO CONCURRENT as a regular DO and uses SIMD if it would use SIMD for the regular DO.
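For example (the -parallel flag is the one used in the measurements below; -stdpar=gpu is NVIDIA's nvfortran flag for offloading DO CONCURRENT to a GPU, given here only as an illustration):

$ ifort -O3 -parallel concurrent.f90
$ nvfortran -stdpar=gpu concurrent.f90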
OpenMP is also not just threads; the compiler can use SIMD instructions if it wants. There is also the omp simd directive, but that is only a suggestion to the compiler to use SIMD, and it can be ignored.
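A sketch of that directive applied to a loop like yours (the compiler is free to ignore it):

!$OMP SIMD
DO i = 1, npart
  Rx(i) = space*MODULO(i-1, npart_edge)
END DO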
You should try, measure, and see. There is no single definitive answer, not even for a given compiler, much less for all compilers.
If you would not otherwise use OpenMP, I would give DO CONCURRENT a try to see if the automatic parallelizer does a better job with this construct. Chances are good that it will help. If your code is already using OpenMP, I do not see any point in introducing DO CONCURRENT.
My practice is to use OpenMP and to try to make sure the compiler vectorizes (SIMD) whatever it can, especially because I use OpenMP all over my program anyway. DO CONCURRENT still has to prove it is actually useful. I am not convinced yet, though some GPU examples look promising; real codes, however, are often much more complex.
Your specific examples and the performance measurements:
Too little code is given, and there are subtle points in any benchmark. I wrote some simple code around your loops and ran my own tests. I was careful NOT to include thread creation in the timed block, so you should not include !$omp parallel in your timing. I also took the minimum real time over multiple runs, because the first run is sometimes longer (certainly with DO CONCURRENT) and the CPU has various throttle modes and may need some time to spin up. I also added SCHEDULE(STATIC).
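For illustration, a minimal self-contained sketch of that timing scheme (my reconstruction, not the exact benchmark code; the lattice size, nrep, and the barrier placement are assumptions). Compile with ifort -qopenmp:

PROGRAM bench
  USE omp_lib                                     ! for omp_get_wtime
  IMPLICIT NONE
  INTEGER, PARAMETER :: npart_edge = 216          ! assumed lattice edge
  INTEGER, PARAMETER :: npart_face = npart_edge**2
  INTEGER, PARAMETER :: npart = npart_edge**3
  INTEGER, PARAMETER :: nrep = 10                 ! repetitions; keep the minimum
  INTEGER :: i, rep, x, y, z
  REAL(8) :: space, t0, tmin
  REAL(8), ALLOCATABLE :: Rx(:), Ry(:), Rz(:)

  ALLOCATE(Rx(npart), Ry(npart), Rz(npart))
  space = 100.0d0/npart_edge
  tmin  = HUGE(tmin)

  ! Thread creation happens once, here, outside the timed block.
  !$OMP PARALLEL DEFAULT(SHARED) PRIVATE(x, y, z, rep)
  DO rep = 1, nrep
    !$OMP MASTER
    t0 = omp_get_wtime()
    !$OMP END MASTER
    !$OMP BARRIER                      ! t0 is set before any thread starts work
    !$OMP DO SCHEDULE(STATIC)
    DO i = 1, npart
      x = MODULO(i-1, npart_edge)
      Rx(i) = space*x
      y = MODULO((i-1)/npart_edge, npart_edge)
      Ry(i) = space*y
      z = (i-1)/npart_face
      Rz(i) = space*z
    END DO
    !$OMP END DO                       ! implicit barrier: all iterations done
    !$OMP MASTER
    tmin = MIN(tmin, omp_get_wtime() - t0)
    !$OMP END MASTER
  END DO
  !$OMP END PARALLEL

  PRINT *, 'minimum real time =', tmin
END PROGRAM bench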
npart=10000000:
ifort -O3 concurrent.f90           : 6.117300000000000E-002
ifort -O3 concurrent.f90 -parallel : 5.044600000000000E-002
ifort -O3 concurrent_omp.f90       : 2.419600000000000E-002

npart=10000, default 8 threads (hyper-threading):
ifort -O3 concurrent.f90           : 5.430000000000000E-004
ifort -O3 concurrent.f90 -parallel : 8.899999999999999E-005
ifort -O3 concurrent_omp.f90       : 1.890000000000000E-004

npart=10000, OMP_NUM_THREADS=4 (ignore hyper-threading):
ifort -O3 concurrent.f90           : 5.410000000000000E-004
ifort -O3 concurrent.f90 -parallel : 9.200000000000000E-005
ifort -O3 concurrent_omp.f90       : 1.070000000000000E-004
Here, DO CONCURRENT seems to be somewhat faster for the small case, but not by much once we make sure to use the right number of cores. It is clearly slower for the big case. The -parallel option is clearly necessary for the automatic parallelization of DO CONCURRENT.