I am aware of this and this, but I ask again as the first link is pretty old now, and the second link did not seem to reach a conclusive answer. Has any consensus developed?
My problem is simple:
I have a DO loop whose iterations may be run concurrently. Which method do I use?
Below is code to generate particles on a simple cubic lattice.
Note the difference: x, y, and z have to be arrays in the DO CONCURRENT case, but not in the OpenMP case, where they can be declared PRIVATE.
So do I use DO CONCURRENT (which, as I understand from the links above, uses SIMD):
DO CONCURRENT (i = 1:npart)
  x(i) = MODULO(i-1, npart_edge)
  Rx(i) = space*x(i)
  y(i) = MODULO((i-1)/npart_edge, npart_edge)
  Ry(i) = space*y(i)
  z(i) = (i-1)/npart_face
  Rz(i) = space*z(i)
END DO
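(If the compiler supports the Fortran 2018 LOCAL locality specifier, x, y, and z could stay scalars in the DO CONCURRENT version too; a sketch, assuming F2018 support:)

DO CONCURRENT (i = 1:npart) LOCAL(x, y, z)
  x = MODULO(i-1, npart_edge)
  Rx(i) = space*x
  y = MODULO((i-1)/npart_edge, npart_edge)
  Ry(i) = space*y
  z = (i-1)/npart_face
  Rz(i) = space*z
END DO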
Or do I use OpenMP?
!$OMP PARALLEL DEFAULT(SHARED) PRIVATE(x, y, z)
!$OMP DO
DO i = 1, npart
  x = MODULO(i-1, npart_edge)
  Rx(i) = space*x
  y = MODULO((i-1)/npart_edge, npart_edge)
  Ry(i) = space*y
  z = (i-1)/npart_face
  Rz(i) = space*z
END DO
!$OMP END DO
!$OMP END PARALLEL
My tests:
Placing 64 particles in a box of side 10:
$ ifort -qopenmp -real-size 64 omp.f90
$ ./a.out
CPU time = 6.870000000000001E-003
Real time = 3.600000000000000E-003
$ ifort -real-size 64 concurrent.f90
$ ./a.out
CPU time = 6.699999999999979E-005
Real time = 0.000000000000000E+000
Placing 100000 particles in a box of side 100:
$ ifort -qopenmp -real-size 64 omp.f90
$ ./a.out
CPU time = 8.213300000000000E-002
Real time = 1.280000000000000E-002
$ ifort -real-size 64 concurrent.f90
$ ./a.out
CPU time = 2.385000000000000E-003
Real time = 2.400000000000000E-003
Using the DO CONCURRENT construct seems to give me at least an order of magnitude better performance. This was done on an i7-4790K. Also, the advantage of DO CONCURRENT seems to decrease with increasing problem size.
OpenMP lets a programmer separate a program into serial regions and parallel regions, rather than think in terms of a set of concurrently executing threads.
OpenMP is typically used for loop-level parallelism, but it also supports function-level parallelism. This mechanism is called OpenMP sections. The structure of sections is straightforward and can be useful in many instances. Consider one of the most important algorithms in computer science, the quicksort.
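A minimal sketch of sections applied to quicksort, written in Fortran to match the rest of this post (the quicksort and Lomuto partition below are an illustration of mine, not code from the question). Note that each recursion level opens a new parallel region here, which only nests if nested parallelism is enabled and is wasteful for small subarrays; real codes fall back to a serial sort below some threshold, and OpenMP tasks are usually a better fit for this pattern today:

MODULE qsort_mod
  IMPLICIT NONE
CONTAINS

  ! Lomuto partition: last element as pivot; returns the pivot's final index.
  SUBROUTINE partition(a, lo, hi, p)
    REAL, INTENT(INOUT) :: a(:)
    INTEGER, INTENT(IN)  :: lo, hi
    INTEGER, INTENT(OUT) :: p
    REAL :: pivot, tmp
    INTEGER :: i, j
    pivot = a(hi)
    i = lo - 1
    DO j = lo, hi - 1
      IF (a(j) <= pivot) THEN
        i = i + 1
        tmp = a(i); a(i) = a(j); a(j) = tmp
      END IF
    END DO
    tmp = a(i+1); a(i+1) = a(hi); a(hi) = tmp
    p = i + 1
  END SUBROUTINE partition

  ! The two recursive calls touch disjoint halves of the array,
  ! so each can run in its own section.
  RECURSIVE SUBROUTINE quicksort(a, lo, hi)
    REAL, INTENT(INOUT) :: a(:)
    INTEGER, INTENT(IN) :: lo, hi
    INTEGER :: p
    IF (lo < hi) THEN
      CALL partition(a, lo, hi, p)
      !$OMP PARALLEL SECTIONS
      !$OMP SECTION
      CALL quicksort(a, lo, p-1)
      !$OMP SECTION
      CALL quicksort(a, p+1, hi)
      !$OMP END PARALLEL SECTIONS
    END IF
  END SUBROUTINE quicksort

END MODULE qsort_mod

Usage: CALL quicksort(a, 1, SIZE(a)).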
#pragma omp parallel spawns a group of threads, while #pragma omp for divides loop iterations between the spawned threads. You can do both things at once with the fused #pragma omp parallel for directive.
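The Fortran equivalent of the fused form, as a trivial sketch (the array names are placeholders):

!$OMP PARALLEL DO
DO i = 1, n
  c(i) = a(i) + b(i)
END DO
!$OMP END PARALLEL DO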
When run, an OpenMP program will use one thread (in the sequential sections), and several threads (in the parallel sections). There is one thread that runs from the beginning to the end, and it's called the master thread.
DO CONCURRENT does not do any parallelization per se. The compiler may decide to parallelize it using threads, use SIMD instructions, or even offload to a GPU. For threads you often have to instruct it to do so. For GPU offloading you need a particular compiler with particular options. Or (often!) the compiler just treats DO CONCURRENT as a regular DO and uses SIMD if it would use SIMD for the regular DO.
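For example (the -parallel flag is the one used in the measurements below; -stdpar=gpu is NVIDIA's nvfortran flag for offloading DO CONCURRENT to a GPU, given here only as an illustration):

$ ifort -O3 -parallel concurrent.f90
$ nvfortran -stdpar=gpu concurrent.f90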
OpenMP is also not just threads; the compiler can use SIMD instructions if it wants. There is also the omp simd directive, but that is only a suggestion to the compiler to use SIMD, and it can be ignored.
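A sketch of that directive applied to a loop like yours (the compiler is free to ignore it):

!$OMP SIMD
DO i = 1, npart
  Rx(i) = space*MODULO(i-1, npart_edge)
END DO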
You should try, measure, and see. There is no single definitive answer, not even for a given compiler, much less for all compilers.
If you would not otherwise use OpenMP, I would give DO CONCURRENT a try to see if the automatic parallelizer does a better job with this construct. Chances are good that it will help. If your code is already using OpenMP, I do not see any point in introducing DO CONCURRENT.
My practice is to use OpenMP and to try to make sure the compiler vectorizes (SIMD) whatever it can, especially because I use OpenMP all over my program anyway. DO CONCURRENT still has to prove it is actually useful. I am not convinced yet, though some GPU examples look promising; real codes, however, are often much more complex.
Your specific examples and the performance measurements:
Too little code is given, and there are subtle points in any benchmark. I wrote some simple code around your loops and ran my own tests. I was careful NOT to include thread creation in the timed block, so you should not include !$omp parallel in your timing. I also took the minimum real time over multiple runs, because the first run is sometimes longer (certainly with DO CONCURRENT) and the CPU has various throttle modes and may need some time to spin up. I also added SCHEDULE(STATIC).
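For illustration, a minimal self-contained sketch of that timing scheme (my reconstruction, not the exact benchmark code; the lattice size, nrep, and the barrier placement are assumptions). Compile with ifort -qopenmp:

PROGRAM bench
  USE omp_lib                                     ! for omp_get_wtime
  IMPLICIT NONE
  INTEGER, PARAMETER :: npart_edge = 216          ! assumed lattice edge
  INTEGER, PARAMETER :: npart_face = npart_edge**2
  INTEGER, PARAMETER :: npart = npart_edge**3
  INTEGER, PARAMETER :: nrep = 10                 ! repetitions; keep the minimum
  INTEGER :: i, rep, x, y, z
  REAL(8) :: space, t0, tmin
  REAL(8), ALLOCATABLE :: Rx(:), Ry(:), Rz(:)

  ALLOCATE(Rx(npart), Ry(npart), Rz(npart))
  space = 100.0d0/npart_edge
  tmin  = HUGE(tmin)

  ! Thread creation happens once, here, outside the timed block.
  !$OMP PARALLEL DEFAULT(SHARED) PRIVATE(x, y, z, rep)
  DO rep = 1, nrep
    !$OMP MASTER
    t0 = omp_get_wtime()
    !$OMP END MASTER
    !$OMP BARRIER                      ! t0 is set before any thread starts work
    !$OMP DO SCHEDULE(STATIC)
    DO i = 1, npart
      x = MODULO(i-1, npart_edge)
      Rx(i) = space*x
      y = MODULO((i-1)/npart_edge, npart_edge)
      Ry(i) = space*y
      z = (i-1)/npart_face
      Rz(i) = space*z
    END DO
    !$OMP END DO                       ! implicit barrier: all iterations done
    !$OMP MASTER
    tmin = MIN(tmin, omp_get_wtime() - t0)
    !$OMP END MASTER
  END DO
  !$OMP END PARALLEL

  PRINT *, 'minimum real time =', tmin
END PROGRAM bench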
npart=10000000:
ifort -O3 concurrent.f90           : 6.117300000000000E-002
ifort -O3 concurrent.f90 -parallel : 5.044600000000000E-002
ifort -O3 concurrent_omp.f90       : 2.419600000000000E-002

npart=10000, default 8 threads (hyper-threading):
ifort -O3 concurrent.f90           : 5.430000000000000E-004
ifort -O3 concurrent.f90 -parallel : 8.899999999999999E-005
ifort -O3 concurrent_omp.f90       : 1.890000000000000E-004

npart=10000, OMP_NUM_THREADS=4 (ignore hyper-threading):
ifort -O3 concurrent.f90           : 5.410000000000000E-004
ifort -O3 concurrent.f90 -parallel : 9.200000000000000E-005
ifort -O3 concurrent_omp.f90       : 1.070000000000000E-004
Here, DO CONCURRENT seems to be somewhat faster for the small case, but not by much once we make sure to use the right number of cores. It is clearly slower for the big case. The -parallel option is clearly necessary for the automatic parallelization of DO CONCURRENT.