I have a relatively simple loop where I'm calculating the net acceleration of a system of particles using a brute-force method.
I have a working OpenMP loop which loops over each particle and compares it to every other particle, giving O(n^2) complexity:
!$omp parallel do private(i) shared(bodyArray, n) default(none)
do i = 1, n
   !acc is real(real64), dimension(3)
   bodyArray(i)%acc = bodyArray(i)%calcNetAcc(i, bodyArray)
end do
which works just fine.
What I'm trying to do now is to reduce my calculation time by computing the force for each pair of bodies only once, using the fact that F(a->b) = -F(b->a), which halves the number of interactions to calculate (n^2 / 2). I do that in this loop:
call clearAcceleration(bodyArray) !zero out acceleration
!$omp parallel do private(i, j) shared(bodyArray, n) default(none)
do i = 1, n
   do j = i, n
      if ( i /= j .and. j > i) then
         bodyArray(i)%acc = bodyArray(i)%acc + bodyArray(i)%accTo(bodyArray(j))
         bodyArray(j)%acc = bodyArray(j)%acc - bodyArray(i)%acc
      end if
   end do
end do
But I'm having a lot of difficulty parallelizing this loop; I keep getting junk results. I think it has to do with this line:
bodyArray(j)%acc = bodyArray(j)%acc - bodyArray(i)%acc
and that the forces are not being added up properly, with all the different 'j' iterations writing to it. I've tried using the atomic statement, but that's not allowed on array variables. So then I tried critical, but that increases the run time by about a factor of 20 and still doesn't give correct results. I also tried adding an ordered statement, but then I just get NaN for all my results. Is there an easy fix to get this loop working with OpenMP?
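(For reference, one way a race like this is often handled, written here only as an untested sketch and not what I ended up doing below, is to store the pairwise term in a temporary and protect each scalar component with !$omp atomic, since atomic works on scalar updates rather than whole-array assignments. Using the temporary also means the subtraction uses the pair term instead of the running total. dacc is just an illustrative name, real(real64), dimension(3), and k is an extra integer loop index.)

!$omp parallel do private(i, j, k, dacc) shared(bodyArray, n) default(none)
do i = 1, n
   do j = i+1, n
      dacc = bodyArray(i)%accTo(bodyArray(j))   ! pairwise term, computed once
      do k = 1, 3
         !$omp atomic
         bodyArray(i)%acc(k) = bodyArray(i)%acc(k) + dacc(k)
         !$omp atomic
         bodyArray(j)%acc(k) = bodyArray(j)%acc(k) - dacc(k)
      end do
   end do
end do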
Working code below; it has a slight speed improvement, but not the ~2x I was looking for.
!$omp parallel do private(i, j) shared(bodyArray, forces, n) default(none) schedule(guided)
do i = 1, n
   do j = 1, i-1
      forces(j, i)%vec = bodyArray(i)%accTo(bodyArray(j))
      forces(i, j)%vec = -forces(j, i)%vec
   end do
end do

!$omp parallel do private(i, j) shared(bodyArray, n, forces) schedule(static)
do i = 1, n
   do j = 1, n
      bodyArray(i)%acc = bodyArray(i)%acc + forces(j, i)%vec
   end do
end do
With your current approach and data structures you're going to struggle to get good speedup with OpenMP. Consider the loop nest
!$omp parallel do private(i, j) shared(bodyArray, n) default(none)
do i = 1, n
   do j = i, n
      if ( i /= j .and. j > i) then
         bodyArray(i)%acc = bodyArray(i)%acc + bodyArray(i)%accTo(bodyArray(j))
         bodyArray(j)%acc = bodyArray(j)%acc - bodyArray(i)%acc
      end if
   end do
end do
[Actually, before you consider it, revise it as follows ...
!$omp parallel do private(i, j) shared(bodyArray, n) default(none)
do i = 1, n
   do j = i+1, n
      bodyArray(i)%acc = bodyArray(i)%acc + bodyArray(i)%accTo(bodyArray(j))
      bodyArray(j)%acc = bodyArray(j)%acc - bodyArray(i)%acc
   end do
end do
..., now back to the issues]
There are two problems here:

1. The update to bodyArray(j)%acc: multiple threads will try to update the same element and there is no coordination of those updates. Junk results. Using critical sections or ordering the statements serialises the code; when you get it right you also get it as slow as it was before you started with OpenMP.

2. The pattern of access to bodyArray is cache-unfriendly. It wouldn't surprise me to find that, even if you address the data race without serialising the computation, the impact of the cache-unfriendliness is to produce code slower than the original. Modern CPUs are crazy-fast in computation but the memory systems struggle to feed the beasts, so cache effects can be massive. Trying to run two loops over the same rank-1 array simultaneously, which is in essence what your code does, is never (?) going to shift data through cache at maximum speed.

Personally I would try the following. I'm not going to guarantee that this will be faster, but it will be easier (I think) than fixing your current approach and it fits OpenMP like a glove. I do have a nagging doubt that this is overcomplicating matters, but I haven't had a better idea yet.
First, create a 2D array of reals, call it forces, where element forces(i,j) is the force that element i exerts on j. Then, some code like this (untested, that's your responsibility if you care to follow this line):
forces = 0.0 ! Parallelise this if you want to
!$omp parallel do private(i, j) shared(bodyArray, forces, n) default(none)
do i = 1, n
   do j = 1, i-1
      forces(i,j) = bodyArray(i)%accTo(bodyArray(j)) ! if I understand correctly
   end do
end do
then sum the forces on each particle (and get the following right, I haven't checked carefully; in particular, as sketched only one triangle of forces has been filled, so the mirror contributions forces(j,i) = -forces(i,j) still have to be stored or added in as well):
!$omp parallel do private(i) shared(bodyArray, forces, n) default(none)
do i = 1, n
   bodyArray(i)%acc = sum(forces(:,i))
end do
As I wrote above, computation is extremely fast and if you have the memory to spare it's often well worth trading some space for time.
Now what you have is, probably, a problem with load balancing in the loop nest over forces. Most OpenMP implementations will, by default, perform a static distribution of work (this is not required by the standard but seems to be most common; check your documentation). So thread 1 will get the first n/num_threads rows to deal with, but these are the itty-bitty little rows at the top of the triangle you're computing. Thread 2 will get more work, thread 3 still more, and so forth. You might get away with simply adding a schedule(dynamic) clause to the parallel directive, you might have to work a bit harder to balance the load.
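For example, something like this (again untested; the chunk size of 16 is only an illustrative guess to be tuned, and schedule(guided) is another reasonable choice):

! same triangular fill as above, with a dynamic schedule so the long rows
! near the bottom of the triangle get spread across threads
!$omp parallel do private(i, j) shared(bodyArray, forces, n) default(none) schedule(dynamic, 16)
do i = 1, n
   do j = 1, i-1
      forces(i,j) = bodyArray(i)%accTo(bodyArray(j))
   end do
end do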
You may also want to review my code snippets wrt cache-friendliness and adjust as appropriate. And you may well find, if you do as I suggest, that you were better off with your original code, that halving the amount of computation doesn't actually save much time.
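If the n x n forces array itself turns out to be the problem, a further possibility, sketched below and untested, is to let OpenMP do the reduction: Fortran permits whole arrays in a reduction clause (given a reasonably recent OpenMP implementation), so a temporary rank-2 acceleration array keeps the halved interaction count without the n x n storage. acc_tmp and dacc are illustrative names only.

! untested sketch: acc_tmp(:,i) accumulates the net acceleration of body i,
! with OpenMP giving each thread its own copy and summing the copies at the end
real(real64), allocatable :: acc_tmp(:,:)
real(real64) :: dacc(3)

allocate(acc_tmp(3, n))
acc_tmp = 0.0_real64

!$omp parallel do private(i, j, dacc) shared(bodyArray, n) reduction(+:acc_tmp) default(none) schedule(dynamic)
do i = 1, n
   do j = i+1, n
      dacc = bodyArray(i)%accTo(bodyArray(j))
      acc_tmp(:, i) = acc_tmp(:, i) + dacc
      acc_tmp(:, j) = acc_tmp(:, j) - dacc
   end do
end do

do i = 1, n
   bodyArray(i)%acc = acc_tmp(:, i)
end do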
Another approach would be to pack the lower (or upper) triangle of forces into a rank-1 array and use some fancy indexing arithmetic to transform 2D (i,j) indices into a 1D index into that array. This would save storage space, and might be easier to make cache-friendly.
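For the strict lower triangle (i > j) the index arithmetic might look something like this (untested; packed_index and packed_forces are illustrative names):

! rows 2..i-1 contribute (i-1)*(i-2)/2 entries, so for i > j >= 1:
pure integer function packed_index(i, j)
   integer, intent(in) :: i, j
   packed_index = (i - 1) * (i - 2) / 2 + j
end function packed_index

! the packed storage then needs n*(n-1)/2 elements, e.g.
! real(real64), dimension(n*(n-1)/2) :: packed_forces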