I have a relatively simple loop where I'm calculating the net acceleration of a system of particles using a brute-force method.
I have a working OpenMP loop which loops over each particle and compares it to every other particle, giving O(n^2) complexity:
!$omp parallel do private(i) shared(bodyArray, n) default(none)
do i = 1, n
   !acc is real(real64), dimension(3)
   bodyArray(i)%acc = bodyArray(i)%calcNetAcc(i, bodyArray)
end do
which works just fine.
What I'm trying to do now is to reduce my calculation time by computing the force for each pair of bodies only once, using the fact that F(a->b) = -F(b->a), which halves the number of interactions to calculate (n^2 / 2). I do that in this loop:
call clearAcceleration(bodyArray) !zero out acceleration
!$omp parallel do private(i, j) shared(bodyArray, n) default(none)
do i = 1, n
   do j = i, n
      if ( i /= j .and. j > i) then
         bodyArray(i)%acc = bodyArray(i)%acc + bodyArray(i)%accTo(bodyArray(j))
         bodyArray(j)%acc = bodyArray(j)%acc - bodyArray(i)%acc
      end if
   end do
end do
But I'm having a lot of difficulty parallelizing this loop; I keep getting junk results. I think it has to do with this line:
bodyArray(j)%acc = bodyArray(j)%acc - bodyArray(i)%acc
and that the forces are not being added up properly, with all the different 'j' iterations writing to it. I've tried using the atomic statement, but that's not allowed on array variables. So then I tried critical, but that increases the run time by about a factor of 20 and still doesn't give correct results. I also tried adding an ordered statement, but then I just get NaN for all my results. Is there an easy fix to get this loop working with OpenMP?
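(For reference, one way a race like this is often handled, written here only as an untested sketch and not what I ended up doing below, is to store the pairwise term in a temporary and protect each scalar component with !$omp atomic, since atomic works on scalar updates rather than whole-array assignments. Using the temporary also means the subtraction uses the pair term instead of the running total. dacc is just an illustrative name, real(real64), dimension(3), and k is an extra integer loop index.)

!$omp parallel do private(i, j, k, dacc) shared(bodyArray, n) default(none)
do i = 1, n
   do j = i+1, n
      dacc = bodyArray(i)%accTo(bodyArray(j))   ! pairwise term, computed once
      do k = 1, 3
         !$omp atomic
         bodyArray(i)%acc(k) = bodyArray(i)%acc(k) + dacc(k)
         !$omp atomic
         bodyArray(j)%acc(k) = bodyArray(j)%acc(k) - dacc(k)
      end do
   end do
end do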
Working code below; it has a slight speed improvement, but not the ~2x I was looking for.
!$omp parallel do private(i, j) shared(bodyArray, forces, n) default(none) schedule(guided)
do i = 1, n
   do j = 1, i-1
      forces(j, i)%vec = bodyArray(i)%accTo(bodyArray(j))
      forces(i, j)%vec = -forces(j, i)%vec
   end do
end do

!$omp parallel do private(i, j) shared(bodyArray, n, forces) schedule(static)
do i = 1, n
   do j = 1, n
      bodyArray(i)%acc = bodyArray(i)%acc + forces(j, i)%vec
   end do
end do
With your current approach and data structures you're going to struggle to get good speedup with OpenMP. Consider the loop nest
!$omp parallel do private(i, j) shared(bodyArray, n) default(none)
do i = 1, n
   do j = i, n
      if ( i /= j .and. j > i) then
         bodyArray(i)%acc = bodyArray(i)%acc + bodyArray(i)%accTo(bodyArray(j))
         bodyArray(j)%acc = bodyArray(j)%acc - bodyArray(i)%acc
      end if
   end do
end do
[Actually, before you consider it, revise it as follows ...
!$omp parallel do private(i, j) shared(bodyArray, n) default(none)
do i = 1, n
   do j = i+1, n
      bodyArray(i)%acc = bodyArray(i)%acc + bodyArray(i)%accTo(bodyArray(j))
      bodyArray(j)%acc = bodyArray(j)%acc - bodyArray(i)%acc
   end do
end do
..., now back to the issues]
There are two problems here:

1. The update to bodyArray(j)%acc: multiple threads will try to update the same element and there is no coordination of those updates. Junk results. Using critical sections or ordering the statements serialises the code; when you get it right you also get it as slow as it was before you started with OpenMP.

2. The pattern of access to bodyArray is cache-unfriendly. It wouldn't surprise me to find that, even if you address the data race without serialising the computation, the impact of the cache-unfriendliness is to produce code slower than the original. Modern CPUs are crazy-fast in computation but the memory systems struggle to feed the beasts, so cache effects can be massive. Trying to run two loops over the same rank-1 array simultaneously, which is in essence what your code does, is never (?) going to shift data through cache at maximum speed.

Personally I would try the following. I'm not going to guarantee that this will be faster, but it will be easier (I think) than fixing your current approach and it fits OpenMP like a glove. I do have a nagging doubt that this is overcomplicating matters, but I haven't had a better idea yet.
First, create a 2D array of reals, call it forces, where element forces(i,j) is the force that element i exerts on j. Then, some code like this (untested, that's your responsibility if you care to follow this line):
forces = 0.0 ! Parallelise this if you want to
!$omp parallel do private(i, j) shared(bodyArray, forces, n) default(none)
do i = 1, n
   do j = 1, i-1
      forces(i,j) = bodyArray(i)%accTo(bodyArray(j)) ! if I understand correctly
   end do
end do
then sum the forces on each particle (and get the following right, I haven't checked carefully; in particular, as sketched only one triangle of forces has been filled, so the mirror contributions forces(j,i) = -forces(i,j) still have to be stored or added in as well):
!$omp parallel do private(i) shared(bodyArray, forces, n) default(none)
do i = 1, n
   bodyArray(i)%acc = sum(forces(:,i))
end do
As I wrote above, computation is extremely fast and if you have the memory to spare it's often well worth trading some space for time.
Now what you have is, probably, a problem with load balancing in the loop nest over forces. Most OpenMP implementations will, by default, perform a static distribution of work (this is not required by the standard but seems to be most common; check your documentation). So thread 1 will get the first n/num_threads rows to deal with, but these are the itty-bitty little rows at the top of the triangle you're computing. Thread 2 will get more work, thread 3 still more, and so forth. You might get away with simply adding a schedule(dynamic) clause to the parallel directive, you might have to work a bit harder to balance the load.
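For example, something like this (again untested; the chunk size of 16 is only an illustrative guess to be tuned, and schedule(guided) is another reasonable choice):

! same triangular fill as above, with a dynamic schedule so the long rows
! near the bottom of the triangle get spread across threads
!$omp parallel do private(i, j) shared(bodyArray, forces, n) default(none) schedule(dynamic, 16)
do i = 1, n
   do j = 1, i-1
      forces(i,j) = bodyArray(i)%accTo(bodyArray(j))
   end do
end do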
You may also want to review my code snippets wrt cache-friendliness and adjust as appropriate. And you may well find, if you do as I suggest, that you were better off with your original code, that halving the amount of computation doesn't actually save much time.
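If the n x n forces array itself turns out to be the problem, a further possibility, sketched below and untested, is to let OpenMP do the reduction: Fortran permits whole arrays in a reduction clause (given a reasonably recent OpenMP implementation), so a temporary rank-2 acceleration array keeps the halved interaction count without the n x n storage. acc_tmp and dacc are illustrative names only.

! untested sketch: acc_tmp(:,i) accumulates the net acceleration of body i,
! with OpenMP giving each thread its own copy and summing the copies at the end
real(real64), allocatable :: acc_tmp(:,:)
real(real64) :: dacc(3)

allocate(acc_tmp(3, n))
acc_tmp = 0.0_real64

!$omp parallel do private(i, j, dacc) shared(bodyArray, n) reduction(+:acc_tmp) default(none) schedule(dynamic)
do i = 1, n
   do j = i+1, n
      dacc = bodyArray(i)%accTo(bodyArray(j))
      acc_tmp(:, i) = acc_tmp(:, i) + dacc
      acc_tmp(:, j) = acc_tmp(:, j) - dacc
   end do
end do

do i = 1, n
   bodyArray(i)%acc = acc_tmp(:, i)
end do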
Another approach would be to pack the lower (or upper) triangle of forces into a rank-1 array and use some fancy indexing arithmetic to transform 2D (i,j) indices into a 1D index into that array. This would save storage space, and might be easier to make cache-friendly.
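For the strict lower triangle (i > j) the index arithmetic might look something like this (untested; packed_index and packed_forces are illustrative names):

! rows 2..i-1 contribute (i-1)*(i-2)/2 entries, so for i > j >= 1:
pure integer function packed_index(i, j)
   integer, intent(in) :: i, j
   packed_index = (i - 1) * (i - 2) / 2 + j
end function packed_index

! the packed storage then needs n*(n-1)/2 elements, e.g.
! real(real64), dimension(n*(n-1)/2) :: packed_forces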