 

N-body algorithm: why is this slower in parallel?

I put together a sample program that mimics the type of data structure I am dealing with: I have n objects, and I need to iterate over each possible pair once and perform a (symmetric) calculation. This operation involves writing data to both elements of the pair. In serial, this would take the form of a loop like this:

for(int i = 0; i < N-1; ++i)
   for(int j = i + 1; j < N; ++j)
      ...

However, it did not take much searching on the Internet to find a "cache-oblivious parallel implementation", which I wrote up and have reproduced below. I've linked here a post (which uses Intel TBB) that describes this algorithm in detail:

https://software.intel.com/en-us/blogs/2010/07/01/n-bodies-a-parallel-tbb-solution-parallel-code-balanced-recursive-parallelism-with-parallel_invoke

I tried using OpenMP tasks to do the same thing, and it always runs slower than the serial counterpart (built by simply compiling without -fopenmp). I compile it with g++ -Wall -std=c++11 -O3 test.cpp -o test. The same is observed with or without -O3; the serial version is always faster.

To add a bit more information, in my real application, there are typically between a few hundred and a few thousand elements (variable n in the example below) that I need to loop over in this pair-wise fashion many times. Millions of times. My attempt below tries to simulate that (though I only tried looping 10-100k times).

I've timed this very crudely using time ./test, simply because there's so much of a difference. Yes, I know my example is poorly written and that I am including the time required to create the vector. But the serial run takes ~30 seconds and the parallel run over a minute, so I don't think I need to do anything more rigorous just yet.

My question is: why does the serial version do better? Have I done something incorrect in OpenMP? How do I properly correct my mistake(s)? Have I misused tasks? I have a feeling that the recursive tasking has something to do with it; I tried setting OMP_THREAD_LIMIT to 4, but it did not make a substantial difference. Is there a better way of implementing this with OpenMP?

Note: my question is specifically asking how to fix this particular implementation so that it works properly in parallel. Though if anyone knows an alternative solution to this problem and its proper implementation in OpenMP, I am open to that as well.

Thanks in advance.

#include <vector>
#include <iostream>

std::vector<std::vector<double>> testme;

void rect(int i0, int i1, int j0, int j1)
{
    int di = i1 - i0;
    int dj = j1 - j0;
    constexpr int threshold = 16;
    if(di > threshold && dj > threshold)
    {
        int im = i0 + di/2;
        int jm = j0 + dj/2;
        #pragma omp task
        { rect(i0, im, j0, jm); }
        rect(im, i1, jm, j1);
        #pragma omp taskwait

        #pragma omp task 
        { rect(i0, im, jm, j1); }
        rect(im, i1, j0, jm);
        #pragma omp taskwait
    }
    else
    {
        for(int i = i0; i < i1; ++i)
            for(int j = j0; j < j1; ++j) {
                testme[i][j] = 1;
                testme[j][i] = 1;
            }

    }
}

void triangle(int n0, int n1)
{
        int dn = n1 - n0;
        if(dn > 1)
        {
            int nm = n0 + dn/2;
            #pragma omp task
            { triangle(n0, nm); }
            triangle(nm, n1);
            #pragma omp taskwait

            rect(n0, nm, nm, n1);
        }
}


void calc_force(int nbodies)
{
    #pragma omp parallel num_threads(4)
    #pragma omp single
    triangle(0, nbodies);
}

int main()
{
    int n = 400;
    for(int i = 0; i < n; ++i)
    {
        std::vector<double> temp;
        for(int j = 0; j < n; ++j)
            temp.push_back(0);

        testme.push_back(temp);
    }

    // Repeat a bunch of times.
    for(int i = 0; i < 100000; ++i)
        calc_force(n);

    return 0;
}   
asked Oct 04 '15 by pragmatist1


4 Answers

Your current implementation of OpenMP tasks seems to be completely right in applying the triangle partitioning scheme. It seems that, due to the recursive nature of the decomposition, the current code is simply creating too many child tasks, calling the recursive triangle function all the way down until the condition dn = 1 is reached at the bottom of the tree. The granularity is just too fine. This burdens your program with the overhead of creating and completing tasks whose work shrinks at every level, so the overhead outweighs the benefit of the parallelism. I would cut off the recursive triangle task creation at some dn value greater than 1 (around 15, I am guessing) and let the lowest-level work execute serially.

The thread limit only limits the number of active threads, not the number of recursive calls or tasks made. I would try a task if clause, or add an else branch (a serial cutoff) to your triangle implementation.

Something like this:

#include <vector>
#include <iostream>

std::vector<std::vector<double>> testme;

void rect(int i0, int i1, int j0, int j1)
{
    int di = i1 - i0;
    int dj = j1 - j0;
    constexpr int threshold = 64;
    if(di > threshold && dj > threshold)
    {
        int im = i0 + di/2;
        int jm = j0 + dj/2;
        #pragma omp task
        { rect(i0, im, j0, jm); }
        rect(im, i1, jm, j1);
        #pragma omp taskwait

        #pragma omp task
        { rect(i0, im, jm, j1); }
        rect(im, i1, j0, jm);
        #pragma omp taskwait
    }
    else
    {
        // #pragma omp parallel for collapse(2)  (was not implemented during testing)
        for(int i = i0; i < i1; ++i)
            for(int j = j0; j < j1; ++j) {
                testme[i][j] = 1;
                testme[j][i] = 1;
            }
    }
}

void triangle(int n0, int n1)
{
    int dn = n1 - n0;
    if(dn > 1)
    {
        int nm = n0 + dn/2;
        #pragma omp task if(nm > 50)
        { triangle(n0, nm); }
        triangle(nm, n1);
        #pragma omp taskwait

        rect(n0, nm, nm, n1);
    }
}


void calc_force(int nbodies)
{
    #pragma omp parallel num_threads(4)
    #pragma omp single
    triangle(0, nbodies);
}

int main()
{
    int n = 400;
    for(int i = 0; i < n; ++i)
    {
        std::vector<double> temp;
        for(int j = 0; j < n; ++j)
            temp.push_back(0);

        testme.push_back(temp);
    }

    // Repeat a bunch of times.
    for(int i = 0; i < 100000; ++i)
        calc_force(n);

    return 0;
}

NOTE: It may also very well be that this implementation only shows a speedup at scale, where the task overhead is outweighed by the computational intensity of your program.

answered by Walter Simson


Using a recursive algorithm for such a workload already seems very strange to me, and parallelising it with OpenMP tasks seems stranger still... Why not tackle the problem with a more conventional approach?

So I decided to give it a go with a few methods that came to my mind. But to make the exercise sensible, it was important that some actual work was done for the "symmetric calculation"; otherwise, just iterating over all elements without accounting for the symmetry would certainly be the best option.

So I wrote a force() function computing something loosely related to gravitational interactions between two bodies, based on their coordinates. Then I tested 4 different methods of iterating over the particles:

  1. A naïve triangular approach such as the one you proposed. Due to its intrinsic load imbalance, this one is parallelised with a schedule(auto) clause to let the run-time library make whatever decision it thinks best for performance.
  2. A "clever" traversal of the triangular domain, which cuts it in half in the j direction so that 2 regular loops can be used. Basically, it corresponds to something like this:

            .
           /|
          / |       __ __
         /  |  =>  |  // |
        /___|      |//____|

  3. A straightforward rectangular approach that just ignores the symmetry. NB: this one, like your recursive approach, guarantees non-concurrent accesses to the force array.
  4. A linearised method that pre-computes the order of the i and j indexes for traversing the triangular domain, and then iterates over the vectors containing these indexes.

Since accumulating the forces with a force[i] += fij; force[j] -= fij; approach would generate race conditions on the updates of the non-parallelised index (for example j in method #1), I've created a per-thread local force array, which is initialised to 0 upon entry to the parallel region. The computations are then done per thread on this "private" array, and the individual contributions are accumulated into the global force array with a critical construct upon exit from the parallel region. This is a typical reduction pattern for arrays in OpenMP.

Here is the full code for this:

#include <iostream>
#include <vector>
#include <cstdlib>
#include <cmath>
#include <omp.h>

std::vector< std::vector<double> > l_f;
std::vector<double> x, y, f;
std::vector<int> vi, vj;

int numth, tid;
#pragma omp threadprivate( tid )

double force( int i, int j ) {
    double dx = x[i] - x[j];
    double dy = y[i] - y[j];
    double dist2 = dx*dx + dy*dy;
    return dist2 * std::sqrt( dist2 );
}

void loop1( int n ) {
    #pragma omp parallel
    {
        for ( int i = 0; i < n; i++ ) {
            l_f[tid][i] = 0;
        }
        #pragma omp for schedule(auto) nowait
        for ( int i = 0; i < n-1; i++ ) {
            for ( int j = i+1; j < n; j++ ) {
                double fij = force( i, j );
                l_f[tid][i] += fij;
                l_f[tid][j] -= fij;
            }
        }
        #pragma omp critical
        for ( int i = 0; i < n; i++ ) {
            f[i] += l_f[tid][i];
        }
    }
}

void loop2( int n ) {
    int m = n/2-1+n%2;
    #pragma omp parallel
    {
        for ( int i = 0; i < n; i++ ) {
            l_f[tid][i] = 0;
        }
        #pragma omp for schedule(static) nowait
        for ( int i = 0; i < n; i++ ) { 
            for ( int j = 0; j < m; j++ ) {
                int ii, jj;
                if ( j < i ) {
                    ii = n-1-i;
                    jj = n-1-j;
                }
                else {
                    ii = i;
                    jj = j+1;
                }
                double fij = force( ii, jj );
                l_f[tid][ii] += fij;
                l_f[tid][jj] -= fij;
            }
        }
        if ( n%2 == 0 ) {
            #pragma omp for schedule(static) nowait
            for ( int i = 0; i < n/2; i++ ) {
                double fij = force( i, n/2 );
                l_f[tid][i] += fij;
                l_f[tid][n/2] -= fij;
            }
        }
        #pragma omp critical
        for ( int i = 0; i < n; i++ ) {
            f[i] += l_f[tid][i];
        }
    }
}

void loop3( int n ) {
    #pragma omp parallel for schedule(static)
    for ( int i = 0; i < n; i++ ) {
        for ( int j = 0; j < n; j++ ) {
            if ( i < j ) {
                f[i] += force( i, j );
            }
            else if ( i > j ) {
                f[i] -= force( i, j );
            }
        }
    }
}

void loop4( int n ) {
    #pragma omp parallel
    {
        for ( int i = 0; i < n; i++ ) {
            l_f[tid][i] = 0;
        }
        #pragma omp for schedule(static) nowait
        for ( int k = 0; k < vi.size(); k++ ) {
            int i = vi[k];
            int j = vj[k];
            double fij = force( i, j );
            l_f[tid][i] += fij;
            l_f[tid][j] -= fij;
        }
        #pragma omp critical
        for ( int i = 0; i < n; i++ ) {
            f[i] += l_f[tid][i];
        }
    }
}

int main( int argc, char *argv[] ) {
    if ( argc != 2 ) {
        std::cout << "need the dim as argument\n";
        return 1;
    }
    int n = std::atoi( argv[1] );

    // initialise data
    f.resize( n );
    x.resize( n );
    y.resize( n );
    for ( int i = 0; i < n; ++i ) {
        x[i] = y[i] = i;
        f[i] = 0;
    }

    // initialising linearised index vectors
    for ( int i = 0; i < n-1; i++ ) {
        for ( int j = i+1; j < n; j++ ) {
            vi.push_back( i );
            vj.push_back( j );
        }
    }
    // initialising the local forces vectors
    #pragma omp parallel
    {
        tid = omp_get_thread_num();
        #pragma omp master
        numth = omp_get_num_threads();
    }
    l_f.resize( numth );
    for ( int i = 0; i < numth; i++ ) {
        l_f[i].resize( n );
    }

    // testing all methods one after the other, with a warm up before each
    int lmax = 10000;
    loop1( n );
    double tbeg = omp_get_wtime();
    for ( int l = 0; l < lmax; l++ ) {
        loop1( n );
    }
    double tend = omp_get_wtime();
    std::cout << "Time for triangular loop is " << tend-tbeg << "s\n";

    loop2( n );
    tbeg = omp_get_wtime();
    for ( int l = 0; l < lmax; l++ ) {
        loop2( n );
    }
    tend = omp_get_wtime();
    std::cout << "Time for mangled rectangular loop is " << tend-tbeg << "s\n";

    loop3( n );
    tbeg = omp_get_wtime();
    for ( int l = 0; l < lmax; l++ ) {
        loop3( n );
    }
    tend = omp_get_wtime();
    std::cout << "Time for naive rectangular loop is " << tend-tbeg << "s\n";

    loop4( n );
    tbeg = omp_get_wtime();
    for ( int l = 0; l < lmax; l++ ) {
        loop4( n );
    }
    tend = omp_get_wtime();
    std::cout << "Time for linearised loop is " << tend-tbeg << "s\n";

    int ret = f[n-1];
    return ret;
}
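
As a side note: on compilers supporting OpenMP 4.5 or later (e.g. gcc 6+), the manual per-thread array plus critical accumulation used above could be written more compactly with an array-section reduction. A sketch of such a variant of loop1() (my addition, not timed in the results below):

// Sketch only: same triangular traversal as loop1(), but letting OpenMP create
// and merge the per-thread copies of f via an array-section reduction
// (requires OpenMP 4.5 support).
void loop1_reduction( int n ) {
    double *fp = f.data();   // the reduction clause needs a pointer or array, not a std::vector
    #pragma omp parallel for schedule(auto) reduction(+: fp[:n])
    for ( int i = 0; i < n-1; i++ ) {
        for ( int j = i+1; j < n; j++ ) {
            double fij = force( i, j );
            fp[i] += fij;
            fp[j] -= fij;
        }
    }
}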

Now it becomes simple to evaluate their relative performance and scalability. All methods are timed in a loop, after a first non-timed warm-up iteration.

Compilation: g++ -O3 -mtune=native -march=native -fopenmp tbf.cc -o tbf

Results on an 8-core IvyBridge CPU:

> OMP_NUM_THREADS=1 numactl -N 0 -m 0 ./tbf 500
Time for triangular loop is 9.21198s
Time for mangled rectangular loop is 10.1316s
Time for naive rectangular loop is 15.9408s
Time for linearised loop is 10.6449s
> OMP_NUM_THREADS=2 numactl -N 0 -m 0 ./tbf 500
Time for triangular loop is 6.84671s
Time for mangled rectangular loop is 5.13731s
Time for naive rectangular loop is 8.09542s
Time for linearised loop is 5.4654s
> OMP_NUM_THREADS=4 numactl -N 0 -m 0 ./tbf 500
Time for triangular loop is 4.03016s
Time for mangled rectangular loop is 2.90809s
Time for naive rectangular loop is 4.45373s
Time for linearised loop is 2.7733s
> OMP_NUM_THREADS=8 numactl -N 0 -m 0 ./tbf 500
Time for triangular loop is 2.31051s
Time for mangled rectangular loop is 2.05854s
Time for naive rectangular loop is 3.03463s
Time for linearised loop is 1.7106s

So in this case, method #4 seems the best option, with both good performance and very good scalability. Notice that the straightforward triangular approach isn't too bad, thanks to the good load-balancing job done by the schedule(auto) clause. But ultimately, I would encourage you to test with your own workload...

For reference, your initial code (modified to compute force() in exactly the same way as the other tests, using the same number of OpenMP threads, but without the need for local force arrays and a final reduction, thanks to the recursive approach) gives this:

> OMP_NUM_THREADS=1 numactl -N 0 -m 0 ./recursive 500
Time for recursive method is 9.32888s
> OMP_NUM_THREADS=2 numactl -N 0 -m 0 ./recursive 500
Time for recursive method is 9.48718s
> OMP_NUM_THREADS=4 numactl -N 0 -m 0 ./recursive 500
Time for recursive method is 10.962s
> OMP_NUM_THREADS=8 numactl -N 0 -m 0 ./recursive 500
Time for recursive method is 13.2786
answered by Gilles


One of the reasons the parallel code is not living up to its potential is a problem known as false sharing (Wikipedia).

The solution is to partition the problem so that each cache line in your output 2-D matrix (each inner vector) is updated by only one thread. That is true for the triangles by construction: the partition already guarantees it. But the parallelism in the rectangle is problematic if im and jm are not indices of entries aligned on a cache-line boundary. If the partition indicated by im and/or jm is not at a cache-line boundary, then both threads will be writing to common cache lines, just at different offsets within them - the definition of false sharing.
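
As a rough sketch of one way to do that (my own illustration, not taken from your code: round_to_line and DOUBLES_PER_LINE are made-up helpers, and a 64-byte cache line is assumed), the split points in rect() could be rounded down to a multiple of the number of doubles per cache line, so that the two sub-tasks never split a cache line of an inner vector:

constexpr int DOUBLES_PER_LINE = 64 / sizeof(double);   // assumes 64-byte lines, 8-byte doubles

// Hypothetical helper: round a split index down to a cache-line multiple,
// falling back to the original index if that would empty one half.
inline int round_to_line(int idx, int lo)
{
    int rounded = idx - idx % DOUBLES_PER_LINE;
    return rounded > lo ? rounded : idx;
}

// Inside rect(), instead of the plain midpoints:
//     int im = round_to_line(i0 + di/2, i0);
//     int jm = round_to_line(j0 + dj/2, j0);

// Note: each inner std::vector's buffer may itself not start on a cache-line
// boundary, so this only aligns the splits relative to index 0 of each row.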

This article by Intel has a good description of false sharing and advice on how to avoid it.

https://software.intel.com/en-us/articles/avoiding-and-identifying-false-sharing-among-threads

I quote the relevant section for reference:

False sharing is a well-known performance issue on SMP systems, where each processor has a local cache. It occurs when threads on different processors modify variables that reside on the same cache line, as illustrated in Figure 1. This circumstance is called false sharing because each thread is not actually sharing access to the same variable. Access to the same variable, or true sharing, would require programmatic synchronization constructs to ensure ordered data access.

The source line shown in red in the following example code causes false sharing:

double sum=0.0, sum_local[NUM_THREADS];
#pragma omp parallel num_threads(NUM_THREADS)
{
 int me = omp_get_thread_num();
 sum_local[me] = 0.0;

 #pragma omp for
 for (i = 0; i < N; i++)
 sum_local[me] += x[i] * y[i];

 #pragma omp atomic
 sum += sum_local[me];
}  

There is a potential for false sharing on array sum_local. This array is dimensioned according to the number of threads and is small enough to fit in a single cache line. When executed in parallel, the threads modify different, but adjacent, elements of sum_local (the source line shown in red), which invalidates the cache line for all processors.

Figure 1. False sharing occurs when threads on different processors modify variables that reside on the same cache line. This invalidates the cache line and forces a memory update to maintain cache coherency.

In Figure 1, threads 0 and 1 require variables that are adjacent in memory and reside on the same cache line. The cache line is loaded into the caches of CPU 0 and CPU 1 (gray arrows). Even though the threads modify different variables (red and blue arrows), the cache line is invalidated, forcing a memory update to maintain cache coherency.
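
One common remedy for that particular example is to keep each thread's running sum in a private local variable, or simply to use a reduction clause, so that the shared array of adjacent partial sums disappears entirely. A minimal sketch of both variants (my own, not from the Intel article):

#include <omp.h>

// Variant 1: a thread-private accumulator on each thread's own stack, so no
// two threads ever write to adjacent shared memory.
double dot_private(const double* x, const double* y, int N)
{
    double sum = 0.0;
    #pragma omp parallel
    {
        double local = 0.0;                  // private to this thread
        #pragma omp for
        for (int i = 0; i < N; i++)
            local += x[i] * y[i];
        #pragma omp atomic
        sum += local;                        // a single shared update per thread
    }
    return sum;
}

// Variant 2: let OpenMP do the same thing with a reduction clause.
double dot_reduction(const double* x, const double* y, int N)
{
    double sum = 0.0;
    #pragma omp parallel for reduction(+: sum)
    for (int i = 0; i < N; i++)
        sum += x[i] * y[i];
    return sum;
}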

Here's a lecture on the pros and cons of various ways of organizing the N-body algorithm with OpenMP, but note Walter's caveats in the comments: the lecture is more about programming than about physics. As Walter notes, in an N-body problem where some particles have a close encounter, the computation of the forces that determine the net force (acceleration) on a body, and the integration to determine velocity and then position, must be done carefully; a global step function is not appropriate.

http://www.cs.usask.ca/~spiteri/CMPT851/notes/nBody.pdf

In particular, see page 18, which I copy here for reference:

Applying Foster's methodology: Hence, the map of tasks to cores reduces to mapping particles to cores. Assuming the work done per step is roughly equal, a block partitioning that assigns roughly n/p particles per core should provide a well-balanced load. This assumption is valid for the case where symmetry is not taken advantage of when computing fi,j(t). When symmetry is taken advantage of, the loops for lower i will be more expensive than those for larger i. In this case, a cyclic partition is more effective. However, in a shared-memory framework, a cyclic partition is almost certain to lead to a higher number of cache misses than a block partition. In a distributed-memory framework, the communication overhead involved with a cyclic partition will probably be greater than that for a block partition.
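
To put that in OpenMP terms: a block partition of the outer particle loop corresponds to schedule(static), while a cyclic partition corresponds to schedule(static, 1). A small sketch of the two (my own illustration; the loop bodies stand in for the pair-force accumulation):

// Block partition: each thread gets roughly (n-1)/p contiguous values of i.
#pragma omp parallel for schedule(static)
for (int i = 0; i < n-1; ++i) {
    /* accumulate forces for pairs (i, j) with j > i */
}

// Cyclic partition: the values of i are dealt out round-robin, one at a time,
// which balances the shrinking inner loops when the symmetry is exploited.
#pragma omp parallel for schedule(static, 1)
for (int i = 0; i < n-1; ++i) {
    /* accumulate forces for pairs (i, j) with j > i */
}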

answered by amdn


The art of task-based parallelism is to avoid both under- and over-subscription. This means that at some point tasks have to be executed serially, since parallel execution becomes too slow due to its overheads (see also the discussion here). Where that point lies depends on the amount of work per task and on the task scheduler.

In your rect() function, you already use a threshold to limit task-parallel execution to regions with more than threshold elements per side. But strangely enough, you don't do this in triangle(). So my first line of attack would be to experiment with a similar technique in that routine too.

void triangle(int n0, int n1, const int threshold)
{
    int dn = n1 - n0;
    if(dn > threshold)
    {
        int nm = n0 + dn/2;
        #pragma omp task
        { triangle(n0, nm, threshold); }
        triangle(nm, n1, threshold);
        #pragma omp taskwait
        rect(n0, nm, nm, n1, threshold);  // pass threshold on
    } else {
        for(int i = n0; i < n1; ++i)
            for(int j = i+1; j < n1; ++j) {   // excludes self-interactions
                auto fij  = mutual_force(i,j);
                force[i] += fij;
                force[j] -= fij;
            }
    }
}

Note that I made threshold a run-time variable. This allows you to experiment with it and see how sensitively the timing depends on it. The usual behaviour is a long valley of good scaling, with bad results for values that are either too large or too small. For a good task scheduler, you want to generate many more tasks than threads, but also a threshold much larger than 1, say 64-1024.

Of course, there is a squeeze between these two requirements: you cannot efficiently scale small problems to many threads; strong scaling has its limits (N operations cannot be shared between more than N threads).
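
A rough sketch of such a threshold experiment (my own scaffolding, not part of the answer; it assumes the three-argument triangle() above and launches the task tree from a single construct):

#include <cstdio>
#include <omp.h>

// Time the task-based version for a range of cutoff values to locate the
// "valley" of good thresholds.
void scan_thresholds(int nbodies, int repeats)
{
    for (int threshold = 16; threshold <= 1024; threshold *= 2) {
        double t0 = omp_get_wtime();
        for (int r = 0; r < repeats; ++r) {
            #pragma omp parallel
            #pragma omp single
            triangle(0, nbodies, threshold);
        }
        double t1 = omp_get_wtime();
        std::printf("threshold %4d : %g s\n", threshold, t1 - t0);
    }
}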


It may well be that your problem is too small to parallelise efficiently in this way, in particular with only a few hundred particles. An alternative parallelisation strategy is to compute the force for each pair twice and use simple for-loop-based parallelism:

#pragma omp parallel for
for(int i=0; i<n; ++i) {
    force[i] = 0;
    for(int j=0; j<n; ++j)
        if(j != i)                          // skip the self-interaction
            force[i] += mutual_force(i,j);
}

The compiler will find this very easy to optimise and static parallelism may be just fine too.

answered by Walter