OpenMP code far slower than serial - memory or thread overhead bottleneck?

I am trying to parallelize (OpenMP) some scientific C++ code where the bulk (>95%) of the CPU time is spent calculating a nasty (and unavoidable) O(N^2) interaction among of order N~200 particles. This calculation is repeated for 1e10 time steps. I have tried various OpenMP configurations, each slower than the serial code by at least an order of magnitude, with poor scaling as additional cores are added.

Below is a sketch of the pertinent code, built around a representative dummy data hierarchy Tree->Branch->Leaf. Each Leaf object stores its own position and velocity for the current and previous three time steps, amongst other things. Each Branch stores a collection of Leaf objects, and each Tree stores a collection of Branch objects. This data structure works very well for the complex but less CPU-intensive calculations that must also be performed at each time step (and that have taken months to perfect).
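In outline, the hierarchy looks something like this (illustrative declarations only; the actual member and container types are not shown here):

#include <vector>

struct Leaf {
    double Pos[3];   // current position
    double Vel[3];   // current velocity
    /* plus position/velocity history for the previous three time steps, etc. */
};

struct Branch {
    std::vector<Leaf*> leaves;      // collection of Leaf objects
};

struct Tree {
    std::vector<Branch*> branches;  // collection of Branch objects
};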

#include <omp.h>

/* i, t, size, Tree and Branch are defined elsewhere: i is the current time
   step, t the total number of steps, size the number of branches */

#pragma omp parallel num_threads(16) // also tried 2, 4 etc. - little difference - hoping that placing this line here spawns the thread pool at the outset rather than at every step
{
    while(i < t){
        #pragma omp master
        {
            /* do other calculations on single core, output etc. */
            Tree.PreProcessing();
            /* PreProcessing can drastically change data for certain conditions,
               but only at 3 or 4 of the 1e10 time steps */
            Tree.Output();
        }
        #pragma omp barrier
        #pragma omp for schedule(static) nowait
        for(int k = 0; k < size; k++){
            /* do O(N^2) calc that requires position of all other leaves */
            Tree.CalculateInteraction(Branch[k]);
        }
        /* return to single core to finish time step */
        #pragma omp master
        {
            /* iterate forwards */
            Tree.PropagatePositions();
            i++;
        }
        #pragma omp barrier
    }
}

Very briefly, the CPU-hog function does this:

void Tree::CalculateInteraction(Leaf* A){
    // for all branches B in tree {
    //     for all leaves Q in B {
               if(/* condition between A and Q */){ /* skip */ }
               else{
                   // find displacement D of A and Q
                   // find displacement L of A and "A-1"
                   // take the cross product of the two displacements
                   // add the cross product to the velocity of leaf A
                   for(int j(0); j != 3; j++){
                       A->Vel[j] += constant * (D_cross_L)[j];
                   }
               }
    //     }
    // }
}

My question is whether this crippling performance loss is due to OpenMP thread-management overhead dominating, or whether it is a case of a data hierarchy designed with no thought to parallelism?

I should note that each individual step is timed to be considerably longer in parallel than in serial, so this isn't some initialisation overhead issue; the two versions have been tested on calculations that take 1 vs 10 hours, and the code will eventually be applied to serial calculations that can take 30 hours (for which even a 2x speed-up would be very beneficial). Also, it may be worth knowing that I'm compiling with g++ 5.2.0 and -fopenmp -march=native -m64 -mfpmath=sse -Ofast -funroll-loops.
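For reference, the full compile command is along these lines (the file names here are placeholders):

g++ -fopenmp -march=native -m64 -mfpmath=sse -Ofast -funroll-loops -o sim main.cpp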

I am new to OpenMP so any tips would be greatly appreciated, please let me know if anything should be clarified.

Asked Aug 11 '15 by Matthew


2 Answers

Thanks for providing the link to the original source! I've been able to compile and get some stats on two platforms: a Xeon E5-2670 with icpc 15.0 and g++ 4.9.0, and a Core i7-4770 with g++ 4.8.4.

On the Xeon, both icpc and g++ produced code that scaled with the number of threads. I ran a shortened simulation (3e-7 seconds of simulated time) derived from the run.in file in the distribution:

Xeon E5-2670 / icpc 15.0
threads   time (s)   ipc
------------------------
1         17.5       2.17
2         13.0       1.53
4          6.81      1.53
8          3.81      1.52

Xeon E5-2670 / g++ 4.9.0
threads   time (s)   ipc
------------------------
1         13.2       1.75
2          9.38      1.28
4          5.09      1.27
8          3.07      1.25

On the Core i7, I did see the ugly scaling behaviour you observed, with g++ 4.8.4:

Core i7-4770 / g++ 4.8.4
threads   time (s)   ipc
------------------------
1          8.48      2.41
2         11.5       0.97
4         12.6       0.73

The first observation is that there is something platform-specific affecting the scaling.

I had a look in the point.h and velnl.cpp files, and noticed that you were using vector<double> variables to store 3-d vector data, including many temporaries. These all access the heap, which is a potential source of contention. Intel's OpenMP implementation uses thread-local heaps to avoid heap contention; perhaps g++ 4.9's does too, while g++ 4.8.4's does not?

I forked the project (halfflat/vfmcppar on GitHub) and modified these files to use std::array<double,3> for these 3-d vectors (a sketch of the change follows the timings below); this restores scaling and also gives much faster run times:

Core i7-4770 / g++ 4.8.4
std::array implementation
threads   time (s)   ipc
------------------------
1          1.40      1.54
2          0.84      1.35
4          0.60      1.11
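The change is essentially of the following form (a sketch under my assumptions rather than the actual diff from the fork; the function and variable names are invented for illustration):

#include <array>
#include <vector>

// before: every 3-d temporary is a fresh heap allocation
std::vector<double> displacement_old(const std::vector<double>& a,
                                     const std::vector<double>& b) {
    return { a[0] - b[0], a[1] - b[1], a[2] - b[2] };  // allocates on the heap
}

// after: a fixed-size type; no heap traffic, no allocator contention
using vec3 = std::array<double, 3>;

vec3 displacement(const vec3& a, const vec3& b) {
    return { a[0] - b[0], a[1] - b[1], a[2] - b[2] };  // no allocation
}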

I haven't run these tests on a decent-length simulation, so some scaling could well be lost due to set-up and I/O overhead.

The take-away point is that any shared resource can frustrate scalability, including the heap.

Answered Sep 21 '22 by halfflat


Your problem is most likely false sharing due to your use of linked lists for the nodes. With that memory layout, you not only have the problem of a cache miss almost every time you walk the tree to another node (as mentioned by halfflat).

A more severe problem is that tree nodes accessed and modified from different threads may actually be close in memory. If they share a cache line, false sharing (or cache ping-pong) causes that line to be repeatedly re-synchronised between the caches of different threads.
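To illustrate the effect (a generic sketch, not code from the project): data modified by different threads should live on separate cache lines.

// Each thread updates only its own element. Without the alignas(64),
// neighbouring elements could share a 64-byte cache line, and every write
// by one core would invalidate that line in the other cores' caches.
struct alignas(64) PerThread {
    double accum = 0.0;
};

PerThread partial[16];  // one slot per thread (thread count assumed)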

The solution to both problems is to avoid linked data structures. They are almost always the cause of low efficiency. In your case, the solution is to first build a linked tree with minimal data (only what is needed to define the tree), and then to map it to another tree that doesn't use linked lists and may contain more data. This is what I do, and tree traversal is reasonably fast (a tree walk can never be really fast, since some cache misses are unavoidable: even with contiguous sister nodes, parent-daughter access cannot be contiguous at the same time). A significant speed-up (factor > 2) can be obtained for the tree building if you add the particles to the new tree in the order of the old tree (this avoids cache misses).
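A minimal sketch of that two-stage approach (all names here are hypothetical):

#include <vector>

// stage 1: a linked tree holding only the data needed to define the topology
struct LinkedNode {
    LinkedNode* parent = nullptr;
    std::vector<LinkedNode*> children;
};

// stage 2: the same tree flattened into one contiguous buffer; each node's
// children occupy a contiguous index range, so sister nodes sit side by side
struct FlatNode {
    int parent;                      // index of the parent node, -1 for the root
    int first_child, num_children;   // children: nodes[first_child] .. nodes[first_child + num_children - 1]
    /* ... full per-node payload ... */
};

std::vector<FlatNode> nodes;         // a single allocation; traversal follows indices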

Answered Sep 18 '22 by Walter