OpenMP code far slower than serial - memory or thread overhead bottleneck?

I am trying to parallelize (OpenMP) some scientific C++ code where the bulk (>95%) of the CPU time is spent calculating a nasty (and unavoidable) O(N^2) interaction among of order N~200 particles. This calculation is repeated for 1e10 time steps. I have tried various OpenMP configurations, each slower than the serial code by at least an order of magnitude, with poor scaling as additional cores are added.

Below is a sketch of the pertinent code, built around a representative dummy data hierarchy Tree->Branch->Leaf. Each Leaf object stores its own position and velocity for the current and previous three time steps, amongst other things. Each Branch stores a collection of Leaf objects, and each Tree stores a collection of Branch objects. This data structure works very well for the complex but less CPU-intensive calculations that must also be performed at each time step (and that have taken months to perfect).
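In outline, the hierarchy looks something like this (illustrative declarations only; the actual member and container types are not shown here):

#include <vector>

struct Leaf {
    double Pos[3];   // current position
    double Vel[3];   // current velocity
    /* plus position/velocity history for the previous three time steps, etc. */
};

struct Branch {
    std::vector<Leaf*> leaves;      // collection of Leaf objects
};

struct Tree {
    std::vector<Branch*> branches;  // collection of Branch objects
};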

#include <omp.h>

/* i, t, size, Tree and Branch are defined elsewhere: i is the current time
   step, t the total number of steps, size the number of branches */

#pragma omp parallel num_threads(16) // also tried 2, 4 etc. - little difference - hoping that placing this line here spawns the thread pool at the outset rather than at every step
{
    while(i < t){
        #pragma omp master
        {
            /* do other calculations on single core, output etc. */
            Tree.PreProcessing();
            /* PreProcessing can drastically change data for certain conditions,
               but only at 3 or 4 of the 1e10 time steps */
            Tree.Output();
        }
        #pragma omp barrier
        #pragma omp for schedule(static) nowait
        for(int k = 0; k < size; k++){
            /* do O(N^2) calc that requires position of all other leaves */
            Tree.CalculateInteraction(Branch[k]);
        }
        /* return to single core to finish time step */
        #pragma omp master
        {
            /* iterate forwards */
            Tree.PropagatePositions();
            i++;
        }
        #pragma omp barrier
    }
}

Very briefly, the CPU-hog function does this:

void Tree::CalculateInteraction(Leaf* A){
    // for all branches B in tree {
    //     for all leaves Q in B {
               if(/* condition between A and Q */){ /* skip */ }
               else{
                   // find displacement D of A and Q
                   // find displacement L of A and "A-1"
                   // take the cross product of the two displacements
                   // add the cross product to the velocity of leaf A
                   for(int j(0); j != 3; j++){
                       A->Vel[j] += constant * (D_cross_L)[j];
                   }
               }
    //     }
    // }
}

My question is whether this crippling performance loss is due to OpenMP thread-management overhead dominating, or whether it is a case of a data hierarchy designed with no thought to parallelism?

I should note that each individual step is timed to be considerably longer in parallel than in serial, so this isn't some initialisation overhead issue; the two versions have been tested on calculations that take 1 vs 10 hours, and the code will eventually be applied to serial calculations that can take 30 hours (for which even a 2x speed-up would be very beneficial). Also, it may be worth knowing that I'm compiling with g++ 5.2.0 and -fopenmp -march=native -m64 -mfpmath=sse -Ofast -funroll-loops.
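For reference, the full compile command is along these lines (the file names here are placeholders):

g++ -fopenmp -march=native -m64 -mfpmath=sse -Ofast -funroll-loops -o sim main.cpp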

I am new to OpenMP so any tips would be greatly appreciated, please let me know if anything should be clarified.

Asked Aug 11 '15 by Matthew


2 Answers

Thanks for providing the link to the original source! I've been able to compile and get some stats on two platforms: a Xeon E5-2670 with icpc 15.0 and g++ 4.9.0, and a Core i7-4770 with g++ 4.8.4.

On the Xeon, both icpc and g++ produced code that scaled with the number of threads. I ran a shortened simulation (3e-7 seconds of simulated time) derived from the run.in file in the distribution:

Xeon E5-2670 / icpc 15.0
threads   time (s)   ipc
------------------------
1         17.5       2.17
2         13.0       1.53
4          6.81      1.53
8          3.81      1.52

Xeon E5-2670 / g++ 4.9.0
threads   time (s)   ipc
------------------------
1         13.2       1.75
2          9.38      1.28
4          5.09      1.27
8          3.07      1.25

On the Core i7, I did see the ugly scaling behaviour you observed, with g++ 4.8.4:

Core i7-4770 / g++ 4.8.4
threads   time (s)   ipc
------------------------
1          8.48      2.41
2         11.5       0.97
4         12.6       0.73

The first observation is that there is something platform-specific affecting the scaling.

I had a look in the point.h and velnl.cpp files, and noticed that you were using vector<double> variables to store 3-d vector data, including many temporaries. These all access the heap, which is a potential source of contention. Intel's OpenMP implementation uses thread-local heaps to avoid heap contention; perhaps g++ 4.9's does too, while g++ 4.8.4's does not?

I forked the project (halfflat/vfmcppar on GitHub) and modified these files to use std::array<double,3> for these 3-d vectors (a sketch of the change follows the timings below); this restores scaling and also gives much faster run times:

Core i7-4770 / g++ 4.8.4
std::array implementation
threads   time (s)   ipc
------------------------
1          1.40      1.54
2          0.84      1.35
4          0.60      1.11
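The change is essentially of the following form (a sketch under my assumptions rather than the actual diff from the fork; the function and variable names are invented for illustration):

#include <array>
#include <vector>

// before: every 3-d temporary is a fresh heap allocation
std::vector<double> displacement_old(const std::vector<double>& a,
                                     const std::vector<double>& b) {
    return { a[0] - b[0], a[1] - b[1], a[2] - b[2] };  // allocates on the heap
}

// after: a fixed-size type; no heap traffic, no allocator contention
using vec3 = std::array<double, 3>;

vec3 displacement(const vec3& a, const vec3& b) {
    return { a[0] - b[0], a[1] - b[1], a[2] - b[2] };  // no allocation
}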

I haven't run these tests on a decent-length simulation, so some scaling could well be lost due to set-up and I/O overhead.

The take-away point is that any shared resource can frustrate scalability, including the heap.

Answered Sep 21 '22 by halfflat


Your problem is most likely false sharing due to your use of linked lists for the nodes. With that memory layout, you not only have the problem of a cache miss almost every time you walk the tree to another node (as mentioned by halfflat).

A more severe problem is that tree nodes accessed and modified from different threads may actually be close in memory. If they share a cache line, false sharing (or cache ping-pong) causes that line to be repeatedly re-synchronised between the caches of different threads.
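To illustrate the effect (a generic sketch, not code from the project): data modified by different threads should live on separate cache lines.

// Each thread updates only its own element. Without the alignas(64),
// neighbouring elements could share a 64-byte cache line, and every write
// by one core would invalidate that line in the other cores' caches.
struct alignas(64) PerThread {
    double accum = 0.0;
};

PerThread partial[16];  // one slot per thread (thread count assumed)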

The solution to both problems is to avoid linked data structures. They are almost always the cause of low efficiency. In your case, the solution is to first build a linked tree with minimal data (only what is needed to define the tree), and then to map it to another tree that doesn't use linked lists and may contain more data. This is what I do, and tree traversal is reasonably fast (a tree walk can never be really fast, since some cache misses are unavoidable: even with contiguous sister nodes, parent-daughter access cannot be contiguous at the same time). A significant speed-up (factor > 2) can be obtained for the tree building if you add the particles to the new tree in the order of the old tree (this avoids cache misses).
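A minimal sketch of that two-stage approach (all names here are hypothetical):

#include <vector>

// stage 1: a linked tree holding only the data needed to define the topology
struct LinkedNode {
    LinkedNode* parent = nullptr;
    std::vector<LinkedNode*> children;
};

// stage 2: the same tree flattened into one contiguous buffer; each node's
// children occupy a contiguous index range, so sister nodes sit side by side
struct FlatNode {
    int parent;                      // index of the parent node, -1 for the root
    int first_child, num_children;   // children: nodes[first_child] .. nodes[first_child + num_children - 1]
    /* ... full per-node payload ... */
};

std::vector<FlatNode> nodes;         // a single allocation; traversal follows indices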

Answered Sep 18 '22 by Walter