OpenMP on a 2-socket system

I do some scientific computations in C++ and try to use OpenMP to parallelize some of the loops. This has worked well so far, e.g. on an Intel i7-4770 with 8 threads.

Setup

We have a small workstation consisting of two Intel CPUs (E5-2680v2) on one mainboard. The code works as long as it runs on 1 CPU with as many threads as I like. But as soon as I employ the second CPU, I observe incorrect results from time to time (roughly every 50th to 100th run). This happens even when I use only 2 threads and assign them to the two different CPUs. Since we have 5 of these workstations (all identical), I ran the code on each of them, and all show this problem.

The workstation runs openSUSE 13.1, kernel 3.11.10-7. The problem exists with g++ 4.8.1 and 4.9.0, and with Intel's icc 13.1.3.192 (the problem occurs less often with icc, but it is still there).

The symptom

The symptom can be described as follows:

  • I have a large array of std::complex: std::complex<double>* mFourierValues;
  • In the loop, I access and set each element. Each iteration accesses a different element, so I do not have concurrent accesses (I checked this): mFourierValues[idx] = newValue;
  • If I compare the stored array value to the input value afterwards, roughly mFourierValues[idx] == newValue, this check fails from time to time (though not every time the overall results end up incorrect).

So the symptom looks as if I were accessing elements concurrently without any synchronization. However, when I store the indices in a std::vector (with a proper #pragma omp critical), all indices are unique and in the correct range; see the sketch below.
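
For illustration, here is a minimal, self-contained sketch of that index check. CalcIndex and N are only placeholders for the real index computation and loop bounds, not the actual production code:

// index_check.cc -- sketch of the index-uniqueness check used for debugging.
// Compile e.g. with: g++ index_check.cc -O2 -Wall -fopenmp
#include <vector>
#include <algorithm>
#include <cassert>
#include <cstddef>

static const int N = 1000;

// Placeholder for the real index computation.
static std::size_t CalcIndex(int i) { return static_cast<std::size_t>(i); }

int main()
{
  std::vector<std::size_t> accessedIndices;

  #pragma omp parallel for
  for (int i = 0; i < N; ++i)
  {
    const std::size_t idx = CalcIndex(i);

    // Record each index inside a critical section so the vector itself cannot race.
    #pragma omp critical
    accessedIndices.push_back(idx);

    // ... here the real code would do: mFourierValues[idx] = newValue; ...
  }

  // After the parallel loop: every recorded index must be unique and in range.
  std::sort(accessedIndices.begin(), accessedIndices.end());
  assert(std::adjacent_find(accessedIndices.begin(), accessedIndices.end()) == accessedIndices.end());
  assert(accessedIndices.back() < static_cast<std::size_t>(N));

  return 0;
}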

Questions

After several days of debugging, my suspicion grows that something else is going on and that my code is correct. To me it looks as if something weird happens when the CPUs synchronize their caches with main memory.

Therefore, my questions are:

  • Can OpenMP even be used for such a system? (I haven't found a source which says no.)
  • Are there known bugs for such a situation (I haven't found any in the bug-trackers)?
  • Where is the problem probably located in your opinion?
    • My code (which seems to run fine on 1 CPU with multiple cores!),
    • the compilers (gcc, icc both!),
    • the OS,
    • the hardware (defect on all 5 workstations?)

Code

[Edit: Old code removed, see below]

Edit with minimal example

OK, I was finally able to produce a shorter (and self-contained) code example.

About the code

  • Reserve some memory space. For an array on the stack, this would be accessed like: complex<double> mAllElements[tensorIdx][kappa1][kappa2][kappa3]. I.e. I have 3 rank-3 tensors (indexed by tensorIdx). Each tensor represents a 3-dimensional array, indexed by kappa1, kappa2 and kappa3.
  • I have 4 nested loops (over all 4 indices), where the kappa1 loop is the one that gets parallelized (and is the outermost one). They are located in DoComputation().
  • In main(), I call DoComputation() once to get some reference values, and then I call it several times and compare the results. They should match exactly, but sometimes they don't.

Unfortunately, the code is still around 190 lines long. I tried to simplify it further (only 1 tensor of rank 1, etc.), but then I was never able to reproduce the problem. I guess it appears because the memory accesses are non-aligned (the loop over tensorIdx is the innermost one; I know this is far from optimal).

Furthermore, some delays were needed in appropriate places to reproduce the bug; that is the reason for the nops() calls. Without them the code runs a lot faster but so far has not shown the problem.

Note that I checked the critical part, CalcElementIdx(), again and deem it correct (each element is accessed exactly once). I also ran valgrind's memcheck, helgrind and drd (with a properly recompiled libgomp), and all three reported no errors.
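
For reference, the tools were invoked roughly as follows (the binary name ./a.out is just a placeholder):

valgrind --tool=memcheck ./a.out
valgrind --tool=helgrind ./a.out
valgrind --tool=drd ./a.out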

Output

Every second to third start of the program I get one or two mismatches. Example output:

41      Is exactly 0
42      Is exactly 0
43      Is exactly 0
44      Is exactly 0
45      348496
46      Is exactly 0
47      Is exactly 0
48      Is exactly 0
49      Is exactly 0

This is true for gcc and icc.

My question

My question is: Does the code below look correct to you (apart from obvious design flaws)? If it is too long, I will try to reduce it further, but as described above, I have not managed to do so yet.

The code

The code was compiled with

g++ main.cc -O3 -Wall -Wextra -fopenmp

or

icc main.cc -O3 -Wall -Wextra -openmp

Both versions show the described problem when run on 2 CPUs with a total of 40 threads. I couldn't observe the bug on 1 CPU (with as many threads as I like).
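
To compare the 1-CPU and 2-CPU cases, one way to restrict the run to a single socket or to pin the threads is, for example (node and core numbering are machine-dependent, so the values here are only illustrative):

# Run on NUMA node 0 only, with memory allocated there as well.
numactl --cpunodebind=0 --membind=0 ./a.out

# Run 2 threads pinned to cores on different sockets (gcc/libgomp).
OMP_NUM_THREADS=2 GOMP_CPU_AFFINITY="0 10" ./a.out

# With icc, KMP_AFFINITY can be used analogously.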

// File: main.cc
#include <cmath>
#include <iostream>
#include <fstream>
#include <complex>
#include <cassert>
#include <cstdio>     // for printf
#include <algorithm>  // for std::max
#include <iomanip>
#include <omp.h>

using namespace std;


// If defined: We add some nops in certain places, to get the timing right.
// Without them, I haven't observed the bug.
#define ENABLE_NOPS

// The size of each of the 3 tensors is: GRID_SIZE x GRID_SIZE x GRID_SIZE
static const int GRID_SIZE = 60;

//=============================================
// Produces several nops. Used to get correct "timings".

//----
template<int N> __attribute__((always_inline)) inline void nop()
{
    nop<N-1>();
    asm("nop;");
}

//----
template<> inline void nop<0>() { }

//----
__attribute__((always_inline)) inline void nops()
{
    nop<500>(); nop<500>(); nop<500>(); nop<500>(); nop<500>(); nop<500>(); nop<500>(); nop<500>(); nop<500>();
}




//=============================================
/*
Memory layout: We have 3 rank-3-tensors, i.e. 3 arrays of dimension 3.
The layout looks like this: complex<double> allElements[tensorIdx][kappa1][kappa2][kappa3];
The kappas represent the indices into a certain tensor, and are all in the interval [0; GRID_SIZE-1].
*/
class MemoryManagerFFTW
{
public:
  //---------- Constructor ----------
  MemoryManagerFFTW()
  {
    mAllElements = new complex<double>[GetTotalNumElements()];
  }

  //---------- Destructor ----------
  ~MemoryManagerFFTW() 
  { 
    delete[] mAllElements; 
  }

  //---------- SetElement ----------
  void SetElement(int tensorIdx, int kappa1, int kappa2, int kappa3, const complex<double>& newVal)
  {
    // Out-of-bounds error checks are done in this function.
    const size_t idx = CalcElementIdx(tensorIdx, kappa1, kappa2, kappa3);

    // These nops here are important to reproduce the bug.
#if defined(ENABLE_NOPS)
    nops();
    nops();
#endif

    // A flush makes the bug appear more often.
    // #pragma omp flush
    mAllElements[idx] = newVal;

    // This assertion never fired, although the same check fails in DoComputation() from time to time.
    assert(newVal == mAllElements[idx]);
  }

  //---------- GetElement ----------
  const complex<double>& GetElement(int tensorIdx, int kappa1, int kappa2, int kappa3)const
  {  
    const size_t idx = CalcElementIdx(tensorIdx, kappa1, kappa2, kappa3);
    return mAllElements[idx];
  }


  //---------- CalcElementIdx ----------
  size_t CalcElementIdx(int tensorIdx, int kappa1, int kappa2, int kappa3)const
  {
    // We have 3 tensors (indexed by "tensorIdx"). Each tensor is of rank 3. In memory, they are placed one after another.
    // tensorStartIdx is the index of the first element in the tensor.
    const size_t tensorStartIdx = GetNumElementsPerTensor() * tensorIdx;

    // Index of the element relative to the beginning of the tensor. A tensor is a 3dim. array of size GRID_SIZE x GRID_SIZE x GRID_SIZE
    const size_t idxInTensor = kappa3 + GRID_SIZE * (kappa2 + GRID_SIZE * kappa1);

    const size_t finalIdx = tensorStartIdx + idxInTensor;
    assert(finalIdx < GetTotalNumElements());

    return finalIdx;
  }


  //---------- GetNumElementsPerTensor & GetTotalNumElements ----------
  size_t GetNumElementsPerTensor()const { return GRID_SIZE * GRID_SIZE * GRID_SIZE; }
  size_t GetTotalNumElements()const { return NUM_TENSORS * GetNumElementsPerTensor(); }



public:
  static const int NUM_TENSORS = 3; // The number of tensors.
  complex<double>* mAllElements; // All tensors. An array [tensorIdx][kappa1][kappa2][kappa3]
};




//=============================================
void DoComputation(MemoryManagerFFTW& mSingleLayerManager)
{
  // Parallelize the outer loop.
  #pragma omp parallel for
  for (int kappa1 = 0; kappa1 < GRID_SIZE; ++kappa1)
  {
    for (int kappa2 = 0; kappa2 < GRID_SIZE; ++kappa2)
    {
      for (int kappa3 = 0; kappa3 < GRID_SIZE; ++kappa3)
      {    
#ifdef ENABLE_NOPS
        nop<50>();
#endif
        const double k2 = kappa1*kappa1 + kappa2*kappa2 + kappa3*kappa3;
        for (int j = 0; j < 3; ++j)
        {
          // Compute and set new result.
          const complex<double> curElement = mSingleLayerManager.GetElement(j, kappa1, kappa2, kappa3);
          const complex<double> newElement = exp(-k2) * k2 * curElement;

          mSingleLayerManager.SetElement(j, kappa1, kappa2, kappa3, newElement);

          // Check if the result has been set correctly. This check sometimes fails, but _not_ every time the overall result is incorrect.
          const complex<double> test = mSingleLayerManager.GetElement(j, kappa1, kappa2, kappa3);
          if (test != newElement)
            printf("Failure: (%g, %g) != (%g, %g)\n", test.real(), test.imag(), newElement.real(), newElement.imag());
        }
      }
    }
  }
}



//=============================================
int main()
{
  cout << "Max num. threads: " << omp_get_max_threads() << endl;

  // Call DoComputation() once to get a reference-array.
  MemoryManagerFFTW reference;
  for (size_t i = 0; i < reference.GetTotalNumElements(); ++i)
    reference.mAllElements[i] = complex<double>((double)i, (double)i+0.5);
  DoComputation(reference);

  // Call DoComputation() several times, and each time compare the result to the reference.
  const size_t NUM = 1000;
  for (size_t curTry = 0; curTry < NUM; ++curTry)
  {
    MemoryManagerFFTW mSingleLayerManager;
    for (size_t i = 0; i < mSingleLayerManager.GetTotalNumElements(); ++i)
      mSingleLayerManager.mAllElements[i] = complex<double>((double)i, (double)i+0.5);
    DoComputation(mSingleLayerManager);

    // Get the max. difference. This *should* be 0, but isn't from time to time.
    double maxDiff = -1;
    for (size_t i = 0; i < mSingleLayerManager.GetTotalNumElements(); ++i)
    {
      const complex<double> curDiff = mSingleLayerManager.mAllElements[i] - reference.mAllElements[i];
      maxDiff = max(maxDiff, max(curDiff.real(), curDiff.imag()));
    }

    if (maxDiff != 0)
      cout << curTry << "\t" << maxDiff << endl;
    else
      cout << curTry << "\t" << "Is exactly 0" << endl;
  }

  return 0;
}

Edit

As can be seen from the comments and Zboson's answer below, there was a bug in kernel 3.11.10-7. After updating to 3.15.0-1, my problem is gone and the code works as it should.

asked Jun 14 '14 by Sedenion
1 Answer

The problem was due to a bug in Linux kernel 3.11.10-7. The bug may be related to how the kernel handles invalidating the TLB cache, as pointed out by Hristo Iliev. I guessed that the kernel might be the problem because I had read that Linux kernel 3.15 would bring some improvements for NUMA systems, so I figured the kernel version is important for NUMA systems.

When the OP updated the Linux kernel of his NUMA system to 3.15.0-1, the problem went away.
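
For anyone checking their own system, the kernel version and the NUMA topology can be inspected with, for example:

uname -r            # kernel version
numactl --hardware  # NUMA nodes, their CPUs and memory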

answered by Z boson