I do some scientific computations in C++ and try to utilize OpenMP to parallelise some of the loops. This has worked well so far, e.g. on an Intel i7-4770 with 8 threads.
We have a small workstation which consists of two Intel CPUs (E5-2680v2) on one mainboard. The code works as long as it runs on 1 CPU with as many threads as I like. But as soon as I employ the second CPU, I observe incorrect results from time to time (around every 50th-100th time I run the code). This happens even when I use only 2 threads and assign them to the two different CPUs. As we have 5 of these workstations (all are identical), I ran the code on each one of them, and all show this problem.
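To verify on which logical CPU each OpenMP thread actually ends up, a small helper like the following can be used (only a minimal sketch of mine, Linux-specific because of sched_getcpu(); it is not part of the code in question):
// placement_check.cc -- compile with: g++ placement_check.cc -fopenmp
#include <cstdio>
#include <omp.h>
#include <sched.h>   // sched_getcpu(), glibc extension
int main()
{
  #pragma omp parallel
  {
    // Print the logical CPU the calling thread is currently running on.
    printf("Thread %d runs on CPU %d\n", omp_get_thread_num(), sched_getcpu());
  }
  return 0;
}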
The workstation runs OpenSuse 13.1, kernel 3.11.10-7. The problem exists with g++ 4.8.1 and 4.9.0 and with Intel's icc 13.1.3.192 (it occurs less often with icc, but it is still there).
The symptom can be described as follows. There is an array of complex values:
std::complex<double>* mFourierValues;
In the loop, I set elements of this array:
mFourierValues[idx] = newValue;
If I check the value directly afterwards, i.e. whether
mFourierValues[idx] == newValue
holds, this check fails from time to time (although not every time the results end up being incorrect). So the symptom looks like I access elements concurrently without any synchronization. However, when I store the indices in a std::vector (with a proper #pragma omp critical), all indices are unique and in the correct range.
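For illustration, that index check looked roughly like the sketch below (placeholder names, not my original code): every index computed in the parallel loop is recorded under a critical section and checked for range and uniqueness afterwards.
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

std::vector<std::size_t> gUsedIndices; // shared by all threads

// Called from inside the parallel loop for every computed index.
void RecordIndex(std::size_t idx)
{
  #pragma omp critical
  gUsedIndices.push_back(idx);
}

// Called once after the parallel region.
void CheckIndices(std::size_t totalNumElements)
{
  std::sort(gUsedIndices.begin(), gUsedIndices.end());
  // Every recorded index must be in range ...
  assert(gUsedIndices.empty() || gUsedIndices.back() < totalNumElements);
  // ... and unique.
  assert(std::adjacent_find(gUsedIndices.begin(), gUsedIndices.end()) == gUsedIndices.end());
}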
After several days of debugging, my suspicion is growing that something else is going on and that my code is correct. To me it looks like something weird happens when the CPUs synchronize their caches with main memory.
Therefore, my questions are stated below, after the self-contained example.
[Edit: Old code removed, see below.]
OK, I was finally able to produce a shorter (and self-contained) code example.
The memory layout is
complex<double> mAllElements[tensorIdx][kappa1][kappa2][kappa3]
i.e. I have 3 rank-3 tensors (indexed by tensorIdx). Each tensor represents a 3-dimensional array, indexed by kappa1, kappa2 and kappa3.
The kappa1 loop is the one that gets parallelized (and is the outermost one). The loops are located in DoComputation().
In main(), I call DoComputation() once to get some reference values, and then I call it several times and compare the results. They should match exactly, but sometimes they don't.
Unfortunately, the code is still around 190 lines long. I tried to simplify it further (only 1 tensor of rank 1, etc.), but then I was never able to reproduce the problem. I guess it appears because the memory accesses are non-aligned (the loop over tensorIdx is the innermost one). (I know, this is far from optimal.)
Furthermore, some delays were needed in appropriate places to reproduce the bug; that is the reason for the nops() calls. Without them the code runs a lot faster, but so far has not shown the problem.
Note that I checked the critical part, CalcElementIdx(), again and deem it correct (each element is accessed once). I also ran valgrind's memcheck, helgrind and drd (with a properly recompiled libgomp), and all three gave no errors.
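(For reference, these were standard invocations of the form below, nothing special:)
valgrind --tool=memcheck ./a.out
valgrind --tool=helgrind ./a.out
valgrind --tool=drd ./a.out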
Every second or third start of the program I get one or two mismatches. Example output:
41 Is exactly 0
42 Is exactly 0
43 Is exactly 0
44 Is exactly 0
45 348496
46 Is exactly 0
47 Is exactly 0
48 Is exactly 0
49 Is exactly 0
This happens with both gcc and icc.
My question is: does the code below look correct to you (apart from obvious design flaws)? If it is too long, I will try to reduce it further, but as described above I have not managed to do so yet.
The code was compiled with
g++ main.cc -O3 -Wall -Wextra -fopenmp
or
icc main.cc -O3 -Wall -Wextra -openmp
Both versions show the described problem when run on 2 CPUs with a total of 40 threads. I could not observe the bug on 1 CPU (with as many threads as I like).
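The runs can be pinned with the usual OpenMP environment variables, for example like this (not my exact launch command, just an illustration; GOMP_CPU_AFFINITY and KMP_AFFINITY are the compiler-specific alternatives):
OMP_NUM_THREADS=40 OMP_PROC_BIND=true ./a.out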
// File: main.cc
#include <cmath>
#include <cstdio>   // printf() is used in DoComputation()
#include <iostream>
#include <fstream>
#include <complex>
#include <cassert>
#include <iomanip>
#include <omp.h>
using namespace std;
// If defined: We add some nops in certain places, to get the timing right.
// Without them, I haven't observed the bug.
#define ENABLE_NOPS
// The size of each of the 3 tensors is: GRID_SIZE x GRID_SIZE x GRID_SIZE
static const int GRID_SIZE = 60;
//=============================================
// Produces several nops. Used to get correct "timings".
//----
template<int N> __attribute__((always_inline)) inline void nop()
{
nop<N-1>();
asm("nop;");
}
//----
template<> inline void nop<0>() { }
//----
__attribute__((always_inline)) inline void nops()
{
nop<500>(); nop<500>(); nop<500>(); nop<500>(); nop<500>(); nop<500>(); nop<500>(); nop<500>(); nop<500>();
}
//=============================================
/*
Memory layout: We have 3 rank-3-tensors, i.e. 3 arrays of dimension 3.
The layout looks like this: complex<double> allElements[tensorIdx][kappa1][kappa2][kappa3];
The kappas represent the indices into a certain tensor, and are all in the interval [0; GRID_SIZE-1].
*/
class MemoryManagerFFTW
{
public:
//---------- Constructor ----------
MemoryManagerFFTW()
{
mAllElements = new complex<double>[GetTotalNumElements()];
}
//---------- Destructor ----------
~MemoryManagerFFTW()
{
delete[] mAllElements;
}
//---------- SetElement ----------
void SetElement(int tensorIdx, int kappa1, int kappa2, int kappa3, const complex<double>& newVal)
{
// Out-of-bounds error checks are done in this function.
const size_t idx = CalcElementIdx(tensorIdx, kappa1, kappa2, kappa3);
// These nops here are important to reproduce the bug.
#if defined(ENABLE_NOPS)
nops();
nops();
#endif
// A flush makes the bug appear more often.
// #pragma omp flush
mAllElements[idx] = newVal;
// This was never false, although the same check is false in DoComputation() from time to time.
assert(newVal == mAllElements[idx]);
}
//---------- GetElement ----------
const complex<double>& GetElement(int tensorIdx, int kappa1, int kappa2, int kappa3)const
{
const size_t idx = CalcElementIdx(tensorIdx, kappa1, kappa2, kappa3);
return mAllElements[idx];
}
//---------- CalcElementIdx ----------
size_t CalcElementIdx(int tensorIdx, int kappa1, int kappa2, int kappa3)const
{
// We have 3 tensors (indexed by "tensorIdx"). Each tensor is of rank 3. In memory, they are placed behind each other.
// tensorStartIdx is the index of the first element in the tensor.
const size_t tensorStartIdx = GetNumElementsPerTensor() * tensorIdx;
// Index of the element relative to the beginning of the tensor. A tensor is a 3dim. array of size GRID_SIZE x GRID_SIZE x GRID_SIZE
const size_t idxInTensor = kappa3 + GRID_SIZE * (kappa2 + GRID_SIZE * kappa1);
const size_t finalIdx = tensorStartIdx + idxInTensor;
assert(finalIdx < GetTotalNumElements());
return finalIdx;
}
//---------- GetNumElementsPerTensor & GetTotalNumElements ----------
size_t GetNumElementsPerTensor()const { return GRID_SIZE * GRID_SIZE * GRID_SIZE; }
size_t GetTotalNumElements()const { return NUM_TENSORS * GetNumElementsPerTensor(); }
public:
static const int NUM_TENSORS = 3; // The number of tensors.
complex<double>* mAllElements; // All tensors. An array [tensorIdx][kappa1][kappa2][kappa3]
};
//=============================================
void DoComputation(MemoryManagerFFTW& mSingleLayerManager)
{
// Parallelize the outer loop.
#pragma omp parallel for
for (int kappa1 = 0; kappa1 < GRID_SIZE; ++kappa1)
{
for (int kappa2 = 0; kappa2 < GRID_SIZE; ++kappa2)
{
for (int kappa3 = 0; kappa3 < GRID_SIZE; ++kappa3)
{
#ifdef ENABLE_NOPS
nop<50>();
#endif
const double k2 = kappa1*kappa1 + kappa2*kappa2 + kappa3*kappa3;
for (int j = 0; j < 3; ++j)
{
// Compute and set new result.
const complex<double> curElement = mSingleLayerManager.GetElement(j, kappa1, kappa2, kappa3);
const complex<double> newElement = exp(-k2) * k2 * curElement;
mSingleLayerManager.SetElement(j, kappa1, kappa2, kappa3, newElement);
// Check if the result has been set correctly. This is sometimes false, but _not_ always when the result is incorrect.
const complex<double> test = mSingleLayerManager.GetElement(j, kappa1, kappa2, kappa3);
if (test != newElement)
printf("Failure: (%g, %g) != (%g, %g)\n", test.real(), test.imag(), newElement.real(), newElement.imag());
}
}
}
}
}
//=============================================
int main()
{
cout << "Max num. threads: " << omp_get_max_threads() << endl;
// Call DoComputation() once to get a reference-array.
MemoryManagerFFTW reference;
for (size_t i = 0; i < reference.GetTotalNumElements(); ++i)
reference.mAllElements[i] = complex<double>((double)i, (double)i+0.5);
DoComputation(reference);
// Call DoComputation() several times, and each time compare the result to the reference.
const size_t NUM = 1000;
for (size_t curTry = 0; curTry < NUM; ++curTry)
{
MemoryManagerFFTW mSingleLayerManager;
for (size_t i = 0; i < mSingleLayerManager.GetTotalNumElements(); ++i)
mSingleLayerManager.mAllElements[i] = complex<double>((double)i, (double)i+0.5);
DoComputation(mSingleLayerManager);
// Get the max. difference. This *should* be 0, but isn't from time to time.
double maxDiff = -1;
for (size_t i = 0; i < mSingleLayerManager.GetTotalNumElements(); ++i)
{
const complex<double> curDiff = mSingleLayerManager.mAllElements[i] - reference.mAllElements[i];
maxDiff = max(maxDiff, max(curDiff.real(), curDiff.imag()));
}
if (maxDiff != 0)
cout << curTry << "\t" << maxDiff << endl;
else
cout << curTry << "\t" << "Is exactly 0" << endl;
}
return 0;
}
As can be seen from the comments and Zboson's answer below, there was a bug in kernel 3.11.10-7. After an update to 3.15.0-1, my problem is gone, and the code works as it should.
The problem was due to a bug in Linux kernel 3.11.10-7. The bug may be related to how the kernel handles invalidating the TLB cache, as pointed out by Hristo Iliev. I guessed that the kernel might be the problem because I had read that Linux kernel 3.15 would bring some improvements for NUMA systems, so I figured the kernel version could matter for NUMA systems.
When the OP updated the Linux kernel of his NUMA system to 3.15.0-1, the problem went away.