I have a fairly complex program that runs into strange behavior when build with OpenMP in MSVC 2010 Debug mode. I have tried my best to construct the following minimal working example (though it is not really minimal) which minic the structure of the real program.
#include <vector>
#include <cassert>
// A class take points to the whole collection and a position Only allow access
// to the elements at that posiiton. It provide read-only access to query some
// information about the whole collection
class Element
{
public :
Element (int i, std::vector<double> *src) : i_(i), src_(src) {}
int i () const {return i_;}
int size () const {return src_->size();}
double src () const {return (*src_)[i_];}
double &src () {return (*src_)[i_];}
private :
const int i_;
std::vector<double> *const src_;
};
// A Base class for dispatch
template <typename Derived>
class Base
{
protected :
void eval (int dim, Element elem, double *res)
{
// Dispatch the call from Evaluation<Derived>
eval_dispatch(dim, elem, res, &Derived::eval); // Point (2)
}
private :
// Resolve to Derived non-static member eval(...)
template <typename D>
void eval_dispatch(int dim, Element elem, double *res,
void (D::*) (int, Element, double *))
{
#ifndef NDEBUG // Assert that this is a Derived object
assert((dynamic_cast<Derived *>(this)));
#endif
static_cast<Derived *>(this)->eval(dim, elem, res);
}
// Resolve to Derived static member eval(...)
void eval_dispatch(int dim, Element elem, double *res,
void (*) (int, Element, double *))
{
Derived::eval(dim, elem, res); // Point (3)
}
// Resolve to Base member eval(...), Derived has no this member but derived
// from Base
void eval_dispatch(int dim, Element elem, double *res,
void (Base::*) (int, Element, double *))
{
// Default behavior: do nothing
}
};
// A middle-man who provides the interface operator(), call Base::eval, and
// Base dispatch it to possible default behavior or Derived::eval
template <typename Derived>
class Evaluator : public Base<Derived>
{
public :
void operator() (int N , int dim, double *res)
{
std::vector<double> src(N);
for (int i = 0; i < N; ++i)
src[i] = i;
#pragma omp parallel for default(none) shared(N, dim, src, res)
for (int i = 0; i < N; ++i) {
assert(i < N);
double *r = res + i * dim;
Element elem(i, &src);
assert(elem.i() == i); // Point (1)
this->eval(dim, elem, r);
}
}
};
// Client code, who implements eval
class Implementation : public Evaluator<Implementation>
{
public :
static void eval (int dim, Element elem, double *r)
{
assert(elem.i() < elem.size()); // This is where the program fails Point (4)
for (int d = 0; d != dim; ++d)
r[d] = elem.src();
}
};
int main ()
{
const int N = 500000;
const int Dim = 2;
double *res = new double[N * Dim];
Implementation impl;
impl(N, Dim, res);
delete [] res;
return 0;
}
The real program does not have vector
etc. But the Element
, Base
, Evaluator
and Implementation
captures the basic structure of the real program. When build in Debug mode, and running the debugger, the assertion fails at Point (4)
.
Here is some more details of the debug informations, by viewing the call stacks,
At entering Point (1)
, the local i
has value 371152
, which is fine. The variable elem
does not shown up in the frame, which is a little strange. But since the assertion at Point (1)
does not faile, I guess it is fine.
Then, crazy things happened. The call to eval
by Evaluator
resolves to its base class, and so Point (2)
was exectuted. At this point, the debugers shows that the elem
has i_ = 499999
, which is no longer the i
used to create elem
in Evaluator
before passing it by value to Base::eval
. The next point, it resolves to Point (3)
, this time, elem
has i_ = 501682
, which is out of range, and this is the value when the call is directed to Point (4)
and failed the assertion.
It looks like whenever Element
object is passed by value, the value of its members are changed. Rerun the program multiple times, similar behaviors happens though not always reproducible. In the real program, this class is designed to like an iterator, which iterate over a collection of particles. Though the thing it iterate is not exaclty like a container. But anyway, the point is that it is small enough to be efficiently passed by value. And therefore, the client code, knows that it has its own copy of Element
instead of some reference or pointer, and does not need to worry about thread-safe (much) as long as he sticks with Element
's interface, which only provide write access to a single position of the whole collection.
I tried the same program with GCC and Intel ICPC. Nothing un-expected happens. And in the real program, correct results where produced.
Did I used OpenMP wrongly somewhere? I thought that the elem
created at about Point (1)
shall be local to the loop body. In addition, in the whole program, no value bigger than N
was produced, so where does the those new value comes from?
Edit
I looked more carefully into the debugger, it shows that while elem.i_
was changed when elem
was passed by value, the pointer elem.src_
does not change with it. It has the same value (of the memory address) after passed by value
Edit: Compiler flags
I used CMake to generate the MSVC solution. I have to confess I have no idea how to use MSVC or Windows in general. The only reason I am using it is that I know a lot of people use it so I want to test my library against it to workaround any problems.
The CMake generated project, using Visual Studio 10 Win64
target, the compiler flags appears to be
/DWIN32 /D_WINDOWS /W3 /Zm1000 /EHsc /GR /D_DEBUG /MDd /Zi /Ob0 /Od /RTC1
And here is the command line found in Property Pages-C/C++-Command Line
/Zi /nologo /W3 /WX- /Od /Ob0 /D "WIN32" /D "_WINDOWS" /D "_DEBUG" /D "CMAKE_INTDIR=\"Debug\"" /D "_MBCS" /Gm- /EHsc /RTC1 /MDd /GS /fp:precise /Zc:wchar_t /Zc:forScope /GR /openmp /Fp"TestOMP.dir\Debug\TestOMP.pch" /Fa"Debug" /Fo"TestOMP.dir\Debug\" /Fd"C:/Users/Yan Zhou/Dropbox/Build/TestOMP/build/Debug/TestOMP.pdb" /Gd /TP /errorReport:queue
Is there anything suspecious here?
MSVC compiles OpenMP applications against its own runtime that is built around the Win32 ThreadPool API. Both most likely don't play nice. It is only safe to use the parallel MKL driver with OpenMP code built using Intel C/C++/Fortran compilers.
In Visual Studio 2019 version 16.9 the -openmp:llvm switch only works on the x64 architecture. The new switch currently supports all the same OpenMP 2.0 directives as -openmp, as well as support for unsigned integer indices in parallel for loops according to the OpenMP 3.0 standard. Support for more directives will be added in future releases.
In the initial release of Visual Studio 2019 we added the -openmp:experimental switch to enable minimal support for the OpenMP SIMD directive first introduced in the OpenMP 4.0 standard. Starting with Visual Studio 2019 version 16.9 we have begun adding experimental support for newer versions of the OpenMP standard in a more systematic way.
Just that a single file containing the code in the question and compiled with openmp flag. And this test program fails the same way as the real program. In the real program, only header only libraries was used. So basically all program was compiled at the time with the same flag. The only thing compiled at another time is the MSVC openmp runtime
Apparently the 64-bit OpenMP implementation in MSVC is not compatible with code, compiled without optimisations.
To debug your issue, I've modified your code to save the iteration number to a threadprivate
global variable just before the call to this->eval()
and then added a check at the beginning of Implementation::eval()
to see if the saved iteration number differs from elem.i_
:
static int _iter;
#pragma omp threadprivate(_iter)
...
#pragma omp parallel for default(none) shared(N, dim, src, res)
for (int i = 0; i < N; ++i) {
assert(i < N);
double *r = res + i * dim;
Element elem(i, &src);
assert(elem.i() == i); // Point (1)
_iter = i; // Save the iteration number
this->eval(dim, elem, r);
}
}
...
...
static void eval (int dim, Element elem, double *r)
{
// Check for difference
if (elem.i() != _iter)
printf("[%d] _iter=%x != %x\n", omp_get_thread_num(), _iter, elem.i());
assert(elem.i() < elem.size()); // This is where the program fails Point (4)
for (int d = 0; d != dim; ++d)
r[d] = elem.src();
}
...
It appears that randomly the value of elem.i_
becomes a bad mixture of the values passed in the different threads to void eval_dispatch(int dim, Element elem, double *res, void (*) (int, Element, double *))
. This happens hunderds of times in each run but you only see it once the value of elem.i_
becomes large enough to trigger the assertion. Sometimes it happens that the mixed value does not exceed the size of the container and then the code completes execution without assertion. Also what you see during the debug session after the assertion is the inability of the VS debugger to cope properly with the multithreaded code :)
This only happens in unoptimised 64-bit mode. It doese not happen in 32-bit code (both debug and release). It also does not happen in 64-bit release code unless optimisations are disabled. It also does not happen if one puts the call to this->eval()
in a critical section:
#pragma omp parallel for default(none) shared(N, dim, src, res)
for (int i = 0; i < N; ++i) {
...
#pragma omp critical
this->eval(dim, elem, r);
}
}
but doing this would cancel the benefits of OpenMP. This shows that something further down the call chain is performed in unsafe way. I've examined the assembly code but couldn't find the exact reason. I'm really puzzled since MSVC implements the implicit copy constructor of the Element
class using simple bitwise copy (it is even inline) and all operations are done on the stack.
This reminds me of the fact that the Sun's (now Oracle's) compiler insists that it should crank up the level of optimisation if one enables OpenMP support. Unfortunately the documentation of /openmp
option in MSDN says nothing about possible intereference that might come from the "wrong" optimisation level. This also might be a bug. I should test with another version of VS if I can access one.
Edit: I dug deeper as promised and run the code in Intel Parallel Inspector 2011. It found one data race pattern as expected. Apparently when this line is executed:
this->eval(dim, elem, r);
a temporary copy of elem
is created and passed by address to the eval()
method as is required by the Windows x64 ABI. And here comes the strange thing: the location of this temporary copy is not on the stack of the funclet that implements the parallel region (MSVC compiler calles it Evaluator$omp$1<Implementation>::operator()
by the way) as one would expect but rather its address is taken as the first argument of the funclet. As this argument is one and the same in all threads, it means that the temporary copy that gets further passed to this->eval()
is actually shared among all threads, which is ridiculous but is still true as one can easily observe:
...
void eval (int dim, Element elem, double *res)
{
printf("[%d] In Base::eval() &elem = %p\n", omp_get_thread_num(), &elem);
// Dispatch the call from Evaluation<Derived>
eval_dispatch(dim, elem, res, &Derived::eval); // Point (2)
}
...
...
#pragma omp parallel for default(none) shared(N, dim, src, res)
for (int i = 0; i < N; ++i) {
...
Element elem(i, &src);
...
printf("[%d] In parallel region &elem = %p\n", omp_get_thread_num(), &elem);
this->eval(dim, elem, r);
}
}
...
Running this code produces an output similar to this:
[0] Parallel region &elem = 000000000030F348 (a)
[0] Base::eval() &elem = 000000000030F630
[0] Parallel region &elem = 000000000030F348 (a)
[0] Base::eval() &elem = 000000000030F630
[1] Parallel region &elem = 000000000292F9B8 (b)
[1] Base::eval() &elem = 000000000030F630 <---- !!
[1] Parallel region &elem = 000000000292F9B8 (b)
[1] Base::eval() &elem = 000000000030F630 <---- !!
As expected elem
has different addresses in each thread executing the parallel region (points (a)
and (b)
). But observe that the temporary copy that gets passed to Base::eval()
has the same address in each thread. I believe that this is a compiler bug that makes the implicit copy constructor of Element
use a shared variable. This could be easily validated by looking at the address passed to Base::eval()
- it lies somewhere between the address of N
and that of src
, i.e. in the shared variables block. Further inspection of the assembly source reveals that indeed the address of the temporary place is passed as argument to the _vcomp_fork()
function from vcomp100.dll
that implements the fork part of the OpenMP fork/join model.
Since basically there are no compiler options that can influence this behaviour apart from enabling optimisations which leads to Base::eval()
, Base::eval_dispatch()
, and Implementation::eval()
all being inlined and hence no temporary copies of elem
are ever made, the only work-arounds that I have found are:
1) Make the Element elem
argument to Base::eval()
a reference:
void eval (int dim, Element& elem, double *res)
{
eval_dispatch(dim, elem, res, &Derived::eval); // Point (2)
}
This ensures that the local copy of elem
in the stack of the funclet that implements the parallel region in Evaluator<Implementation>::operator()
is passed and not the shared temporary copy. This gets further passed by value as another temporary copy to Base::eval_dispatch()
but it retains its correct value as this new temporary copy is in the stack of Base::eval()
and not in the shared variables block.
2) Provide an explicit copy constructor to Element
:
Element (const Element& e) : i_(e.i_), src_(e.src_) {}
I would recommend that you go with the explicit copy constructor as it does not require further changes in the source code.
Apparently this behaviour is also present in MSVS 2008. I would have to check if it is also present in MSVS 2012 and possibly file a bug report with MS.
This bug does not show in 32-bit code as there the whole value of each passed by value object is pushed on the call stack and not only a pointer to it.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With