I realize that the OpenMP reduction clause is only usable for POD types in C++. What would you do to implement a reduction for a complex type accumulator?
    complex<double> x(0.0, 0.0), y(1.0, 1.0);
    #pragma omp parallel for reduction(+:x)
    for (int i = 0; i < 5; i++)
    {
        x += y;
    }
(noting that I may have left some syntax out). It seems an obvious solution would be to split real and imaginary components into temporary doubles, then accumulate on those. I guess I'm looking for elegance, and that seems ... less than pretty. Would that be the typical approach here?
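For reference, the split-into-doubles idea from the question could look roughly like the sketch below. The function name `sum_by_components` and its parameters are made up for illustration; only the `reduction(+:...)` clause on plain doubles is standard OpenMP.

```cpp
#include <complex>

// Hypothetical helper: adds y to an accumulator n times, reducing the real
// and imaginary parts as two ordinary doubles (which OpenMP can handle).
std::complex<double> sum_by_components(const std::complex<double>& y, int n)
{
    double re = 0.0;
    double im = 0.0;

    // Both scalars are reduced with the built-in '+' reduction.
    #pragma omp parallel for reduction(+:re, im)
    for (int i = 0; i < n; ++i)
    {
        re += y.real();
        im += y.imag();
    }

    // Reassemble the complex result after the parallel loop.
    return std::complex<double>(re, im);
}
```

Calling `sum_by_components(std::complex<double>(1.0, 1.0), 5)` corresponds to the loop in the question. If OpenMP is not enabled, the pragma is ignored and the loop simply runs serially.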
The typical workaround in the absence of user-defined reductions in OpenMP is even uglier than what you suggested. Usually, before the parallel region, people create an array with at least as many elements as there will be threads in the region, accumulate partial results separately for each thread using omp_get_thread_num() as an index into the array, and do the final reduction of the accumulated results in a loop after the parallel region.
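A minimal sketch of that per-thread array workaround might look like this. The function name `sum_per_thread` is hypothetical, and sizing the array with `omp_get_max_threads()` is an assumption that holds as long as the region does not request more threads via a `num_threads` clause. The `#else` branch only provides serial stand-ins so the sketch also builds without OpenMP.

```cpp
#include <complex>
#include <cstddef>
#include <vector>

#ifdef _OPENMP
#include <omp.h>
#else
// Serial stand-ins so the sketch still builds when OpenMP is disabled.
static int omp_get_max_threads() { return 1; }
static int omp_get_thread_num() { return 0; }
#endif

// Hypothetical helper: adds y to an accumulator n times using one
// partial sum per thread, combined serially afterwards.
std::complex<double> sum_per_thread(const std::complex<double>& y, int n)
{
    // One slot per possible thread; elements are value-initialized to (0,0).
    std::vector<std::complex<double> > partial(omp_get_max_threads());

    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
    {
        // Each thread only ever touches its own slot, so no locking is needed.
        partial[omp_get_thread_num()] += y;
    }

    // Final reduction over the per-thread partial results.
    std::complex<double> total(0.0, 0.0);
    for (std::size_t t = 0; t < partial.size(); ++t)
    {
        total += partial[t];
    }
    return total;
}
```

Adjacent slots of the array can share a cache line, so for hot loops it may be worth accumulating into a local variable inside the parallel region and writing to the array once per thread.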
As far as I know, the OpenMP language committee is working on adding user-defined reductions to the specification, so maybe this will finally be resolved in a few years.
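For readers whose compiler implements OpenMP 4.0 or later, user-defined reductions are available through the `declare reduction` directive. The sketch below assumes such a compiler; the reduction identifier `cplx_add` and the function name `sum_with_udr` are made up for illustration.

```cpp
#include <complex>

// Assumes OpenMP 4.0+. Declares a reduction named cplx_add (a hypothetical
// name) for std::complex<double>: partial results are combined with +=, and
// each thread's private copy starts at (0,0).
#pragma omp declare reduction(cplx_add : std::complex<double> : omp_out += omp_in) \
    initializer(omp_priv = std::complex<double>(0.0, 0.0))

// Hypothetical helper mirroring the loop from the question.
std::complex<double> sum_with_udr(const std::complex<double>& y, int n)
{
    std::complex<double> x(0.0, 0.0);

    // The user-defined reduction is used like a built-in one.
    #pragma omp parallel for reduction(cplx_add : x)
    for (int i = 0; i < n; ++i)
    {
        x += y;
    }
    return x;
}
```

On compilers without this support the per-thread array approach described above remains the fallback.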
Sorry, OpenMP simply doesn't support that at this time. Unfortunately, you need to do the parallel reduction in the ugly way you already described.
However, if such a parallel reduction is really frequent, I'd make a construct similar to parallel_reduce in TBB. Implementing such a construct is fairly straightforward. Cilk Plus has a more powerful reducer object, but I didn't check whether it supports non-POD types.
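One way such a construct could be sketched on top of OpenMP is shown below. This is not TBB's actual `parallel_reduce` interface: the name `parallel_reduce_sketch` and its parameters (`identity`, `body`, `join`) are assumptions chosen for illustration. Each thread keeps a local accumulator, and the locals are merged inside a critical section.

```cpp
#include <complex>

// Hypothetical reduce helper for arbitrary copyable types.
//   identity : starting value for every accumulator
//   body     : callable (int i, T acc) -> T, folds iteration i into acc
//   join     : callable (T a, T b) -> T, merges two partial results
template <typename T, typename Body, typename Join>
T parallel_reduce_sketch(int begin, int end, T identity, Body body, Join join)
{
    T result = identity;

    #pragma omp parallel
    {
        // Per-thread accumulator, copy-constructed from identity.
        T local = identity;

        // Split the iterations among threads; no barrier is needed here
        // because the critical section below only uses this thread's local.
        #pragma omp for nowait
        for (int i = begin; i < end; ++i)
        {
            local = body(i, local);
        }

        // Merge one thread at a time into the shared result.
        #pragma omp critical
        result = join(result, local);
    }
    return result;
}
```

With C++11 lambdas, the loop from the question would become something like `parallel_reduce_sketch(0, 5, std::complex<double>(0.0, 0.0), [&](int, std::complex<double> acc) { return acc + y; }, [](std::complex<double> a, std::complex<double> b) { return a + b; })`. Because the local accumulator is an ordinary C++ object, its constructor runs as usual, which sidesteps the POD restriction at the cost of a critical section per thread.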
FYI, the same kind of restriction can also be found in the threadprivate pragma. I've tested with VC++ 2008/2010 and the Intel compiler (icc). VC++ doesn't support threadprivate with a struct/class that has a constructor or destructor (or with a scalar variable that requires a function call to be initialized); it fails with error C3057, "dynamic initialization of 'threadprivate' symbols". You may want to read the MSDN page on this error as well. However, icc accepts the code that triggers C3057. So at least two major implementations differ on this point.
I guess that supporting parallel reduction on non-POD types would run into a similar problem. To support a parallel reduction, each parallel section has to allocate a thread-local variable for the reduction variable. So, if a given reduction variable is non-POD, the implementation may need to call a user-defined constructor. This is the same problem I mentioned in the case of C3057.