I'm running a camera acquisition program that performs processing on acquired images, and I'm using simple OpenMP directives for this processing. So basically I wait for an image from the camera, and then process it.
After migrating to VC2010, I see a very strange performance problem: under VC2010 my app takes nearly 100% CPU, while it takes only 10% under VC2008.
If I benchmark only the processing code, I get no difference between VC2010 and VC2008. The difference appears only when the acquisition functions are used.
I have reduced the code needed to reproduce the problem to a simple loop that does the following:
    for (int i=0; i<1000; ++i)
    {
        GetImage(buffer);//wait for image
        Copy2Array(buffer, my_array);
        long long sum = 0;//do some simple OpenMP parallel loop
        #pragma omp parallel for reduction(+:sum)
        for (int j=0; j<size; ++j)
            sum += my_array[j];
    }
This loop uses 5% of the CPU with 2008 and 70% with 2010.
Profiling shows that in 2010 most of the time is spent in OpenMP's vcomp100.dll!_vcomp::PartialBarrierN::Block
I have also done some concurrency profiling:
In 2008, the processing work is distributed over 3 worker threads, which are only lightly active because the processing time is much shorter than the time spent waiting for an image.
The same threads appear in 2010, but they are all 100% occupied by the PartialBarrierN::Block function. As I have four cores, these three threads account for 75% of the total CPU, which is roughly the CPU usage I observe.
So it looks like there is a conflict between OpenMP and the (proprietary) Matrox acquisition library. But is it a bug in VS2010 or in the Matrox library? Is there anything I can do? Using VC++2010 is mandatory for me, so I cannot just stick with 2008.
Big thanks
Using the new concurrency framework, as suggested by DeadMG, leads to 40% CPU. Profiling shows that the time is spent in processing, so it does not exhibit the bug I'm seeing with OpenMP, but performance in my case is much worse than with OpenMP.
I have installed an evaluation version of the latest Intel C++ compiler. It shows exactly the same performance problem!
I cross-posted to the MSDN forum.
Tested on Windows 7 64-bit and XP 32-bit, with exactly the same results (on the same machine).
I tested another acquisition board and the problem is identical, so the culprit is VC++2010. Microsoft made changes to its OpenMP implementation that break programs like mine, as a thread on the MSDN forums shows.
With OpenMP 3.0 the spin-wait can be deactivated via OMP_WAIT_POLICY:
_putenv_s( "OMP_WAIT_POLICY", "PASSIVE" );
The effect is basically the same as with kmp_set_blocktime(0) (an Intel OpenMP runtime extension), but as we set the environment variable OMP_WAIT_POLICY at runtime, it will only affect the current process and its child processes.
Of course OMP_WAIT_POLICY can also be set by a launcher application, e.g. Blender handles it that way.
A hotfix for VC2010 is available here; later versions such as VC2013 support it directly.
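As a minimal sketch of how this could be wired into a program like the one in the question: the helper names `set_passive_wait_policy` and `parallel_sum` are hypothetical, and the non-Windows branch using `setenv` is only there so the sketch also builds outside Visual C++. The sketch assumes the OpenMP runtime picks up the variable when it initializes, so the safe choice is to set it before the first parallel region runs.

```cpp
#include <cstdlib>
#include <cstring>
#include <vector>

// Hypothetical helper: ask the OpenMP runtime to sleep instead of spin
// while worker threads wait. Returns true if the variable was set.
bool set_passive_wait_policy()
{
#ifdef _WIN32
    return _putenv_s("OMP_WAIT_POLICY", "PASSIVE") == 0;
#else
    return setenv("OMP_WAIT_POLICY", "PASSIVE", 1) == 0;  // 1 = overwrite
#endif
}

// Same reduction as in the question, wrapped in a function.
// Without OpenMP enabled the pragma is ignored and the loop runs serially.
long long parallel_sum(const std::vector<int>& data)
{
    long long sum = 0;
    const int size = static_cast<int>(data.size());
    #pragma omp parallel for reduction(+:sum)
    for (int j = 0; j < size; ++j)
        sum += data[j];
    return sum;
}
```

In this arrangement `set_passive_wait_policy()` would be the first call in `main`, before the acquisition loop and before anything calls `parallel_sum`.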
You could try the new Concurrency Runtime that ships with VS2010, starting with your test sample.
That is,
    for (int i=0; i<1000; ++i)
    {
        GetImage(buffer);//wait for image
        Copy2Array(buffer, my_array);
        long long sum = 0;//do some simple OpenMP parallel loop
        #pragma omp parallel for reduction(+:sum)
        for (int j=0; j<size; ++j)
            sum += my_array[j];
    }
would become
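    for (int i=0; i<1000; ++i)
    {
        GetImage(buffer);//wait for image
        Copy2Array(buffer, my_array);
        // requires #include <ppl.h>; combinable keeps one partial sum per thread
        Concurrency::combinable<long long> partial;
        const int chunk = 1000;//elements handled per parallel_for iteration
        Concurrency::parallel_for(0, size / chunk, [&](int j) {
            long long local = 0;
            for (int k = 0; k < chunk; k++)
                local += my_array[(j * chunk) + k];
            partial.local() += local;
        });
        // merge the per-thread partial sums into the final result
        long long sum = partial.combine([](long long a, long long b) { return a + b; });
        // add the elements left over when size is not a multiple of chunk
        for (int k = (size / chunk) * chunk; k < size; ++k)
            sum += my_array[k];
    }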
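A chunked loop of this kind visits only whole blocks, so a size that is not a multiple of the block length needs a separate pass over the tail. The index arithmetic can be checked in isolation with a plain serial sketch; `chunked_sum` is a hypothetical name, and nothing here uses the Concurrency Runtime, it only mirrors the indexing.

```cpp
#include <cstddef>
#include <vector>

// Sum `data` block by block, using the same indexing as the chunked
// parallel loop, then add the tail that does not fill a whole block.
// `chunk` must be greater than zero.
long long chunked_sum(const std::vector<int>& data, std::size_t chunk)
{
    const std::size_t blocks = data.size() / chunk;  // whole blocks only
    long long sum = 0;
    for (std::size_t j = 0; j < blocks; ++j)         // the range a parallel loop would cover
        for (std::size_t k = 0; k < chunk; ++k)
            sum += data[j * chunk + k];
    for (std::size_t k = blocks * chunk; k < data.size(); ++k)  // leftover elements
        sum += data[k];
    return sum;
}
```

Dropping the final loop reproduces what happens when only `size / 1000` blocks are processed: the last `size % 1000` elements never reach the sum.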