Here are two pieces of code from an OpenCL kernel I'm working on; they have vastly different run times.
The code is rather complicated, so I've simplified it right down.
This version runs in under one second:
for (int ii = 0; ii < someNumber; ii++)
{
    for (int jj = 0; jj < someNumber2; jj++)
    {
        value1 = value2 + value3;
        value1 = value1 * someFunction(a, b, c);
        double nothing = value1;   // read inside the inner loop
    }
}
and this version takes around 38 seconds to run:
for (int ii = 0; ii < someNumber; ii++)
{
    for (int jj = 0; jj < someNumber2; jj++)
    {
        value1 = value2 + value3;
        value1 = value1 * someFunction(a, b, c);
    }
    double nothing = value1;       // read after the inner loop closes
}
As I say, the code is somewhat more complicated than this (there are lots of other things going on in the loops), but the variable "nothing" really does move from immediately before to immediately after the brace.
I'm very new to OpenCL, and I can't work out what is going on, much less how to fix it. Needless to say, the slow case is actually what I need in my implementation. I've tried messing around with address spaces (all variables here are in __private).
I can only imagine that for some reason the GPU is pushing the variable "value1" off into slower memory when the brace closes. Is this a likely explanation? What can I do?
Thanks in advance!
UPDATE: This also runs in under one second, but uncommenting either line makes it extremely slow again. No other changes were made to the loops, and value1 is still declared in the same place as before.
for (int ii = 0; ii < someNumber; ii++)
{
    for (int jj = 0; jj < someNumber2; jj++)
    {
        // value1 = value2 + value3;
        // value1 = value1 * someFunction(a, b, c);
    }
    double nothing = value1;
}
UPDATE 2: The code was actually nested in another loop like this, with the declaration of value1 as shown:
double value1 = 0;                 // declared outside all three loops
for (int kk = 0; kk < someNumber3; kk++)
{
    for (int ii = 0; ii < someNumber; ii++)
    {
        for (int jj = 0; jj < someNumber2; jj++)
        {
            value1 = value2 + value3;
            value1 = value1 * someFunction(a, b, c);
        }
        double nothing = value1;
    }
}
Moving where value1 is declared also gets us back to the speedy case:
for (int kk = 0; kk < someNumber3; kk++)
{
    double value1 = 0;             // declared inside the outermost loop
    for (int ii = 0; ii < someNumber; ii++)
    {
        for (int jj = 0; jj < someNumber2; jj++)
        {
            value1 = value2 + value3;
            value1 = value1 * someFunction(a, b, c);
        }
        double nothing = value1;
    }
}
It seems OpenCL is an exceedingly tricky art! I still don't really understand what is going on, but at least I know how to fix it now!
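For anyone who wants to try the fast arrangement outside the full kernel, here is a minimal sketch of it in plain C rather than OpenCL C. It is only an illustration under assumptions: the body of some_function, the out buffer, and the function name run_sweeps are all made up for this example and are not part of my real code. The one deliberate difference from the snippets above is that the value read after the inner loop is stored into out instead of an unused local, so the work has a visible result that can be checked.

```c
#include <stddef.h>

/* Hypothetical stand-in for someFunction(a, b, c); the real one is not shown
   in the post. */
static double some_function(double a, double b, double c)
{
    return a * b + c;
}

/* Sketch of the "fast" layout: value1 is declared inside the kk loop, so its
   lifetime ends with each kk iteration. The value read after the inner loop
   is written to out (an assumed buffer of n_outer * n_mid doubles) rather
   than to an unused local. */
void run_sweeps(double *out, size_t n_outer, size_t n_mid, size_t n_inner,
                double value2, double value3)
{
    for (size_t kk = 0; kk < n_outer; kk++) {
        double value1 = 0.0;                  /* narrowest useful scope */
        for (size_t ii = 0; ii < n_mid; ii++) {
            for (size_t jj = 0; jj < n_inner; jj++) {
                value1 = value2 + value3;
                value1 = value1 * some_function(1.0, 2.0, 3.0);
            }
            out[kk * n_mid + ii] = value1;    /* observable use of value1 */
        }
    }
}
```

In an actual kernel the same loop body would presumably sit inside a __kernel function with out as a __global pointer; that part is untested here.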
What implementation are you using? I would expect the "double nothing = value1;" to be eliminated as dead code in any of the cases by any reasonable compiler.