
OpenCL: Why does the performance differ so greatly between these two cases?

Here are two pieces of code from an OpenCL kernel I'm working on; they show vastly different run times.

The code is rather complicated, so I've simplified it right down.

This version runs in under one second:

for (int ii=0; ii<someNumber;ii++)
{
    for (int jj=0; jj<someNumber2;jj++)
    {
        value1 = value2 + value3;
        value1 = value1 * someFunction(a,b,c);
        double nothing = value1;
    }
}

and this version takes around 38 seconds to run:

for (int ii=0; ii<someNumber;ii++)
{
    for (int jj=0; jj<someNumber2;jj++)
    {
        value1 = value2 + value3;
        value1 = value1 * someFunction(a,b,c);
    }
    double nothing = value1;
}

As I say, the real code is somewhat more complicated than this (there are lots of other things going on in the loops), but the declaration of "nothing" really does just move from immediately before the inner loop's closing brace to immediately after it.

I'm very new to OpenCL, and I can't work out what is going on, much less how to fix it. Needless to say, the slow case is actually what I need in my implementation. I've tried messing around with address spaces (all variables here are in __private).

I can only imagine that for some reason the GPU is pushing the variable "value1" off into slower memory when the brace closes. Is this a likely explanation? What can I do?

Thanks in advance!

UPDATE: This also runs in under one second, but uncommenting either line makes it revert to extreme slowness. No other changes were made to the loops, and value1 is still declared in the same place as before.

for (int ii=0; ii<someNumber;ii++)
{
    for (int jj=0; jj<someNumber2;jj++)
    {
//        value1 = value2 + value3;
//        value1 = value1 * someFunction(a,b,c);
    }
    double nothing = value1;
}

UPDATE 2: The code was actually nested in another loop like this, with the declaration of value1 as shown:

double value1=0;
for (int kk=0; kk<someNumber3;kk++)
{
    for (int ii=0; ii<someNumber;ii++)
    {
        for (int jj=0; jj<someNumber2;jj++)
        {
            value1 = value2 + value3;
            value1 = value1 * someFunction(a,b,c);
        }
        double nothing = value1;
    }
}

Moving where value1 is declared also gets us back to the speedy case:

for (int kk=0; kk<someNumber3;kk++)
{
    double value1=0;
    for (int ii=0; ii<someNumber;ii++)
    {
        for (int jj=0; jj<someNumber2;jj++)
        {
            value1 = value2 + value3;
            value1 = value1 * someFunction(a,b,c);
        }
        double nothing = value1;
    }
}

It seems OpenCL is an exceedingly tricky art! I still don't really understand what is going on, but at least I know how to fix it now!

asked Oct 07 '11 15:10 by carthurs



1 Answer

What implementation are you using? I would expect the "double nothing = value1;" to be eliminated as dead code in any of the cases by any reasonable compiler.

answered Oct 21 '22 19:10 by arsenm
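
The dead-code point in the answer can be illustrated with a small sketch. This is plain C rather than an OpenCL kernel, and it is not the asker's real code: some_function, the loop bounds, and the out buffer are hypothetical stand-ins introduced only for illustration. In the first variant nothing observable depends on value1, so an optimizing compiler is allowed to drop the loops altogether, which can make timings look misleadingly fast. In the second variant the result is written to caller-visible memory, so the work has to be done.

```c
#include <stddef.h>

/* Hypothetical stand-in for the question's someFunction(a, b, c). */
static double some_function(double a, double b, double c)
{
    return a * b + c;
}

/* Variant A: value1 is only copied into an unused local.
 * No observable result depends on the loops, so a compiler
 * may legally remove them as dead code. */
static void compute_unused(int n_outer, int n_inner,
                           double value2, double value3,
                           double a, double b, double c)
{
    double value1 = 0.0;
    for (int ii = 0; ii < n_outer; ii++) {
        for (int jj = 0; jj < n_inner; jj++) {
            value1 = value2 + value3;
            value1 = value1 * some_function(a, b, c);
        }
        double nothing = value1; /* dead store, as in the question */
        (void)nothing;           /* silences unused-variable warnings */
    }
}

/* Variant B: value1 is stored through a pointer the caller can read,
 * so the computation stays live. out must hold n_outer elements. */
static void compute_stored(int n_outer, int n_inner,
                           double value2, double value3,
                           double a, double b, double c,
                           double *out)
{
    double value1 = 0.0;
    for (int ii = 0; ii < n_outer; ii++) {
        for (int jj = 0; jj < n_inner; jj++) {
            value1 = value2 + value3;
            value1 = value1 * some_function(a, b, c);
        }
        out[ii] = value1; /* observable result, cannot be eliminated */
    }
}
```

In a kernel, the analogue of out would presumably be a __global buffer argument. When comparing timings of two variants, it is worth checking that both write their result somewhere observable, otherwise one of them may be measuring an empty loop.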