OpenCL: Why does the performance differ so greatly between these two cases?

Tags:

opencl

Here are two pieces of code from an OpenCL kernel I'm working on; they display vastly different run-times.

The code is rather complicated, so I've simplified it right down.

This version runs in under one second:

for (int ii=0; ii<someNumber;ii++)
{
    for (int jj=0; jj<someNumber2;jj++)
    {
        value1 = value2 + value3;
        value1 = value1 * someFunction(a,b,c);
        double nothing = value1;
    }
}

and this version takes around 38 seconds to run:

for (int ii=0; ii<someNumber;ii++)
{
    for (int jj=0; jj<someNumber2;jj++)
    {
        value1 = value2 + value3;
        value1 = value1 * someFunction(a,b,c);
    }
    double nothing = value1;
}

As I say, the real code is somewhat more complicated than this (there are lots of other things going on in the loops), but the declaration of "nothing" really does just move from immediately before the inner loop's closing brace to immediately after it.

I'm very new to OpenCL, and I can't work out what is going on, much less how to fix it. Needless to say, the slow case is actually what I need in my implementation. I've tried messing around with address spaces (all variables here are in __private).

I can only imagine that for some reason the GPU is pushing the variable "value1" off into slower memory when the brace closes. Is this a likely explanation? What can I do?

Thanks in advance!

UPDATE: This version also runs in under one second, but uncommenting either line makes it revert to extreme slowness. No other changes were made to the loops, and value1 is still declared in the same place as before.

for (int ii=0; ii<someNumber;ii++)
{
    for (int jj=0; jj<someNumber2;jj++)
    {
//        value1 = value2 + value3;
//        value1 = value1 * someFunction(a,b,c);
    }
    double nothing = value1;
}

UPDATE 2: The code was actually nested in another loop like this, with the declaration of value1 as shown:

double value1=0;
for (int kk=0; kk<someNumber3;kk++)
{
    for (int ii=0; ii<someNumber;ii++)
    {
        for (int jj=0; jj<someNumber2;jj++)
        {
            value1 = value2 + value3;
            value1 = value1 * someFunction(a,b,c);
        }
        double nothing = value1;
    }
}

Moving where value1 is declared also gets us back to the speedy case:

for (int kk=0; kk<someNumber3;kk++)
{
    double value1=0;
    for (int ii=0; ii<someNumber;ii++)
    {
        for (int jj=0; jj<someNumber2;jj++)
        {
            value1 = value2 + value3;
            value1 = value1 * someFunction(a,b,c);
        }
        double nothing = value1;
    }
}

It seems OpenCL is an exceedingly tricky art! I still don't really understand what is going on, but at least I know how to fix it now!


asked Oct 07 '11 15:10

carthurs

1 Answer

What implementation are you using? I would expect any reasonable compiler to eliminate "double nothing = value1;" as dead code in every one of these cases.
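To illustrate the dead-code point: if nothing ever reads "nothing", a compiler is free to drop the assignment and possibly the loops that feed it, so the timings may not be measuring the work you think they are. Below is a minimal sketch in plain C (not the asker's actual kernel) of the same loop nest with the result stored to an output array so that it stays observable. The names some_function, run_loops and out are hypothetical, and the body of some_function is a placeholder; in an OpenCL kernel, out would presumably be a __global double * argument.

```c
#include <stddef.h>

/* Hypothetical stand-in for the question's someFunction(a,b,c);
   the real function is not shown in the question. */
static double some_function(double a, double b, double c)
{
    return a * b + c;
}

/* Same loop structure as the question, but value1 is stored to out[]
   after each inner loop, so the computation has an observable effect
   and cannot simply be discarded as dead code.
   out must have room for n_outer elements. */
void run_loops(double *out, int n_outer, int n_inner,
               double value2, double value3,
               double a, double b, double c)
{
    double value1 = 0.0;
    for (int ii = 0; ii < n_outer; ii++) {
        for (int jj = 0; jj < n_inner; jj++) {
            value1 = value2 + value3;
            value1 = value1 * some_function(a, b, c);
        }
        out[ii] = value1;  /* replaces "double nothing = value1;" */
    }
}
```

Timing a variant like this, where the result is actually consumed, should give a fairer comparison between the different placements of the declaration than timing code whose result is thrown away.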


answered Oct 21 '22 19:10

arsenm
