 

My OpenCL kernel is slower on faster hardware. But why?

As I was finishing my project for a multicore programming class, I came across something really weird that I wanted to discuss with you.

We were asked to create a program that would show significant improvement when programmed for a multi-core platform. I decided to code something on the GPU to try out OpenCL. I chose the matrix convolution problem since I'm quite familiar with it (I've parallelized it before with open_mpi, with great speedup for large images).

So here it is: I select a large GIF file (2.5 MB, 2816×2112), run the sequential version (original code), and get an average of 15.3 seconds.

I then run the new OpenCL version I just wrote on my MBP's integrated GeForce 9400M and get timings of 1.26 s on average. So far so good: that's a 12× speedup!

But now I go into my Energy Saver panel and turn on "Graphic Performance Mode". That mode turns off the GeForce 9400M and turns on the GeForce 9600M GT my system has. Apple says this card is twice as fast as the integrated one.

Guess what: my timings using the kick-ass graphics card average 3.2 seconds. My 9600M GT seems to be more than two times slower than the 9400M.

For those of you who are OpenCL-inclined: I copy all data to device buffers before starting, so the actual computation doesn't require round trips to main RAM. Also, I let OpenCL determine the optimal local work-size, as I've read that implementations are pretty good at figuring that parameter out.
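For reference, the enqueue looks roughly like this; I pass NULL for the local work-size so the runtime picks it (the names here are simplified placeholders, not my actual variables):

    /* Sketch: one work-item per pixel, letting the OpenCL runtime
       pick the local work-size by passing NULL. 'queue', 'kernel'
       and 'kernelEvent' are placeholder names. */
    size_t global[2] = {2816, 2112};
    cl_event kernelEvent;
    cl_int err = clEnqueueNDRangeKernel(queue, kernel,
                                        2,       /* work_dim */
                                        NULL,    /* no global offset */
                                        global,
                                        NULL,    /* local size: runtime's choice */
                                        0, NULL, /* no wait list */
                                        &kernelEvent);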

Does anyone have a clue?

Edit: full source code with makefiles is here: http://www.mathieusavard.info/convolution.zip

cd gimage
make
cd ../clconvolute
make
Then put a large input.gif in clconvolute and run it to see the results.
asked Apr 12 '10 by matdumsa


3 Answers

The 9400M is integrated into your memory controller, whereas the 9600M GT is a discrete card connected to your memory controller via the PCIe bus. This means that when you transfer memory to the 9400M, it is just allocated in system RAM. The 9600M GT, on the other hand, sends the data over PCIe to the dedicated graphics memory on the card. This transfer is what is making your benchmark seem slower.
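As an aside, on an integrated part like the 9400M you can sometimes avoid that copy entirely by letting the device use the host allocation directly; a minimal sketch, assuming context, imageBytes, and imageData stand in for your own variables:

    /* Sketch: on integrated GPUs, CL_MEM_USE_HOST_PTR can let the
       device read the host allocation directly instead of copying.
       'context', 'imageBytes' and 'imageData' are placeholders. */
    cl_int err;
    cl_mem inputBuf = clCreateBuffer(context,
                                     CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR,
                                     imageBytes, imageData, &err);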

If you would like to compare the performance of the two graphics cards, you should use the OpenCL profiling functions instead of the clock function you are currently using.
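One prerequisite worth flagging: the profiling counters are only recorded if the command queue was created with profiling enabled. A minimal sketch, with 'context' and 'device' as placeholders:

    /* Profiling queries fail with CL_PROFILING_INFO_NOT_AVAILABLE
       unless the queue was created with CL_QUEUE_PROFILING_ENABLE. */
    cl_int err;
    cl_command_queue queue = clCreateCommandQueue(context, device,
                                                  CL_QUEUE_PROFILING_ENABLE,
                                                  &err);

The query itself is: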

cl_int clGetEventProfilingInfo (cl_event event, cl_profiling_info param_name, size_t param_value_size, void *param_value, size_t *param_value_size_ret)

Pass the function the event that was created when you enqueued the kernel, with CL_PROFILING_COMMAND_START as the second argument to get the starting point of the kernel in nanoseconds, and CL_PROFILING_COMMAND_END to get the ending point. Make sure to use this command AFTER the execution of the kernel has finished (the events hold their values until they go out of scope). You can also get the time it took to transfer the data to the device by applying this function to the events from enqueueing the buffers. Here is an example:

        TRACE("Invoking the Kernel")
    cl::vector<cl::Event> matMultiplyEvent;
    cl::NDRange gIndex(32,64);
    cl::NDRange lIndex(16,16);

    err = queueList["GPU"]->enqueueNDRangeKernel(
                                                 matrixMultiplicationKernel, 
                                                 NULL, 
                                                 gIndex, 
                                                 lIndex, 
                                                 &bufferEvent,
                                                 matMultiplyEvent);
    checkErr(err, "Invoke Kernel");


    TRACE("Reading device data into array");
    err = queueList["GPU"]->enqueueReadBuffer(thirdBuff, 
                                              CL_TRUE,
                                              0,
                                              (matSize)*sizeof(float),
                                              testC,
                                              &matMultiplyEvent,
                                              bufferEvent);
    checkErr(err, "Read Buffer");
    matMultiplyEvent[0].wait();
    for (int i = 0; i < matSize; i++) {
        if (i%64 == 0) {
            std::cout << "\n";
        }
        std::cout << testC[i] << "\t";
    }
    long transferBackStart = bufferEvent[0].getProfilingInfo<CL_PROFILING_COMMAND_START>();
    long transferBackEnd = bufferEvent[0].getProfilingInfo<CL_PROFILING_COMMAND_END>();
    double transferBackSeconds = 1.0e-9 * (double)(transferBackEnd- transferBackStart);

    long matrixStart = matMultiplyEvent[0].getProfilingInfo<CL_PROFILING_COMMAND_START>();
    long matrixEnd = matMultiplyEvent[0].getProfilingInfo<CL_PROFILING_COMMAND_END>();
    double dSeconds = 1.0e-9 * (double)(matrixEnd - matrixStart);

This example uses the C++ wrapper, but the concept is the same in C.
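In plain C, the same query would look roughly like this, using the clGetEventProfilingInfo signature shown above; 'queue' and 'kernelEvent' are placeholder names:

    /* Sketch of the same query in C. 'queue' and 'kernelEvent'
       are placeholder names. */
    cl_ulong start = 0, end = 0;
    clFinish(queue);  /* make sure the kernel has actually completed */
    clGetEventProfilingInfo(kernelEvent, CL_PROFILING_COMMAND_START,
                            sizeof(cl_ulong), &start, NULL);
    clGetEventProfilingInfo(kernelEvent, CL_PROFILING_COMMAND_END,
                            sizeof(cl_ulong), &end, NULL);
    double seconds = (double)(end - start) * 1.0e-9;  /* counters are in ns */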

Hope this helps.

answered by Umar Arshad


I get the same results, and I'm unsure why. My kernel involves very minimal copying to/from the device (I send all the data needed for all kernel calls up front, and only a 512x512 image comes back). It's a raytracer, so the kernel work vastly outweighs the copy back (400+ ms versus 10 ms). Still, the 9600M GT is about 1.5x-2x slower.

According to nVidia's listing, the 9600M GT should have 32 SPs (twice the number of the 9400M). It's presumably clocked higher too.

The 9600M GT does seem faster in some cases, e.g. games. See this lookup: http://www.videocardbenchmark.net/video_lookup.php?cpu=GeForce+9600M+GT

According to Ars Technica:

Furthermore, an interesting tidbit about Snow Leopard's implementation is revealed by early tests. Though Snow Leopard doesn't seem to enable dual GPUs or on-the-fly GPU switching for machines using the NVIDIA GeForce 9400M chipset—a limitation carried over from Leopard—it does appear that the OS can use both as OpenCL resources simultaneously. So even if you have the 9600M GT enabled on your MacBook Pro, if OpenCL code is encountered in an application, Snow Leopard can send that code to be processed by the 16 GPU cores sitting pretty much dormant in the 9400M. The converse is not true, though—when running a MacBook Pro with just the 9400M enabled, the 9600M GT is shut down entirely to save power, and can't be used as an OpenCL resource.

This seems to be the opposite of what we are seeing. Also, I am explicitly setting up a CL context on only one device at a time.

There are some suggestions in the Ars forums that the 9600M GT doesn't support doubles as well, which would explain this problem. I might try to write up a synthetic benchmark to test this hypothesis.
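A cheaper first check than a full benchmark would be whether the device even reports the cl_khr_fp64 extension; a quick sketch, with 'device' as a placeholder cl_device_id:

    /* Sketch: query the extension string and look for double support.
       'device' is a placeholder cl_device_id. */
    char ext[4096];
    clGetDeviceInfo(device, CL_DEVICE_EXTENSIONS, sizeof(ext), ext, NULL);
    if (strstr(ext, "cl_khr_fp64") == NULL)
        printf("no double-precision support on this device\n");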

answered by Benjamin Horstman


I ran into the same issue when I was testing out OpenCL on my MacBook. I believe it's because the GeForce 9400M has a higher bus speed to the main memory bank than the GeForce 9600M GT. So even though the GeForce 9600M GT has much more power than the GeForce 9400M, the time required to copy the memory to the GPU is too long to see the benefit of the more powerful GPU in your situation. It could also be caused by inappropriate work-group sizes.
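If you suspect the work-group size, you can ask the runtime for the maximum it allows for your kernel; a rough sketch, with 'kernel' and 'device' as placeholder names:

    /* Sketch: query the largest work-group size the runtime allows
       for this kernel on this device. 'kernel' and 'device' are
       placeholder names. */
    size_t maxWorkGroup = 0;
    clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
                             sizeof(size_t), &maxWorkGroup, NULL);
    printf("max work-group size for this kernel: %zu\n", maxWorkGroup);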

Also, I found this site very helpful in my OpenCL experience:

http://www.macresearch.org/opencl

answered by Kendall Hopkins