
CUDA vs OpenCL performance comparison

Tags: cuda, opencl

I am using CUDA 6.0 and the OpenCL implementation that comes bundled with the CUDA SDK. I have two functionally identical kernels, one for each platform (they differ only in the platform-specific keywords). They only read and write global memory, with each thread accessing a different location. The launch configuration for CUDA is 200 blocks of 250 threads (1D), which corresponds directly to the OpenCL configuration: a global work size of 50,000 and a local work size of 250.
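To make the setup concrete, here is a minimal CUDA sketch of a kernel and launch of this shape (the kernel and buffer names are illustrative, not the asker's actual code):

```cuda
// Illustrative sketch of the setup described above: each thread reads one
// global-memory element and writes to a different global location.
__global__ void copyShift(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n)
        out[(i + 1) % n] = in[i];   // read one location, write another
}

// CUDA launch: 200 blocks of 250 threads = 50,000 threads, matching the
// OpenCL clEnqueueNDRangeKernel call with global_work_size = 50000 and
// local_work_size = 250 (i.e. 200 work-groups of 250 work-items).
// copyShift<<<200, 250>>>(d_in, d_out, 50000);
```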

The OpenCL code runs faster. Is this possible, or am I timing it wrong? My understanding is that NVIDIA's OpenCL implementation is based on the one for CUDA. I get around 15% better performance with OpenCL.
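For reference, a hedged sketch of one common way to time the CUDA side with events (the OpenCL analogue is clGetEventProfilingInfo on a queue created with CL_QUEUE_PROFILING_ENABLE); copyShift, d_in, and d_out are the illustrative names from the sketch above:

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    const int n = 50000;
    float *d_in, *d_out;
    cudaMalloc(&d_in,  n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    copyShift<<<200, 250>>>(d_in, d_out, n);   // 200 blocks x 250 threads
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);                // wait for the kernel to finish

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);    // elapsed GPU time in ms
    printf("kernel took %.3f ms\n", ms);

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```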

Could you suggest why I might be seeing this, and point out any differences between CUDA and OpenCL as implemented by NVIDIA?

asked May 06 '14 by user1096294


1 Answer

Kernels executing on a modern GPU are almost never compute bound; they are almost always memory-bandwidth bound. (There are far more compute cores running than the available path to memory can feed.)

This means that the performance of a kernel usually depends largely on the memory access patterns exhibited by the algorithm.
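As an illustration of how much the access pattern alone can matter, compare two hypothetical kernels that move the same amount of data but address it differently; on most GPUs the coalesced version is several times faster because adjacent threads touch adjacent addresses and the loads combine into few memory transactions:

```cuda
// Same amount of work in both kernels; only the access pattern differs.
// (Illustrative sketch, not the asker's kernels.)
__global__ void coalesced(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];               // thread k touches element k: coalesced
}

__global__ void strided(const float *in, float *out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int j = (i * stride) % n;     // neighbouring threads end up far apart
        out[j] = in[j];               // scattered, poorly coalesced accesses
    }
}
```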

In practice this makes it very difficult to predict (or even explain) a kernel's performance ahead of time.

The differences you observed are most likely due to subtle differences in the memory access patterns of the two kernels, resulting from different optimizations made by the OpenCL and CUDA toolchains.

To optimize your GPU kernels, it pays to learn the details of the memory and caching hardware available to you, and how to use it to best advantage (e.g., making strategic use of "local" memory in OpenCL rather than always going directly to "global" memory), as in the sketch below.
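For example, here is a minimal CUDA sketch of that idea: a 3-point average that stages a tile in on-chip shared memory, CUDA's equivalent of OpenCL "local" memory. The kernel name and TILE size are assumptions:

```cuda
#define TILE 256   // assumed block size for this sketch

__global__ void smooth3(const float *in, float *out, int n)
{
    __shared__ float tile[TILE];        // on-chip memory, shared by the block
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    if (i < n)
        tile[threadIdx.x] = in[i];      // one global read per element
    __syncthreads();                    // make the whole tile visible

    if (i < n) {
        // Neighbours inside the block come from fast shared memory
        // instead of being re-read from global memory.
        float mid   = tile[threadIdx.x];
        float left  = (threadIdx.x > 0) ? tile[threadIdx.x - 1] : mid;
        float right = (threadIdx.x < TILE - 1 && i + 1 < n)
                          ? tile[threadIdx.x + 1] : mid;
        out[i] = (left + mid + right) / 3.0f;
    }
}
```

Launched with a block size equal to TILE, each input element is read from global memory once, and the neighbouring reads are served from fast on-chip storage.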

answered Sep 28 '22 by Shebla Tsama