While measuring the performance of the same kernel on CUDA and OpenCL, I found something strange.
When I leave the kernel completely empty, with no input parameters and no calculations, CUDA gives much worse performance than OpenCL.
CUDA kernel:
__global__ void kernel_empty()
{
}
CUDA host:
kernel_empty<<<dim3(10000, 10000, 1), dim3(8, 8, 1)>>>();
OpenCL kernel:
__attribute__((reqd_work_group_size(8, 8, 1)))
__kernel void kernel_empty()
{
}
OpenCL host:
cl_event perf_event;
size_t global_work_offset[3] = {0, 0, 0};
size_t global_work_size[3] = {10000, 10000, 1};
size_t local_work_size[3] = {8, 8, 1};
clEnqueueNDRangeKernel(queue, kernel, 3, global_work_offset, global_work_size, local_work_size, 0, NULL, &perf_event);
OpenCL gives 6 ms (measured with clGetEventProfilingInfo).
CUDA gives 390 ms (measured with cudaEventElapsedTime).
Could someone explain why there is such a huge difference?
The way kernels are launched in OpenCL and CUDA is different, so the two code paths are actually launching different amounts of work.
In OpenCL, you specify the global work size (the total number of work-items to launch) and the local work size (the work-group size). In your example, you are launching 10000x10000 work-items in groups of 8x8.
In CUDA, you specify the block size (analogous to the work-group size) and the grid size, which is the number of blocks to launch. This means your CUDA example launches 10000x10000 blocks, for a total of 80000x80000 CUDA threads.
So, this CUDA kernel launch:
kernel_empty<<<dim3(10000, 10000, 1), dim3(8, 8, 1)>>>();
is equivalent to this OpenCL kernel enqueue:
size_t global_work_size[3] = {80000, 80000, 1};
size_t local_work_size[3] = {8, 8, 1};
clEnqueueNDRangeKernel(queue, kernel, 3, NULL, global_work_size, local_work_size, 0, NULL, &perf_event);