As of now, my GPU is slower than my CPU when it comes to kernel execution time. I thought that since I was testing with a small sample, the CPU finished faster because of its smaller startup overhead. However, when I tested the kernel with data almost 10 times the size of the sample, the CPU still finished faster and the GPU was almost 400 ms behind.
Runtime with 2.39 MB file: CPU 43.511 ms, GPU 65.219 ms
Runtime with 32.9 MB file: CPU 289.541 ms, GPU 605.400 ms
I tried using local memory, although I'm 100% sure I was using it wrong, and ran into two issues: either the kernel finishes anywhere between 1000 and 3000 ms (depending on what size I set for localWorkSize), or I run into a status code of -5, which is CL_OUT_OF_RESOURCES.
Here is the kernel that a fellow SO member helped me out with.
__kernel void lowpass(__global float *Array, __global float *coefficients, __global float *Output) {
    int globalId = get_global_id(0);
    float sum = 0.0f;
    for (int i = 0; i < 65; i++)
    {
        float tmp = 0;
        if (globalId + i > 63)
        {
            tmp = Array[i + globalId - 64] * coefficients[64 - i];
        }
        sum += tmp;
    }
    Output[globalId] = sum;
}
This was my attempt at using local memory. The first bit is a snippet from the host code, and the portion that follows is the kernel.
//Set the size of localMem
status |= clSetKernelArg(
    kernel,
    2,
    1024, //I had num_items*sizeof(float) but it gave me a -5. num_items is the number of elements in my array (around 1.2 million)
    NULL);
printf("Kernel Arg output status: %i \n", status);

//set a localWorkSize
localWorkSize[0] = 64;

//execute the kernel with localWorkSize included
status = clEnqueueNDRangeKernel(
    cmdQueue,
    kernel,
    1,
    NULL,
    globalWorkSize,
    localWorkSize,
    0,
    NULL,
    &someEvent);
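From the local-memory question linked further down, I gather that the size passed for a __local argument is allocated per work-group, so it should presumably be sized for one group's tile rather than the whole array; asking for num_items*sizeof(float) (~4.8 MB) of local memory is what blows past the device limit and returns -5. A sketch of what I believe the call should look like, assuming localMem is argument index 3 as in the kernel below:

//Local memory is per work-group: size it for one group's tile plus the
//64-element filter history, not for all ~1.2 million elements.
status |= clSetKernelArg(
    kernel,
    3, //index of the __local float* parameter
    (localWorkSize[0] + 64) * sizeof(float),
    NULL); //NULL value means "allocate local memory of this size"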
//Here is what I did to the kernel***************************************
__kernel void lowpass(__global float *Array, __global float *coefficients, __global float *Output, __local float *localMem) {
    int globalId = get_global_id(0);
    int localId = get_local_id(0);
    localMem[localId] = Array[globalId];
    float sum = 0.0f;
    for (int i = 0; i < 65; i++)
    {
        float tmp = 0;
        if (globalId + i > 63)
        {
            tmp = localMem[i + localId - 64] * coefficients[64 - i];
        }
        sum += tmp;
    }
    Output[globalId] = sum;
}
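From the same reference, the usual pattern seems to be tile-plus-halo: each group stages get_local_size(0) + 64 inputs into local memory and synchronizes before anyone reads them. A minimal sketch of what I was presumably aiming for (untested, just matching the 65-tap indexing above):

__kernel void lowpass(__global float *Array, __global float *coefficients, __global float *Output, __local float *localMem) {
    int globalId = get_global_id(0);
    int localId = get_local_id(0);
    int groupBase = globalId - localId; //first output index handled by this group
    //Each work-item stages its own input element, shifted past the halo.
    localMem[localId + 64] = Array[globalId];
    //The first 64 work-items also stage the 64-element history (halo).
    if (localId < 64)
    {
        int j = groupBase - 64 + localId; //may fall before the start of Array
        localMem[localId] = (j >= 0) ? Array[j] : 0.0f;
    }
    barrier(CLK_LOCAL_MEM_FENCE); //make all staged values visible to the group
    float sum = 0.0f;
    for (int i = 0; i < 65; i++)
    {
        float tmp = 0;
        if (globalId + i > 63) //skip taps that reach before the signal start
        {
            tmp = localMem[i + localId] * coefficients[64 - i];
        }
        sum += tmp;
    }
    Output[globalId] = sum;
}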
Reference link I used when trying to set local variables: How do I use local memory in OpenCL?
Link used to find kernelWorkGroupSize (this is why I have 1024 set in the kernelArg): CL_OUT_OF_RESOURCES for 2 millions floats with 1GB VRAM?
I've seen other people have similar problems where the GPU is slower than the CPU, but many of them were using clEnqueueTask instead of clEnqueueNDRangeKernel.
Here's my previous question if you need more info on this kernel: Best approach to FIFO implementation in a kernel OpenCL
I also found some optimization tricks for GPUs: https://developer.amd.com/wordpress/media/2012/10/Optimizations-ImageConvolution1.pdf
Edited code; the error still exists:
__kernel void lowpass2(__global float *Array, __global float *coefficients, __global float *Output) {
    int globalId = get_global_id(0);
    float sum = 0.0f;
    float tmp = 0.0f;
    for (int i = 64 - globalId; i < 65; i++)
    {
        tmp = 0.0f;
        tmp = Array[i] * coefficients[i];
        sum += tmp;
    }
    Output[globalId] = sum;
}
Running the following kernel on 24-million-element arrays
__kernel void lowpass(__global float *Array, __global float *coefficients, __global float *Output) {
    int globalId = get_global_id(0);
    float sum = 0.0f;
    for (int i = 0; i < 65; i++)
    {
        float tmp = 0;
        if (globalId + i > 63)
        {
            tmp = Array[i + globalId - 64] * coefficients[64 - i];
        }
        sum += tmp;
    }
    Output[globalId] = sum;
}
completes in under 200 ms on a 25-compute-unit device but takes over 500 ms on an 8-core CPU.
Either you have a high-end CPU and a low-end GPU, or the GPU driver has been gimped, or the GPU's PCI-e interface is stuck at PCI-e 1.1 @ 4x bandwidth, so array copies between host and device are limited.
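One way to test the transfer-bandwidth hypothesis is to time the copies and the kernel separately with profiling events; a sketch, assuming your queue was created with CL_QUEUE_PROFILING_ENABLE and someEvent is the kernel event from your host code:

cl_ulong start, end;
clWaitForEvents(1, &someEvent);
clGetEventProfilingInfo(someEvent, CL_PROFILING_COMMAND_START,
    sizeof(cl_ulong), &start, NULL);
clGetEventProfilingInfo(someEvent, CL_PROFILING_COMMAND_END,
    sizeof(cl_ulong), &end, NULL);
printf("kernel time: %f ms\n", (end - start) / 1000000.0);
//attach events to clEnqueueWriteBuffer/clEnqueueReadBuffer the same way;
//if the copies dominate, the bottleneck is the bus, not the kernel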
On the other hand, this optimized version:
__kernel void lowpass(__global const float * restrict Array, __constant float *coefficients, __global float * restrict Output) {
    int globalId = get_global_id(0);
    float sum = 0.0f;
    int min_i = max(64, globalId) - 64; //first input index with a valid tap
    int max_i = globalId + 1;           //equals min_i + 65 once globalId >= 64
    for (int i = min_i; i < max_i; i++)
    {
        sum += Array[i] * coefficients[globalId - i];
    }
    Output[globalId] = sum;
}
runs in under 150 ms on the CPU (8 compute units) and under 80 ms on the GPU (25 compute units). The work per item is only 65 multiply-adds; this low number of operations could be very easily accelerated using __constant plus const and restrict parameter qualifiers (the __read_only/__write_only access qualifiers apply only to images, so const/restrict is the buffer equivalent) and some integer-work reduction.
Using float4 instead of float for Array and Output should increase speed by 80% on both your CPU and GPU, since both have SIMD/vector compute units.
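A sketch of what that float4 version might look like (a hypothetical lowpass4; it assumes the element count is a multiple of 4 and the global work size is divided by 4 to match):

//One work-item produces 4 consecutive outputs; launch with global size = num_items / 4.
__kernel void lowpass4(__global const float * restrict Array,
                       __constant float *coefficients,
                       __global float4 * restrict Output) {
    int gid = get_global_id(0);
    int base = gid * 4; //index of the first of this item's 4 outputs
    float4 sum = (float4)(0.0f);
    for (int i = 0; i < 65; i++)
    {
        int j = base + i - 64; //input index feeding output 'base'
        //Guarded 4-wide gather; taps that fall before Array[0] contribute 0.
        float4 x = (float4)(j + 0 >= 0 ? Array[j + 0] : 0.0f,
                            j + 1 >= 0 ? Array[j + 1] : 0.0f,
                            j + 2 >= 0 ? Array[j + 2] : 0.0f,
                            j + 3 >= 0 ? Array[j + 3] : 0.0f);
        sum += x * coefficients[64 - i]; //scalar coefficient broadcasts across the vector
    }
    Output[gid] = sum;
}

Interior work-groups that never touch the boundary could replace the component-wise gather with vload4(0, Array + j).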
The main bottleneck of this kernel is memory access bandwidth. Your GPU also has 83 GB/s of L1 and L2 cache bandwidth, so just let the driver's __global access optimizations work instead of LDS, unless you plan on upgrading your computer (for Array, of course). Maybe serving odd elements from LDS and even elements from __global could reach 83 + 83 = 166 GB/s combined bandwidth; you can try. Maybe two-by-two is better than alternating in terms of bank conflicts.
Using coefficients as __constant (166 GB/s) and Array as __global should give you 166 + 83 = 249 GB/s combined bandwidth.
Each coefficient element is used only once per thread, so I'm not suggesting private registers (499 GB/s) for them.