I am trying to use opencl for the first time, the goal is to calculate the argmin of each row in an array. Since the operation on each row is independent of the others, I thought this would be easy to put on the graphics card.
I seem to get worse performance using this code than when i just run the code on the cpu with an outer forloop, any help would be appreciated.
Here is the code:
#pragma OPENCL EXTENSION cl_khr_fp64 : enable
int argmin(global double *array, int end)
{
double minimum = array[0];
int index;
for (int j = 0; j < end; j++)
{
if (array[j] < minimum)
{
minimum = array[j];
index = j;
}
}
return index;
}
kernel void execute(global double *dist, global long *res, global double *min_dist)
{
int row_size = 0;
int i = get_global_id(0);
int row_index = i * row_size;
res[i] = argmin(&dist[row_index], row_size);
min_dist[i] = dist[res[i] + row_index];
}
The commenters make some valid points, but I'll try to be a little more constructive and organised:
double
precision floating point values. Depending on your GPU, this can be bad news in itself. Consumer grade GPUs typically are not optimised for working with double
s, often only achieving 1/32 or 1/16 the throughput compared to single-precision float
operations. Many pro-grade GPUs (Quadro, Tesla, FirePro, some Radeon Pro cards) are fine with them though, achieving 1/2 or 1/4 throughput versus float
. As you're only performing a trivial arithmetic operation (comparison), and there's a good chance your runtime is dominated by memory access, it could be fine on consumer hardware too.row_size
is not actually 0, it would help to know what the true (typical) value is, and whether it's fixed, variable by row, or variable per run but the same for each row. In any case, unless row_size
is very small, the fact that you are running a serial for
loop over it could be holding your code back.Improvement suggestions:
row_size
is very small (and fixed), try to split your loop across multiple work items and coordinate using local memory (reduction algorithms) and atomic operations in global memory.double
s on consumer hardware is too slow, either check if you can generate the data as float
s without loss of accuracy (this will also halve your memory bandwidth woes), or check if you can otherwise do better somehow, for example by treating the double
as a long
and manually unpacking and comparing the exponent and mantissa using integer operations.If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With