I originally wrote an OpenCL program to calculate very large Hermitian matrices, where each kernel invocation calculates a single pair of entries in the matrix (an entry in the upper triangular portion and its lower-triangular complement).
Very early on, I found a very odd problem: if my kernel size (global work size) is exactly 55, the 27th kernel thread does not execute. This problem only occurs when using the nVidia driver and GPU acceleration; when I run it using the Intel driver on the CPU, the 27th kernel thread executes just fine. Larger and smaller kernel sizes don't seem to exhibit the problem.
Thinking it might be something in my code, I distilled my problem down to the following very simple kernel:
__kernel void testIndex(__global float* outMatrix, unsigned int sizeN)
{
    // k is the linear kernel ID (related to, but not exactly, the linear index into outMatrix)
    int k = get_global_id(0);

    // i'th index (Row or Y)
    int i = floor((2 * sizeN + 1 - sqrt((float)((2 * sizeN + 1) * (2 * sizeN + 1) - 8 * k))) / 2);

    // j'th index (Column or X)
    int j = k - sizeN * i + i * (i - 1) / 2;
    j += i;

    // Index bounds check... If we're greater than sizeN, we're an idle core.
    // (OpenCL will queue up a fixed block size of worker threads; some of them may be out of bounds.)
    if(j >= sizeN || i >= sizeN)
    {
        return;
    }

    // Identity case. The original kernel did some special stuff here,
    // but I've just replaced it with the k index code.
    if(i == j)
    {
        outMatrix[i * sizeN + j] = k;
        return;
    }

    outMatrix[i * sizeN + j] = k;

    // Since we only have to calculate the upper triangle of our matrix
    // (the lower triangle is just the complement of the upper),
    // this test sets the lower triangle to -9999 so it's easier to see
    // how the indexing plays out...
    outMatrix[j * sizeN + i] = -9999.0f;
}
outMatrix is the output matrix, and sizeN is the length of one side of the square matrix (i.e. the matrix is sizeN x sizeN).
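For reference, the index math can be sanity-checked on the host. The following is a minimal standalone C++ sketch (the helper name mapK and the main driver are mine, purely for illustration); for sizeN = 10 it shows that k = 27 lands on the diagonal at (3, 3), which is exactly the cell that comes back as junk in the output further down:

#include <cmath>
#include <cstdio>

// Host-side copy of the kernel's k -> (i, j) mapping (mapK is a hypothetical name).
void mapK(int k, int sizeN, int& i, int& j)
{
    i = (int)std::floor((2 * sizeN + 1 - std::sqrt((float)((2 * sizeN + 1) * (2 * sizeN + 1) - 8 * k))) / 2);
    j = k - sizeN * i + i * (i - 1) / 2 + i;
}

int main()
{
    int i, j;
    mapK(27, 10, i, j);
    std::printf("i=%d j=%d\n", i, j); // prints i=3 j=3
    return 0;
}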
I calculate the kernel size and enqueue the kernel using the following host code:
size_t kernelSize = elems * (elems + 1) / 2; // one work item per upper-triangle entry
cl::NDRange globalRange(kernelSize);
cl::NDRange localRange(1); // note: unused; the enqueue below passes cl::NullRange for the local size
cl::Event event;
clCommandQueue.enqueueNDRangeKernel(testKernel, cl::NullRange, globalRange, cl::NullRange, NULL, &event);
event.wait();
elems is the same as sizeN (i.e. the square root of the total matrix size). In this case, elems = 10, giving a kernel size of 10 * 11 / 2 = 55.
If I print out the matrix that I read back, I get the following (using boost ublas matrix formatting):
[10,10]((    0,     1,     2,     3,     4,     5,     6,     7,     8,     9),
        (-9999,    10,    11,    12,    13,    14,    15,    16,    17,    18),
        (-9999, -9999,    19,    20,    21,    22,    23,    24,    25,    26),
        (-9999, -9999, -9999,  JUNK,    28,    29,    30,    31,    32,    33),
        (-9999, -9999, -9999, -9999,    34,    35,    36,    37,    38,    39),
        (-9999, -9999, -9999, -9999, -9999,    40,    41,    42,    43,    44),
        (-9999, -9999, -9999, -9999, -9999, -9999,    45,    46,    47,    48),
        (-9999, -9999, -9999, -9999, -9999, -9999, -9999,    49,    50,    51),
        (-9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999,    52,    53),
        (-9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999,    54))
Where "JUNK" is a random value based on whatever happens to be in that memory at the time. This is of course suspicious, as 27 is is basically the exact halfway point in the kernel.
Just for completeness, the matrix result is read back using the following code:
boost::scoped_array<float> outMatrixReadback(new float[elems * elems]);
clCommandQueue.enqueueReadBuffer(clOutputMatrixBuffer, CL_TRUE, 0, elems * elems * sizeof(float), outMatrixReadback.get());
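In hindsight, one way to spot unwritten entries more directly (a sketch only, not what the program actually does; it assumes the same clCommandQueue, clOutputMatrixBuffer, elems, and outMatrixReadback as in the snippets above, and SENTINEL is a name I made up) is to pre-fill the buffer with a sentinel before launching the kernel and scan for it after readback:

#include <vector>
#include <cstdio>

// Hypothetical debugging aid: pre-fill the output buffer with a sentinel value...
const float SENTINEL = -123456.0f;
std::vector<float> init(elems * elems, SENTINEL);
clCommandQueue.enqueueWriteBuffer(clOutputMatrixBuffer, CL_TRUE, 0,
                                  elems * elems * sizeof(float), init.data());

// ... enqueue the kernel and read back into outMatrixReadback as above ...

// ... then flag any entry that no work item ever overwrote.
for (size_t idx = 0; idx < (size_t)(elems * elems); ++idx)
    if (outMatrixReadback[idx] == SENTINEL)
        std::printf("entry (%zu, %zu) was never written\n", idx / elems, idx % elems);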
I am making the (perhaps incorrect) assumption that, since the code executes fine on an Intel CPU, there is not some fundamental bug in the code itself.
So then, is there perhaps some gotcha I'm not aware of when programming OpenCL on an nVidia card, or am I unfortunate enough to have found a driver bug?
Hardware/OS specs
nVidia GTX 770
RHEL Server release 6.4 (Santiago)
Intel OpenCL 1.2 4.4.4.0.134 SDK headers
nVidia GeForce driver 384.69
Intel Xeon CPU E6520 @ 2.4 GHz
After discussions with nVidia, a technical rep confirmed that this is both repeatable and a driver bug. A bug report was submitted, but unfortunately I was informed that nVidia doesn't have a dedicated OpenCL dev team, so no timeline for a fix can be provided.
Edit: After finally hearing back from nVidia, the workaround is to use pow() instead of sqrt() in the CL kernel, as sqrt() is apparently the source of the bug.
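For anyone hitting the same thing, the change is a one-liner in the kernel. A sketch of the workaround (I've pulled the discriminant out into a local variable, named disc here, for readability):

// Workaround per nVidia: compute the square root via pow() instead of sqrt(),
// since sqrt() is apparently the source of the bug.
float disc = (float)((2 * sizeN + 1) * (2 * sizeN + 1) - 8 * k);
int i = floor((2 * sizeN + 1 - pow(disc, 0.5f)) / 2);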