Why does my OpenCL kernel fail on the nVidia driver, but not Intel (possible driver bug)?

Tags:

I originally wrote an OpenCL program to calculate very large hermitian matrices, where the kernel calculates a single pair of entries in the matrix (the upper triangular portion, and its lower triangular complement).

Very early on, I found a very odd problem in that, if my kernel size is exactly 55, the 27th kernel thread would not execute. This problem only occurs when using the nVidia driver and GPU acceleration. When I run it using the Intel driver on the CPU, I find the 27th kernel thread executes just fine. Larger and smaller kernel sizes don't seem to exhibit the problem.

Thinking it might be something in my code, I distilled my problem down to the following very simple kernel:

__kernel void testIndex(__global float* outMatrix, unsigned int sizeN)
{
    //k is the linear kernel ID (related to but not exactly the linear index into the outMatrix)
    int k = get_global_id(0);
    //i'th index (Row or Y)
    int i = floor((2 * sizeN+1 - sqrt((float)((2 * sizeN + 1) * (2 * sizeN + 1) -8 * k) )) /2);

    //j'th index (Column or X)
    int j = k - sizeN * i + i * (i - 1) / 2;
    j += i;

    //Index bounds check... If we're greater than sizeN, we're an idle core.
    //(OpenCL will queue up a fixed block size of worker threads, some of them may be out of bounds)
    if(j >= sizeN || i >= sizeN)
    {
        return;
    }

    //Identity case. The original kernel did some special stuff here,
    //but I've just replaced it with the K index code.
    if(i == j)
    {
        outMatrix[i * sizeN +j] = k;
        return;
    }

    outMatrix[i * sizeN + j] = k;

    //Since we only have to calculate the upper triangle of our matrix,
    //(the lower triangle is just the complement of the upper),
    //this test sets the lower triangle to -9999 so it's easier to see
    //how the indexing plays out...

    outMatrix[j * sizeN + i] = -9999.0;

 }

outMatrix is the output matrix, and sizeN is the size of the square matrix on a side (i.e. the matrix is sizeN x sizeN).

I calculate and execute my kernel size using the following host code:

size_t kernelSize = elems * (elems + 1) / 2;
cl::NDRange globalRange(kernelSize);
cl::NDRange localRange(1);
cl::Event event;

clCommandQueue.enqueueNDRangeKernel(testKernel, cl::NullRange, globalRange, cl::NullRange, NULL, &event);
event.wait();

elems is the same as sizeN (i.e. the square root of the matrix size). In this case, elems = 10 (thus giving a kernel size of 55).

If I print out the matrix that I read back, I get the following (using boost ublas matrix formatting):

[10,10] ((    0,     1,     2,     3,     4,     5,     6,     7,     8,    9),
        ((-9999,    10,    11,    12,    13,    14,    15,    16,    17,   18),
        ((-9999, -9999,    19,    20,    21,    22,    23,    24,    25,   26),
        ((-9999, -9999, -9999,  JUNK,    28,    29,    30,    31,    32,   33),
        ((-9999, -9999, -9999, -9999,    34,    35,    36,    37,    38,   39),
        ((-9999, -9999, -9999, -9999, -9999,    40,    41,    42,    43,   44), 
        ((-9999, -9999, -9999, -9999, -9999, -9999,    45,    46,    47,   48),
        ((-9999, -9999, -9999, -9999, -9999, -9999, -9999,    49,    50,   51),    
        ((-9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999,    52,   53),   
        ((-9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999,   54))

Where "JUNK" is a random value based on whatever happens to be in that memory at the time. This is of course suspicious, as 27 is is basically the exact halfway point in the kernel.

Just for completeness, the matrix result is read back using the following code:

boost::scoped_array<float> outMatrixReadback(new float[elems * elems]);
clCommandQueue.enqueueReadBuffer(clOutputMatrixBuffer, CL_TRUE, 0, elems * elems * sizeof(float), outMatrixReadback.get());

I am making the (perhaps incorrect) assumption that since the code executes fine on an Intel CPU, that there is not some fundamental bug in the code itself.

So then, is there perhaps some gotcha I'm not aware of when programming OpenCL on an nVidia card, or am I unfortunate enough to have found a driver bug?

Hardware/OS specs

nVidia GTX 770
RHEL Server release 6.4 (Santiago)
Intel OpenCL 1.2 4.4.4.0.134 SDK headers
nVidia GeForce driver 384.69
Intel Xeon CPU E6520 @ 2.4 GHz

624

asked Sep 25 '17 15:09

stix

1 Answers

After discussions with nVidia, this was confirmed to be both repeatable and a driver bug by a technical rep. A bug report was submitted, but unfortunately I was informed nVidia doesn't have a dedicated OpenCL dev team, so a timeline on a fix can't be provided.

Edit: After finally hearing back from nVidia, the workaround is apparently to use pow() instead of sqrt() in the CL kernel, as sqrt() is apparently the source of the bug.

135

answered Oct 18 '22 23:10

stix

Related questions
                            
                                Calculate eigenvalues/eigenvectors of hundreds of small matrices using CUDA
                            
                                Different ways to make kernel
                            
                                Is there a good openCL wrapper for Ruby?
                            
                                How to draw OpenCL calculated pixels to the screen with OpenGL?
                            
                                Get GPU memory usage programmatically
                            
                                Are OpenCL work items executed in parallel?
                            
                                Can I run Cuda or opencl on intel iris?
                            
                                OpenCL: Work items, Processing elements, NDRange
                            
                                Keep getting CL_INVALID_KERNEL_ARGS on nvidia gpu
                            
                                How To Structure Large OpenCL Kernels?
                            
                                How to use async_work_group_copy in OpenCL?
                            
                                When will OpenCL 1.2 for NVIDIA hardware be available?
                            
                                HPC programming language relying on implicit vectorization
                            
                                Using Delphi to take advantage of GPGPU technology?
                            
                                Number of Compute Units corresponding to the number of work groups
                            
                                Confusion on CUDA/openCL and C++ AMP
                            
                                OpenCL - is it possible to invoke another function from within a kernel?
                            
                                OpenCL for Python
                            
                                How can I test for OpenCL compatibility?
                            
                                How does the Levenberg–Marquardt algorithm work in detail but in an understandable way?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Why does my OpenCL kernel fail on the nVidia driver, but not Intel (possible driver bug)?

Tags:

gpgpu

nvidia

opencl

stix

People also ask

1 Answers

stix

Recent Activity

Donate For Us