
CUDA Kernels Randomly Fail, but only when I use certain transcendental functions

Tags:

cuda

I've been working on a CUDA program that crashes fairly frequently, and at random, with an unspecified launch failure. Through careful debugging, I localized which kernel was failing, and found that the failure occurred only if certain transcendental functions (e.g. sinf() or atanhf()) were called from within the CUDA kernel.

This led me to write a much simpler program (see below) to confirm that these transcendental functions really were causing the issue, and it looks like that is indeed the case. When I compile and run the code below, which just makes repeated calls to a kernel that uses tanh and atanh, sometimes the program works, and sometimes it prints Error with kernel along with a message from the driver that says:

NVRM: XiD (0000:01:00): 13, 0002 000000 000050c0 00000368 00000000 0000080

With regards to frequency, it probably crashes 50% of the time that I run the executable.

From what I've read online, it sounds like Xid 13 is analogous to a host-based segfault. However, given the array indexing, I can't see how that could be the case. Furthermore, the program doesn't crash if I replace the transcendental functions in the kernel with other operations (e.g. repeated floating-point subtraction and addition). That is, I don't get the Xid error message, and the program ultimately returns the correct value of atanh(0.7).

I'm running cuda-5.0 on Ubuntu 11.10 x64 Desktop. Driver version is 304.54, and I'm using a GeForce 9800 GTX.

I'm inclined to say that this is a hardware issue or a driver bug. What's strange is that the example applications from NVIDIA work fine, perhaps because they do not use the affected transcendental functions.

The final bit of potentially important information is that if I run either my main project or this test program under cuda-memcheck, it reports no errors and never crashes. Honestly, I'd just always run my project under cuda-memcheck, but the performance hit makes that impractical.

Thanks in advance for any help/insight here. If anyone has a 9800 GTX and would be willing to run this code to see if it works, it would be greatly appreciated.

#include <iostream>
#include <stdlib.h>

using namespace std;

// Repeatedly applies atanh then tanh (inverses of each other), so temp
// should stay ~0.7; each thread finally stores atanh(0.7) ~= 0.8673.
__global__ void test_trans (float *a, int length) {
  int idx = threadIdx.x + blockDim.x*blockIdx.x;
  if (idx < length) {
    float temp = 0.7f;
    for (int i = 0; i < 100; i++) {
      temp = atanh(temp);
      temp = tanh(temp);
    }
    a[idx] = atanh(temp);
  }
}

int main () {
  float *array_dev;
  float *array_host;
  unsigned int size=10000000;
  if (cudaSuccess != cudaMalloc ((void**)&array_dev, size*sizeof(float)) ) {
    cerr << "Error with memory Allocation\n"; exit (-1);}
  array_host = new float [size];

  for (int i=0;i<10;i++) {
    test_trans <<< size/512+1, 512 >>> (array_dev, size);
    if (cudaSuccess != cudaDeviceSynchronize()) {
       cerr << "Error with kernel\n"; exit (-1);}
  }
  if (cudaSuccess != cudaMemcpy (array_host, array_dev, sizeof(float)*size, cudaMemcpyDeviceToHost)) {
    cerr << "Error with memcpy\n"; exit (-1);}

  cout << array_host[size-1] << "\n";

  cudaFree (array_dev);
  delete [] array_host;
  return 0;
}

Edit: I dropped this project for a few months, but yesterday, after updating to driver version 319.23, the problem is gone. I think the issue I described must have been a driver bug that has since been fixed. Hope this helps.

asked Nov 13 '22 by njohn5188

1 Answer

The asker determined that this was a transient issue fixed by a newer driver release (319.23). See the edit to the original question.

answered Dec 25 '22 by harrism