I've recently begun learning CUDA, and I've stumbled upon a very strange behavior I can't understand.
My code essentially computes an average execution time for a simple atomicAdd kernel. To accomplish this, I call the kernel in a loop to get a better average. I include the device memory allocation and copies in the loop as I want to include this in my execution time estimate. The problem is, the program often fails with Runtime API error 30 if the number of runs through the loop is too high.
I suspected that I might have an issue with my memory access, so I ran memcheck on the program, but it reported no memory errors. Also, if I run the kernel only a few times, there are no issues, which would seem to indicate that the kernel itself isn't the issue. It's only when I call it too many times in succession that I have problems.
A skeleton of my code follows:
for(int i = 0; i < runs; i++)
{
    //////////////////////////////////
    // Copy memory from Host to Device
    //////////////////////////////////
    cutilSafeCallNoSync( cudaMemcpy(dev_waveforms, waveforms, num_wf * wf_length * sizeof(float),
                                    cudaMemcpyHostToDevice) );
    cutilSafeCallNoSync( cudaMemcpy(dev_delays, delays, num_wf * sizeof(int),
                                    cudaMemcpyHostToDevice) );
    ////////////////////////
    // Kernel Call
    ////////////////////////
    kernel_wrapper<float>(dev_waveforms, dev_focused, dev_delays,
                          wf_length, num_wf, threads, blocks, kernel);
    // Copy back to host memory.
    cutilSafeCallNoSync( cudaMemcpy(focused, dev_focused, J * wf_length * sizeof(float),
                                    cudaMemcpyDeviceToHost) );
}
Again, this only fails if runs is sufficiently large. There are other strange things going on, but I'll leave it at this for now.
Oh, I'm developing on Windows 7 using Visual Studio 2010. My GPU is also acting as my video card, and I'm worried this may have strange effects.
Thanks in advance!
The Windows 7 driver may batch multiple commands into a single submission to offset the higher driver overhead of WDDM (the Windows Display Driver Model) compared to pre-WDDM drivers, e.g. on Windows XP. For this reason, even if a single kernel does not exceed the watchdog timeout, a batch of launches from a loop like this might. You could call cudaDeviceSynchronize() as @RogerDahl suggests to try to work around it (possibly only every N iterations).
Or run on Linux.
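The "every N iterations" idea above could look roughly like the sketch below. It is not the asker's code: dummy_kernel, should_sync and run_with_periodic_sync are hypothetical names made up for illustration, and the block size of 256 is an assumption.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the real kernel: increments each element.
__global__ void dummy_kernel(float *data, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        data[idx] += 1.0f;
}

// True when 0-based iteration i completes a group of sync_every launches.
inline bool should_sync(int i, int sync_every)
{
    return (i + 1) % sync_every == 0;
}

// Launches the kernel `runs` times and waits for the device every
// `sync_every` iterations, so queued launches cannot pile up indefinitely.
// dev_data must point to n floats of device memory.
cudaError_t run_with_periodic_sync(float *dev_data, int n, int runs, int sync_every)
{
    const int threads = 256;  // assumed block size
    const int blocks = (n + threads - 1) / threads;

    for (int i = 0; i < runs; i++) {
        dummy_kernel<<<blocks, threads>>>(dev_data, n);

        if (should_sync(i, sync_every)) {
            cudaError_t err = cudaDeviceSynchronize();
            if (err != cudaSuccess) {
                fprintf(stderr, "run %d: %s\n", i, cudaGetErrorString(err));
                return err;
            }
        }
    }
    return cudaDeviceSynchronize();  // wait for any remaining launches
}
```

With sync_every set to 1 this synchronizes after every launch; larger values trade some protection for less overhead.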
Edit:
Runtime API error 30 is cudaErrorUnknown, i.e. an unknown error. If this were a watchdog timer timeout, I would expect cudaErrorLaunchTimeout (error 6) instead. Since you didn't provide full code, it's hard to say what is causing the error. I suspect there is a bug in your kernel code.
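To see which error is actually being reported rather than relying on the numeric code, one option is to print the runtime's own description right after the launch. This is a minimal sketch, not the asker's code: cuda_ok, launch_checked and dummy_kernel are hypothetical names, and the block size is an assumption.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: returns true on success; otherwise prints the
// numeric code together with the runtime's description and returns false.
inline bool cuda_ok(cudaError_t err, const char *what)
{
    if (err == cudaSuccess)
        return true;
    fprintf(stderr, "%s failed: error %d (%s)\n",
            what, (int)err, cudaGetErrorString(err));
    return false;
}

// Hypothetical stand-in for the real kernel: increments each element.
__global__ void dummy_kernel(float *data, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        data[idx] += 1.0f;
}

// Launches once and checks both places where an error can surface.
bool launch_checked(float *dev_data, int n)
{
    const int threads = 256;  // assumed block size
    const int blocks = (n + threads - 1) / threads;

    dummy_kernel<<<blocks, threads>>>(dev_data, n);

    // Errors detected at launch time (e.g. an invalid configuration).
    if (!cuda_ok(cudaGetLastError(), "kernel launch"))
        return false;

    // Errors raised while the kernel was running (e.g. a bad memory access).
    return cuda_ok(cudaDeviceSynchronize(), "kernel execution");
}
```

Checking immediately after each launch helps attribute the failure to the launch that caused it, instead of to a later cudaMemcpy that merely reports it.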
I ran into the same error and found that my kernel was actually overrunning the memory I had allocated. Since you doubled your buffers and saw the problem go away, I would expect that you may be experiencing the same issue.
My issue was a bug in my math to determine how many threads and blocks to launch. I was launching eight times as many blocks as I intended. Inside my kernel, the math to determine which element a given thread should work on resulted in accessing way outside my array.
Make sure you check which element(s) of the array each thread is working with to prevent execution of threads that would be accessing/modifying memory outside the array.
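As a rough illustration of that advice, the sketch below rounds the block count up and has each thread check its index before touching memory. The names blocks_for, scale_kernel and launch_scale are hypothetical, and the block size of 256 is an assumption.

```cuda
#include <cuda_runtime.h>

// Number of blocks needed so that blocks * threads_per_block >= n.
inline int blocks_for(int n, int threads_per_block)
{
    return (n + threads_per_block - 1) / threads_per_block;
}

// Hypothetical kernel: scales each of the n elements by `factor`.
__global__ void scale_kernel(float *data, int n, float factor)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n)
        return;  // surplus threads in the last block do nothing
    data[idx] *= factor;
}

// dev_data must point to n floats of device memory.
void launch_scale(float *dev_data, int n, float factor)
{
    const int threads = 256;  // assumed block size
    scale_kernel<<<blocks_for(n, threads), threads>>>(dev_data, n, factor);
}
```

Passing the element count into the kernel means that even a miscalculated grid size (such as launching eight times too many blocks) cannot write outside the array.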