I get a CUDA error 6 (also known as cudaErrorLaunchTimeout and CUDA_ERROR_LAUNCH_TIMEOUT) with this (simplified) code:
for(int i = 0; i < 650; ++i)
{
    int param = foo(i); // some CPU computation here, but no memory copy
    MyKernel<<<dimGrid, dimBlock>>>(&data, param);
}
CUDA error 6 indicates that the kernel took too long to return. The duration of a single MyKernel launch is only ~60 ms, though. The block size is a classic 16×16.
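(For reference, a single launch can be timed with CUDA events along these lines; this sketch is illustrative, not my exact measurement code:)

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
MyKernel<<<dimGrid, dimBlock>>>(&data, param);
cudaEventRecord(stop);
cudaEventSynchronize(stop); // wait for the kernel to finish

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop); // elapsed time in milliseconds, ~60 here

cudaEventDestroy(start);
cudaEventDestroy(stop);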
Now, when I call cudaDeviceSynchronize() every, say, 50 iterations, the error doesn't occur:
for(int i = 0; i < 650; ++i)
{
    int param = foo(i); // some CPU computation here, but no memory copy
    MyKernel<<<dimGrid, dimBlock>>>(&data, param);
    if(i % 50 == 0) cudaDeviceSynchronize();
}
I would like to avoid this synchronization because it slows the program down a lot.
Since kernel launches are asynchronous, I guess the error occurs because the watchdog measures the execution duration of a kernel from its asynchronous launch, not from the actual beginning of its execution.
I am new to CUDA. Is this a common situation in which error 6 occurs? Is there a way to avoid this error without sacrificing performance?
Thanks to talonmies and Robert Crovella (whose proposed solution didn't work for me), I've been able to find an acceptable workaround.
To prevent the CUDA driver from batching the kernel launches together, another operation must be performed before or after each kernel launch. For example, a dummy copy does the trick:
void* dummy;
cudaMalloc(&dummy, 1);

for(int i = 0; i < 650; ++i)
{
    int param = foo(i); // some CPU computation here, but no memory copy
    cudaMemcpyAsync(dummy, dummy, 1, cudaMemcpyDeviceToDevice); // dummy operation that breaks up the batch
    MyKernel<<<dimGrid, dimBlock>>>(&data, param);
}
This solution is 8 seconds faster (from 50 s to 42 s) than the one that includes calls to cudaDeviceSynchronize() (see question). It is also more reliable, since 50 was an arbitrary, device-specific period.
The watchdog isn't measuring the execution time of kernels, per se. The watchdog keeps track of requests in the command queue that goes to the GPU, and determines whether any of them have not been acknowledged by the GPU within the timeout period.
As @talonmies indicated in the comments, my best guess is that (if you are certain that no kernel execution exceeds the timeout period) this behavior is due to the CUDA driver's WDDM batching mechanism, which seeks to reduce average latency by batching GPU commands together and sending them to the GPU in batches.
You don't have direct control over the batching behavior, so in general, trying to work around this without disabling or modifying the Windows TDR mechanism will be an imprecise exercise.
The general (somewhat undocumented) suggestion for a low-cost "flush" of the command queue, which you might try experimenting with, is to use cudaEventQuery(0) (as suggested here) in place of cudaDeviceSynchronize(), perhaps every 50 kernel launches or so. To some degree the specifics may depend on the machine configuration and the GPU in use.
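For illustration, here is a minimal sketch of that approach, reusing the loop from the question (the period of 50 is just a starting point to experiment with, and the return value of cudaEventQuery is deliberately ignored):

for(int i = 0; i < 650; ++i)
{
    int param = foo(i); // some CPU computation here, but no memory copy
    MyKernel<<<dimGrid, dimBlock>>>(&data, param);
    if(i % 50 == 0) cudaEventQuery(0); // pushes queued work to the GPU without blocking the CPU
}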
I'm not sure how effective it will be in your case. I don't think that it can be advanced as a "guarantee" of avoiding a TDR event without a lot more experimentation. Your mileage may vary.