I have a kernel that might execute asm("trap;"). When that happens, the CUDA error code is set to launch failure, and I cannot reset it.
In the CUDA Runtime API, we can call cudaGetLastError() to get the last error and, at the same time, reset it to cudaSuccess.
Is there a way to do that with the Driver API?
This type of error cannot be reset with the CUDA Runtime API function cudaGetLastError().
There are two types of CUDA runtime errors: "sticky" and "non-sticky". Non-sticky errors are those that do not corrupt the context. For example, a cudaMalloc request that asks for more than the available memory will fail, but it will not corrupt the context. Such an error is non-sticky.
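To make the non-sticky case concrete, here is a minimal sketch. It assumes that a request of 2^50 bytes is larger than the memory of whatever device it runs on; the size is an illustrative choice, not something from the question. The program prints each status instead of assuming it.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    void *p = nullptr;

    // Assumption: 2^50 bytes (1 PiB) exceeds the memory of the device in use.
    const size_t huge = static_cast<size_t>(1) << 50;

    // This request is expected to fail without corrupting the context.
    cudaError_t err = cudaMalloc(&p, huge);
    printf("oversized cudaMalloc:    %s\n", cudaGetErrorString(err));

    // cudaGetLastError() returns the last error and resets it to cudaSuccess,
    // so the second call shows whether the error was cleared.
    err = cudaGetLastError();
    printf("first cudaGetLastError:  %s\n", cudaGetErrorString(err));
    err = cudaGetLastError();
    printf("second cudaGetLastError: %s\n", cudaGetErrorString(err));

    // With an intact context, a small allocation should still be possible.
    err = cudaMalloc(&p, 1024);
    printf("small cudaMalloc:        %s\n", cudaGetErrorString(err));
    if (err == cudaSuccess) cudaFree(p);
    return 0;
}
```

If the error is non-sticky as described, the second cudaGetLastError() call and the small allocation should both report no error.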
Errors that involve unexpected termination of a CUDA kernel are "sticky". These include your trap example, in-kernel assert() failures, and execution errors detected at run time, such as out-of-bounds accesses. You cannot clear sticky errors with cudaGetLastError(). The only method to clear these errors in the runtime API is cudaDeviceReset(), which eliminates all device allocations and wipes out the context.
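A small test program along the following lines can be used to observe the sticky behavior. It is only a sketch: trapKernel and report are hypothetical names, and the program prints every status rather than asserting what it will be, since the exact codes can differ between CUDA versions.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Kernel that terminates abnormally, as in the question.
__global__ void trapKernel() {
    asm("trap;");
}

// Hypothetical helper: print a label and the textual form of a status code.
static void report(const char *what, cudaError_t err) {
    printf("%-26s %s\n", what, cudaGetErrorString(err));
}

int main() {
    trapKernel<<<1, 1>>>();

    // The failure surfaces once the host synchronizes with the device.
    report("cudaDeviceSynchronize:", cudaDeviceSynchronize());

    // Try to clear the error the usual way.
    report("cudaGetLastError:", cudaGetLastError());

    // If the error is sticky, an unrelated call is expected to fail as well.
    void *p = nullptr;
    cudaError_t err = cudaMalloc(&p, 1024);
    report("cudaMalloc after trap:", err);
    if (err == cudaSuccess) cudaFree(p);

    // Reset the device, then probe again and print whatever comes back.
    report("cudaDeviceReset:", cudaDeviceReset());
    err = cudaMalloc(&p, 1024);
    report("cudaMalloc after reset:", err);
    if (err == cudaSuccess) cudaFree(p);
    return 0;
}
```

Comparing the "after trap" and "after reset" lines shows what the reset does and does not recover on a given setup.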
The corresponding driver API function is cuDevicePrimaryCtxReset().
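For reference, the call sequence around cuDevicePrimaryCtxReset() might look like the sketch below. It assumes device ordinal 0 and uses a hypothetical report helper; the work that would actually trigger the error is left as a comment. It shows how the function is called, not a guarantee about the state of the device afterwards.

```cuda
#include <cstdio>
#include <cuda.h>

// Hypothetical helper: print a label and the symbolic name of a driver result.
static void report(const char *what, CUresult res) {
    const char *name = nullptr;
    if (cuGetErrorName(res, &name) != CUDA_SUCCESS || name == nullptr) {
        name = "unknown";
    }
    printf("%-28s %s\n", what, name);
}

int main() {
    CUdevice dev;
    CUcontext ctx;

    report("cuInit:", cuInit(0));
    report("cuDeviceGet:", cuDeviceGet(&dev, 0));  // assumption: device 0

    // Retain the primary context (the one the runtime API also uses)
    // and make it current on this thread.
    report("cuDevicePrimaryCtxRetain:", cuDevicePrimaryCtxRetain(&ctx, dev));
    report("cuCtxSetCurrent:", cuCtxSetCurrent(ctx));

    // ... work that may leave the context with a sticky error goes here ...

    // Driver API counterpart of cudaDeviceReset().
    report("cuDevicePrimaryCtxReset:", cuDevicePrimaryCtxReset(dev));

    // Resetting does not release the context, so release it explicitly.
    report("cuDevicePrimaryCtxRelease:", cuDevicePrimaryCtxRelease(dev));
    return 0;
}
```

Code that uses the driver API has to be linked against the driver library, for example with nvcc file.cu -lcuda.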
Note that cudaDeviceReset() by itself is insufficient to restore a GPU to proper functional behavior. To accomplish that, the "owning" process must also terminate. See here.
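One way to act on this is to run the risky kernel in a separate process, so that the process that owns the broken context really does terminate. The following is a rough sketch assuming a POSIX system (fork and waitpid); riskyKernel and runRiskyWork are hypothetical names. The parent deliberately makes no CUDA call before forking.

```cuda
#include <cstdio>
#include <sys/wait.h>
#include <unistd.h>
#include <cuda_runtime.h>

// Kernel that may terminate abnormally, as in the question.
__global__ void riskyKernel() {
    asm("trap;");
}

// Runs only in the child: launch the kernel and turn the status into an exit code.
static int runRiskyWork() {
    riskyKernel<<<1, 1>>>();
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "child: %s\n", cudaGetErrorString(err));
        return 1;
    }
    return 0;
}

int main() {
    // Fork before the parent touches CUDA, so the child owns its own context.
    pid_t pid = fork();
    if (pid < 0) {
        perror("fork");
        return 1;
    }
    if (pid == 0) {
        _exit(runRiskyWork());  // the child terminates here, along with its context
    }

    // Parent: wait for the child and inspect how it ended.
    int status = 0;
    waitpid(pid, &status, 0);
    printf("child exit code: %d\n", WIFEXITED(status) ? WEXITSTATUS(status) : -1);

    // The parent creates its context only now, after the child has gone.
    void *p = nullptr;
    cudaError_t err = cudaMalloc(&p, 1024);
    printf("parent cudaMalloc: %s\n", cudaGetErrorString(err));
    if (err == cudaSuccess) cudaFree(p);
    return 0;
}
```

The design choice is that a sticky error stays confined to a short-lived child process, while the long-lived parent never holds the affected context.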