
How can I reset the CUDA error to success with Driver API after a trap instruction?

I have a kernel that may execute asm("trap;"). When that happens, the CUDA error code is set to a launch failure, and I cannot reset it.

In the CUDA Runtime API, cudaGetLastError returns the last error and, at the same time, resets it to cudaSuccess.

Is there a way to do that with the Driver API?

Xiang Zhang asked Mar 09 '23 05:03



1 Answer

This type of error cannot be reset with the CUDA Runtime API cudaGetLastError() function.

There are two types of CUDA runtime errors: "sticky" and "non-sticky". Non-sticky errors are those that do not corrupt the context. For example, a cudaMalloc request for more memory than is available will fail, but it will not corrupt the context. Such an error is non-sticky.

Errors that involve unexpected termination of a CUDA kernel (including your trap example, in-kernel assert() failures, and runtime-detected execution errors such as out-of-bounds accesses) are "sticky". You cannot clear sticky errors with cudaGetLastError(). The only method to clear these errors in the runtime API is cudaDeviceReset() (which eliminates all device allocations and wipes out the context).
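The difference can be observed with a small runtime API test. This is a minimal sketch, not a definitive reproduction: the helper report() and the kernel name trap_kernel are made up for illustration, the oversized allocation is assumed to fail on your system, and the exact error strings printed depend on the CUDA version and GPU.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel that aborts unconditionally, as in the question.
__global__ void trap_kernel() { asm("trap;"); }

// Hypothetical helper: print the result of a runtime API call.
static void report(const char* what, cudaError_t e) {
    std::printf("%-36s -> %s\n", what, cudaGetErrorString(e));
}

int main() {
    void* p = nullptr;

    // Non-sticky case: an oversized allocation is expected to fail
    // without corrupting the context.
    report("oversized cudaMalloc", cudaMalloc(&p, static_cast<size_t>(-1)));
    report("first cudaGetLastError", cudaGetLastError());   // returns and clears the error
    report("second cudaGetLastError", cudaGetLastError());  // should report success again

    // Sticky case: the kernel terminates abnormally.
    trap_kernel<<<1, 1>>>();
    report("cudaDeviceSynchronize after trap", cudaDeviceSynchronize());
    report("cudaGetLastError after trap", cudaGetLastError());
    // If the error is sticky, unrelated calls keep failing as well.
    report("small cudaMalloc after trap", cudaMalloc(&p, 16));

    // Wipe out the context; all earlier device allocations are gone after this.
    report("cudaDeviceReset", cudaDeviceReset());
    report("small cudaMalloc after reset", cudaMalloc(&p, 16));
    cudaFree(p);
    return 0;
}
```

Built with something like nvcc sticky.cu -o sticky, the calls made after the trap and before cudaDeviceReset() should keep reporting an error even though cudaGetLastError() was called in between.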

The corresponding driver API function is cuDevicePrimaryCtxReset(), which resets the primary context of the given device.
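A rough driver API sketch of the same situation follows. The driver API returns a CUresult from each call rather than keeping a separate last-error value, so the sticky error simply shows up in the return codes of later calls. The embedded PTX, the names kTrapPtx, trap_kernel, and report(), and the use of device 0 are all assumptions made for illustration; the PTX version and target may need adjusting for your toolchain.

```cuda
#include <cstdio>
#include <cuda.h>

// Hypothetical minimal PTX for a kernel that only executes `trap`.
static const char* kTrapPtx =
    ".version 6.0\n"
    ".target sm_50\n"
    ".address_size 64\n"
    ".visible .entry trap_kernel()\n"
    "{\n"
    "    trap;\n"
    "    ret;\n"
    "}\n";

// Hypothetical helper: print the symbolic name of a driver API result.
static CUresult report(const char* what, CUresult r) {
    const char* name = nullptr;
    if (cuGetErrorName(r, &name) != CUDA_SUCCESS || name == nullptr) {
        name = "unknown";
    }
    std::printf("%-28s -> %s\n", what, name);
    return r;
}

int main() {
    CUdevice dev = 0;
    CUcontext ctx = nullptr;
    CUmodule mod = nullptr;
    CUfunction fn = nullptr;

    report("cuInit", cuInit(0));
    report("cuDeviceGet", cuDeviceGet(&dev, 0));  // assumes device 0
    report("cuDevicePrimaryCtxRetain", cuDevicePrimaryCtxRetain(&ctx, dev));
    report("cuCtxSetCurrent", cuCtxSetCurrent(ctx));

    // JIT-compile the PTX and launch the trapping kernel with one thread.
    report("cuModuleLoadData", cuModuleLoadData(&mod, kTrapPtx));
    report("cuModuleGetFunction", cuModuleGetFunction(&fn, mod, "trap_kernel"));
    report("cuLaunchKernel",
           cuLaunchKernel(fn, 1, 1, 1, 1, 1, 1, 0, nullptr, nullptr, nullptr));

    // The failure surfaces at synchronization and is expected to persist.
    report("cuCtxSynchronize", cuCtxSynchronize());
    report("cuCtxSynchronize (again)", cuCtxSynchronize());

    // Reset the primary context. This destroys its modules and allocations.
    // Resetting does not release the context, so release it explicitly.
    report("cuDevicePrimaryCtxReset", cuDevicePrimaryCtxReset(dev));
    report("cuDevicePrimaryCtxRelease", cuDevicePrimaryCtxRelease(dev));
    return 0;
}
```

This can be built with something like nvcc reset.cu -o reset -lcuda. After the reset, any modules, functions, and device pointers obtained earlier should be treated as invalid and recreated.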

Note that cudaDeviceReset() by itself is insufficient to restore a GPU to proper functional behavior. In order to accomplish that, the "owning" process must also terminate. See here.

Robert Crovella answered Apr 24 '23 23:04
