How can I reset the CUDA error to success with Driver API after a trap instruction?

Question

I have a kernel that may execute asm("trap;"). When that happens, the CUDA error code is set to launch failure, and I cannot reset it.

In the CUDA Runtime API, we can use cudaGetLastError to get the last error and, at the same time, reset it to cudaSuccess.

Is there a way to do that with the Driver API?

Robert Crovella · Accepted Answer

This type of error cannot be reset with the CUDA Runtime API cudaGetLastError() function.

There are two types of CUDA runtime errors: "sticky" and "non-sticky". Non-sticky errors are those that do not corrupt the context. For example, a cudaMalloc request that asks for more memory than is available will fail, but it will not corrupt the context. Such an error is "non-sticky".
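A minimal sketch of the non-sticky case, assuming a single-GPU machine and an allocation size (1 PiB, chosen here purely for illustration) that no device can satisfy. It prints the status of each call rather than assuming specific error strings:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    void *p = nullptr;

    // Assumption: 1 PiB is far beyond any device's memory, so this fails.
    size_t huge = size_t(1) << 50;
    cudaError_t err = cudaMalloc(&p, huge);
    printf("huge cudaMalloc:      %s\n", cudaGetErrorString(err));

    // The first call returns the recorded error and resets it;
    // the second call shows what is left afterwards.
    printf("first GetLastError:   %s\n", cudaGetErrorString(cudaGetLastError()));
    printf("second GetLastError:  %s\n", cudaGetErrorString(cudaGetLastError()));

    // The context was not corrupted, so ordinary work can continue.
    err = cudaMalloc(&p, 1024);
    printf("small cudaMalloc:     %s\n", cudaGetErrorString(err));
    if (err == cudaSuccess) cudaFree(p);
    return 0;
}
```

If the error is non-sticky as described, the second cudaGetLastError() call and the small allocation should both report success.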

Errors that involve unexpected termination of a CUDA kernel (including your trap example, in-kernel assert() failures, and runtime-detected execution errors such as out-of-bounds accesses) are "sticky". You cannot clear "sticky" errors with cudaGetLastError(). The only method to clear these errors in the runtime API is cudaDeviceReset() (which eliminates all device allocations and wipes out the context).
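The sticky case can be sketched in the same way. The kernel name trap_kernel is made up for this example, and the program only prints what each call returns, since the behavior after cudaDeviceReset() within the same process may vary:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel that terminates abnormally, as in the question.
__global__ void trap_kernel() { asm("trap;"); }

int main() {
    trap_kernel<<<1, 1>>>();

    // The launch is asynchronous; the failure surfaces at synchronization.
    cudaError_t err = cudaDeviceSynchronize();
    printf("cudaDeviceSynchronize:   %s\n", cudaGetErrorString(err));

    // Reading the last error does not repair the context.
    printf("cudaGetLastError:        %s\n", cudaGetErrorString(cudaGetLastError()));

    void *p = nullptr;
    err = cudaMalloc(&p, 1024);
    printf("cudaMalloc after trap:   %s\n", cudaGetErrorString(err));

    // Destroy the context, including all device allocations.
    err = cudaDeviceReset();
    printf("cudaDeviceReset:         %s\n", cudaGetErrorString(err));

    // The next runtime call creates a fresh context.
    err = cudaMalloc(&p, 1024);
    printf("cudaMalloc after reset:  %s\n", cudaGetErrorString(err));
    if (err == cudaSuccess) cudaFree(p);
    return 0;
}
```

Comparing the two cudaMalloc lines in the output shows whether the reset cleared the error on a given system.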

The corresponding driver API function is cuDevicePrimaryCtxReset().
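A rough Driver API sketch of the same sequence follows. The report helper, the kernel name trap_kernel, and the embedded PTX are all assumptions made for this example; in particular, the PTX is assumed to JIT-compile on sm_50 or newer devices. It would be built with something like nvcc file.cu -lcuda:

```cuda
#include <cstdio>
#include <cuda.h>

// Hypothetical helper: print a driver API result by name.
static void report(const char *what, CUresult r) {
    const char *name = nullptr;
    if (cuGetErrorName(r, &name) != CUDA_SUCCESS || name == nullptr) name = "unknown";
    printf("%-34s %s\n", what, name);
}

// Assumed minimal PTX for a kernel that only executes trap.
static const char *kPtx =
    ".version 6.0\n"
    ".target sm_50\n"
    ".address_size 64\n"
    ".visible .entry trap_kernel()\n"
    "{\n"
    "    trap;\n"
    "    ret;\n"
    "}\n";

int main() {
    report("cuInit", cuInit(0));
    CUdevice dev;
    report("cuDeviceGet", cuDeviceGet(&dev, 0));

    // Use the primary context, the one cuDevicePrimaryCtxReset acts on.
    CUcontext ctx = nullptr;
    report("cuDevicePrimaryCtxRetain", cuDevicePrimaryCtxRetain(&ctx, dev));
    report("cuCtxSetCurrent", cuCtxSetCurrent(ctx));

    CUmodule mod = nullptr;
    CUfunction fn = nullptr;
    report("cuModuleLoadData", cuModuleLoadData(&mod, kPtx));
    report("cuModuleGetFunction", cuModuleGetFunction(&fn, mod, "trap_kernel"));
    report("cuLaunchKernel",
           cuLaunchKernel(fn, 1, 1, 1, 1, 1, 1, 0, nullptr, nullptr, nullptr));

    // The trap surfaces at synchronization and keeps being reported.
    report("cuCtxSynchronize", cuCtxSynchronize());
    report("cuCtxSynchronize (again)", cuCtxSynchronize());

    // Drop our reference, then reset the primary context.
    cuCtxSetCurrent(nullptr);
    report("cuDevicePrimaryCtxRelease", cuDevicePrimaryCtxRelease(dev));
    report("cuDevicePrimaryCtxReset", cuDevicePrimaryCtxReset(dev));

    // Retain again to obtain a fresh context and try ordinary work.
    report("cuDevicePrimaryCtxRetain (new)", cuDevicePrimaryCtxRetain(&ctx, dev));
    report("cuCtxSetCurrent (new)", cuCtxSetCurrent(ctx));
    CUdeviceptr d = 0;
    CUresult r = cuMemAlloc(&d, 1024);
    report("cuMemAlloc after reset", r);
    if (r == CUDA_SUCCESS) cuMemFree(d);
    cuDevicePrimaryCtxRelease(dev);
    return 0;
}
```

Module handles and allocations created before the reset belong to the destroyed context, so anything still needed has to be loaded or allocated again afterwards.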

Note that cudaDeviceReset() by itself is insufficient to restore a GPU to proper functional behavior. In order to accomplish that, the "owning" process must also terminate. See here.

Tags: error-handling, cuda, cuda-driver

Asked by Xiang Zhang
