
What does "RuntimeError: CUDA error: device-side assert triggered" in PyTorch mean?

I have seen a lot of posts about particular case-specific instances of this problem, but no fundamental, motivating explanation. What does this error:

RuntimeError: CUDA error: device-side assert triggered

mean? Specifically, what is the assert that is being triggered, why is the assert there, and how do we work backwards to debug the problem?

As it stands, this error message is nearly useless for diagnosing any problem, because it seems to say only that "some code somewhere that touches the GPU" has a problem. The CUDA documentation does not seem helpful in this regard either, though I could be wrong: https://docs.nvidia.com/cuda/cuda-gdb/index.html

asked Apr 21 '19 by Joseph Konan




3 Answers

When I shifted my code to run on the CPU instead of the GPU, I got the following error:

IndexError: index 128 is out of bounds for dimension 0 with size 128

So the underlying problem may be an ordinary bug in the code (here, an out-of-range index) that, on the GPU, surfaces only as this generic CUDA error.
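
To make that concrete (a minimal sketch, not part of the original answer): the same out-of-range index that the GPU reports only as a device-side assert raises a clear, synchronous IndexError when the tensors live on the CPU:

    import torch

    # The same bad index that trips a device-side assert on the GPU
    # raises a readable IndexError on the CPU.
    weights = torch.zeros(128, 8)   # dimension 0 has size 128
    idx = torch.tensor([128])       # valid indices are 0..127
    row = weights[idx]              # IndexError: index 128 is out of bounds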

answered Oct 16 '22 by Hrushi


When a device-side error is detected while CUDA device code is running, that error is reported via the usual CUDA runtime API error reporting mechanism. The usual detected error in device code would be something like an illegal address (e.g. attempt to dereference an invalid pointer) but another type is a device-side assert. This type of error is generated whenever a C/C++ assert() occurs in device code, and the assert condition is false.

Such an error occurs as a result of a specific kernel. Runtime error checking in CUDA is necessarily asynchronous, but there are at least three possible methods to start debugging this.

  1. Modify the source code to effectively convert asynchronous kernel launches to synchronous kernel launches, and do rigorous error checking after each kernel launch. This will identify the specific kernel that caused the error. At that point it may be sufficient simply to look at the various asserts in that kernel code, but you could also use step 2 or 3 below. (A PyTorch-side sketch of this option follows the list.)

  2. Run your code with cuda-memcheck. This is a tool something like "valgrind for device code". When you run your code with cuda-memcheck, it will tend to run much more slowly, but the runtime error reporting will be enhanced. It is also usually preferable to compile your code with -lineinfo. In that scenario, when a device-side assert is triggered, cuda-memcheck will report the source code line number where the assert is, as well as the assert itself and the condition that was false. You can see here for a walkthrough of using it (albeit with an illegal address error rather than assert(), but the process with assert() will be similar).

  3. It should also be possible to use a debugger. If you use a debugger such as cuda-gdb (e.g. on Linux), the debugger will have back-trace reports that indicate which line the assert was on when it was hit.
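
Since the question is about PyTorch, here is a minimal sketch of option 1 from the Python side. Setting the environment variable CUDA_LAUNCH_BLOCKING=1 (a real CUDA variable) makes kernel launches synchronous, so the error is reported at the Python call that actually triggered it; the embedding and index values below are hypothetical, chosen to trip the assert:

    import os
    # Must be set before CUDA is initialized (i.e. before the first GPU op).
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

    import torch

    emb = torch.nn.Embedding(num_embeddings=128, embedding_dim=8).cuda()
    idx = torch.tensor([127, 128]).cuda()  # 128 is out of range
    out = emb(idx)  # with blocking launches, the assert is reported here

Without the variable, the error may instead be raised by some later, unrelated call, which is exactly why the message seems so unhelpful.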

Both cuda-memcheck and the debugger can be used if the CUDA code is launched from a python script.
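
For example, assuming the script is called train.py (a hypothetical name), `cuda-memcheck python train.py` runs the whole Python process under the checker, and `cuda-gdb --args python train.py` starts it under the debugger.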

At this point you have discovered what the assert is and where it is in the source code. Why it is there cannot be answered generically; that depends on the developer's intention, and if it is not commented or otherwise obvious, you will have to infer it somehow. The question of how to "work backwards" is also a general debugging question, not specific to CUDA. You can use printf in CUDA kernel code, and also a debugger like cuda-gdb, to assist with this (for example, set a breakpoint prior to the assert, and inspect machine state - e.g. variables - when the assert is about to be hit).
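
From the PyTorch side, the equivalent of inspecting state just before the assert is to validate the tensors feeding the suspect op before it launches on the GPU. A sketch assuming the common case of class targets that must lie in [0, num_classes):

    import torch

    # Pre-flight check before a GPU loss computation: out-of-range class
    # targets are a frequent cause of this device-side assert.
    logits = torch.randn(4, 10)            # 10 classes
    targets = torch.tensor([1, 9, 10, 3])  # 10 is out of range

    n_classes = logits.size(1)
    bad = (targets < 0) | (targets >= n_classes)
    if bad.any():
        print("out-of-range targets at positions:",
              bad.nonzero().flatten().tolist())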

With newer GPUs, instead of cuda-memcheck you will probably want to use compute-sanitizer. It works in a similar fashion.

answered Oct 16 '22 by Robert Crovella


In my case, the error was caused because my loss function only accepts values in [0, 1], and I was passing values outside that range.

So normalizing the loss function's input solved it:

    # shift each row so its minimum is 0, then scale so its maximum is 1
    saida_G -= saida_G.min(1, keepdim=True)[0]
    saida_G /= saida_G.max(1, keepdim=True)[0]

Read this: link
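
One common concrete instance of this (an assumption; the answer does not name the loss): torch.nn.BCELoss requires its input to be a probability in [0, 1], and values outside that range trip a device-side assert on the GPU:

    import torch

    loss_fn = torch.nn.BCELoss()
    pred = torch.tensor([1.5, 0.3], device="cuda")   # 1.5 is outside [0, 1]
    target = torch.tensor([1.0, 0.0], device="cuda")
    loss = loss_fn(pred, target)  # triggers the device-side assert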

answered Oct 16 '22 by Mateus Baltazar