
What does "RuntimeError: CUDA error: device-side assert triggered" in PyTorch mean?

I have seen a lot of posts about particular case-specific instances of this problem, but no fundamental, motivating explanation. What does this error:

RuntimeError: CUDA error: device-side assert triggered

mean? Specifically, what is the assert that is being triggered, why is the assert there, and how do we work backwards to debug the problem?

As it stands, this error message is nearly useless for diagnosing any problem, because it seems to say only that "some code somewhere that touches the GPU" has a problem. The CUDA documentation does not seem helpful in this regard either, though I could be wrong: https://docs.nvidia.com/cuda/cuda-gdb/index.html

asked Apr 21 '19 by Joseph Konan




3 Answers

When I shifted my code to run on the CPU instead of the GPU, I got the following error:

IndexError: index 128 is out of bounds for dimension 0 with size 128

So the underlying problem may be an ordinary bug in the code (here, an out-of-range index) that, on the GPU, surfaces only as this generic CUDA error.
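
To make that concrete (a minimal sketch, not part of the original answer): the same out-of-range index that the GPU reports only as a device-side assert raises a clear, synchronous IndexError when the tensors live on the CPU:

    import torch

    # The same bad index that trips a device-side assert on the GPU
    # raises a readable IndexError on the CPU.
    weights = torch.zeros(128, 8)   # dimension 0 has size 128
    idx = torch.tensor([128])       # valid indices are 0..127
    row = weights[idx]              # IndexError: index 128 is out of bounds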

answered Oct 16 '22 by Hrushi


When a device-side error is detected while CUDA device code is running, that error is reported via the usual CUDA runtime API error reporting mechanism. The usual detected error in device code would be something like an illegal address (e.g. attempt to dereference an invalid pointer) but another type is a device-side assert. This type of error is generated whenever a C/C++ assert() occurs in device code, and the assert condition is false.

Such an error occurs as a result of a specific kernel. Runtime error checking in CUDA is necessarily asynchronous, but there are at least three possible methods to start debugging this.

  1. Modify the source code to effectively convert asynchronous kernel launches to synchronous kernel launches, and do rigorous error checking after each kernel launch. This will identify the specific kernel that caused the error. At that point it may be sufficient simply to look at the various asserts in that kernel code, but you could also use step 2 or 3 below. (A PyTorch-side sketch of this option follows the list.)

  2. Run your code with cuda-memcheck. This is a tool something like "valgrind for device code". When you run your code with cuda-memcheck, it will tend to run much more slowly, but the runtime error reporting will be enhanced. It is also usually preferable to compile your code with -lineinfo. In that scenario, when a device-side assert is triggered, cuda-memcheck will report the source code line number where the assert is, as well as the assert itself and the condition that was false. You can see here for a walkthrough of using it (albeit with an illegal address error rather than assert(), but the process with assert() will be similar).

  3. It should also be possible to use a debugger. If you use a debugger such as cuda-gdb (e.g. on Linux), the debugger will have back-trace reports that indicate which line the assert was on when it was hit.
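
Since the question is about PyTorch, here is a minimal sketch of option 1 from the Python side. Setting the environment variable CUDA_LAUNCH_BLOCKING=1 (a real CUDA variable) makes kernel launches synchronous, so the error is reported at the Python call that actually triggered it; the embedding and index values below are hypothetical, chosen to trip the assert:

    import os
    # Must be set before CUDA is initialized (i.e. before the first GPU op).
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

    import torch

    emb = torch.nn.Embedding(num_embeddings=128, embedding_dim=8).cuda()
    idx = torch.tensor([127, 128]).cuda()  # 128 is out of range
    out = emb(idx)  # with blocking launches, the assert is reported here

Without the variable, the error may instead be raised by some later, unrelated call, which is exactly why the message seems so unhelpful.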

Both cuda-memcheck and the debugger can be used if the CUDA code is launched from a python script.
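
For example, assuming the script is called train.py (a hypothetical name), `cuda-memcheck python train.py` runs the whole Python process under the checker, and `cuda-gdb --args python train.py` starts it under the debugger.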

At this point you have discovered what the assert is and where it is in the source code. Why it is there cannot be answered generically; that depends on the developer's intention, and if it is not commented or otherwise obvious, you will have to infer it somehow. The question of how to "work backwards" is also a general debugging question, not specific to CUDA. You can use printf in CUDA kernel code, and also a debugger like cuda-gdb, to assist with this (for example, set a breakpoint prior to the assert, and inspect machine state - e.g. variables - when the assert is about to be hit).
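
From the PyTorch side, the equivalent of inspecting state just before the assert is to validate the tensors feeding the suspect op before it launches on the GPU. A sketch assuming the common case of class targets that must lie in [0, num_classes):

    import torch

    # Pre-flight check before a GPU loss computation: out-of-range class
    # targets are a frequent cause of this device-side assert.
    logits = torch.randn(4, 10)            # 10 classes
    targets = torch.tensor([1, 9, 10, 3])  # 10 is out of range

    n_classes = logits.size(1)
    bad = (targets < 0) | (targets >= n_classes)
    if bad.any():
        print("out-of-range targets at positions:",
              bad.nonzero().flatten().tolist())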

With newer GPUs, instead of cuda-memcheck you will probably want to use compute-sanitizer. It works in a similar fashion.

answered Oct 16 '22 by Robert Crovella


In my case, the error was caused because my loss function only accepts values in [0, 1], and I was passing values outside that range.

So normalizing the loss function's input solved it:

    # shift each row so its minimum is 0, then scale so its maximum is 1
    saida_G -= saida_G.min(1, keepdim=True)[0]
    saida_G /= saida_G.max(1, keepdim=True)[0]

Read this: link
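
One common concrete instance of this (an assumption; the answer does not name the loss): torch.nn.BCELoss requires its input to be a probability in [0, 1], and values outside that range trip a device-side assert on the GPU:

    import torch

    loss_fn = torch.nn.BCELoss()
    pred = torch.tensor([1.5, 0.3], device="cuda")   # 1.5 is outside [0, 1]
    target = torch.tensor([1.0, 0.0], device="cuda")
    loss = loss_fn(pred, target)  # triggers the device-side assert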

answered Oct 16 '22 by Mateus Baltazar