Google Colab RuntimeError: cuDNN error: CUDNN_STATUS_NOT_INITIALIZED

Yesterday and today, while running the same Python notebooks that I have been running for the past few months, I started getting this error:

/usr/local/lib/python3.6/dist-packages/torch/autograd/__init__.py in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables)
 97     Variable._execution_engine.run_backward(
 98         tensors, grad_tensors, retain_graph, create_graph,
 ---> 99         allow_unreachable=True)  # allow_unreachable flag
100 
101 

RuntimeError: cuDNN error: CUDNN_STATUS_NOT_INITIALIZED

The point in the code where this error occurs seems to be random, since it changes from run to run. From what I have found by searching, it looks like a compatibility issue.

Also, if I rerun the cell, I sometimes get a different error:

/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py in __next__(self)
346         data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
347         if self._pin_memory:
--> 348             data = _utils.pin_memory.pin_memory(data)
349         return data
350 

/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/pin_memory.py in pin_memory(data)
 53         return type(data)(*(pin_memory(sample) for sample in data))
 54     elif isinstance(data, container_abcs.Sequence):
 ---> 55         return [pin_memory(sample) for sample in data]
 56     elif hasattr(data, "pin_memory"):
 57         return data.pin_memory()

 /usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/pin_memory.py in <listcomp>(.0)
 53         return type(data)(*(pin_memory(sample) for sample in data))
 54     elif isinstance(data, container_abcs.Sequence):
 ---> 55         return [pin_memory(sample) for sample in data]
 56     elif hasattr(data, "pin_memory"):
 57         return data.pin_memory()

 /usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/pin_memory.py in pin_memory(data)
 45 def pin_memory(data):
 46     if isinstance(data, torch.Tensor):
 ---> 47         return data.pin_memory()
 48     elif isinstance(data, string_classes):
 49         return data

 RuntimeError: cuda runtime error (700) : an illegal memory access was encountered at /pytorch/aten/src/THC/THCCachingHostAllocator.cpp:278

Does anyone else have the same problem? Has anyone solved it, and if so, how?

vpap asked Oct 15 '22 08:10


1 Answer

Finally, I solved the problem.

  1. Somewhere in my code I use a CrossEntropyLoss function with the ignore_index parameter set as ignore_index = my_ignore_index. By mistake, I had my_ignore_index = -1, which is not a valid value for my data: -1 never appears among my data values. Setting it to the correct value solved the "... an illegal memory access was encou..." error.

  2. The other thing that helped was switching to a newer version of anaconda3. This solved the CUDNN_STATUS_NOT_INITIALIZED error.
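
For anyone hitting the same thing, here is a rough sketch of the kind of check that would have caught the first problem before the loss was ever called. It is not what I actually ran: the helper name find_invalid_targets, the class count of 3, and the padding value 255 are all made up for illustration. It works on plain Python integers, so with a PyTorch target tensor you could pass something like targets.flatten().tolist().

```python
def find_invalid_targets(targets, num_classes, ignore_index):
    """Return the sorted, de-duplicated labels that are neither a valid class
    index in [0, num_classes) nor equal to ignore_index.

    Hypothetical helper for illustration; not part of PyTorch.
    """
    invalid = {
        int(t)
        for t in targets
        if int(t) != ignore_index and not (0 <= int(t) < num_classes)
    }
    return sorted(invalid)


# Assumed example data: 3 classes, padding positions labelled 255.
labels = [0, 2, 1, 255, 255, 0]

# With an ignore_index that never appears in the data, the padding value
# is reported as an out-of-range label.
print(find_invalid_targets(labels, num_classes=3, ignore_index=-1))

# With the ignore_index that matches the data, every label is accepted.
print(find_invalid_targets(labels, num_classes=3, ignore_index=255))
```

Running a check like this on the CPU is useful because an out-of-range target on the GPU can surface as a vague CUDA failure far from the real cause, rather than as a clear Python exception.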

I hope that helps.

vpap answered Oct 20 '22 11:10