Yesterday and today running the same Python notebooks that I am running the past few months, I am getting the error
/usr/local/lib/python3.6/dist-packages/torch/autograd/__init__.py in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables)
97 Variable._execution_engine.run_backward(
98 tensors, grad_tensors, retain_graph, create_graph,
---> 99 allow_unreachable=True) # allow_unreachable flag
100
101
RuntimeError: cuDNN error: CUDNN_STATUS_NOT_INITIALIZED
The point in the code where this error seems to be random since it changes from try to try. From what I have searched, it looks to be a compatibility issue.
Also, if I rerun the cell, I might get another error which is,
/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py in __next__(self)
346 data = self._dataset_fetcher.fetch(index) # may raise StopIteration
347 if self._pin_memory:
--> 348 data = _utils.pin_memory.pin_memory(data)
349 return data
350
/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/pin_memory.py in pin_memory(data)
53 return type(data)(*(pin_memory(sample) for sample in data))
54 elif isinstance(data, container_abcs.Sequence):
---> 55 return [pin_memory(sample) for sample in data]
56 elif hasattr(data, "pin_memory"):
57 return data.pin_memory()
/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/pin_memory.py in <listcomp>(.0)
53 return type(data)(*(pin_memory(sample) for sample in data))
54 elif isinstance(data, container_abcs.Sequence):
---> 55 return [pin_memory(sample) for sample in data]
56 elif hasattr(data, "pin_memory"):
57 return data.pin_memory()
/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils /pin_memory.py in pin_memory(data)
45 def pin_memory(data):
46 if isinstance(data, torch.Tensor):
---> 47 return data.pin_memory()
48 elif isinstance(data, string_classes):
49 return data
RuntimeError: cuda runtime error (700) : an illegal memory access was encountered at /pytorch/aten/src/THC/THCCachingHostAllocator.cpp:278
Does anyone else have the same problem? Did anyone solve it, how?
Finally, I solved the problem.
Somewhere in my code I use a CrossEntropyLoss
function with ignore_index parameter as ignore_index = my_ignore_index
. By mistake, I had my_ignore_index = -1
which as value, it is not a valid value for my data; -1 never appears in my data values. Updating correctly solved the problem. This solved the "... an illegal memory access was encou..." error.
The other thing that I did and helped to solve the problem was to use a newer version of anaconda3. This solved the CUDNN_STATUS_NOT_INITIALIZED
error.
I hope that helps.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With