I am a beginner with PyTorch and I am just trying out some examples on this webpage. But I can't seem to get the 'super_resolution' program running because of this error:
RuntimeError: DataLoader worker (pid(s) 15332) exited unexpectedly
I searched the Internet and found that some people suggest setting num_workers to 0. But if I do that, the program tells me that I am running out of memory (either with CPU or GPU):
RuntimeError: [enforce fail at ..\c10\core\CPUAllocator.cpp:72] data. DefaultCPUAllocator: not enough memory: you tried to allocate 9663676416 bytes. Buy new RAM!
or
RuntimeError: CUDA out of memory. Tried to allocate 1024.00 MiB (GPU 0; 4.00 GiB total capacity; 2.03 GiB already allocated; 0 bytes free; 2.03 GiB reserved in total by PyTorch)
How do I fix this?
I am using Python 3.8 on Windows 10 (64-bit) with PyTorch 1.4.0.
More complete error messages (--cuda means using the GPU, --threads x means passing x to the num_workers parameter):
--upscale_factor 1 --cuda
File "E:\Python38\lib\site-packages\torch\utils\data\dataloader.py", line 761, in _try_get_data
data = self._data_queue.get(timeout=timeout)
File "E:\Python38\lib\multiprocessing\queues.py", line 108, in get
raise Empty
_queue.Empty
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "Z:\super_resolution\main.py", line 81, in <module>
train(epoch)
File "Z:\super_resolution\main.py", line 48, in train
for iteration, batch in enumerate(training_data_loader, 1):
File "E:\Python38\lib\site-packages\torch\utils\data\dataloader.py", line 345, in __next__
data = self._next_data()
File "E:\Python38\lib\site-packages\torch\utils\data\dataloader.py", line 841, in _next_data
idx, data = self._get_data()
File "E:\Python38\lib\site-packages\torch\utils\data\dataloader.py", line 808, in _get_data
success, data = self._try_get_data()
File "E:\Python38\lib\site-packages\torch\utils\data\dataloader.py", line 774, in _try_get_data
raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str))
RuntimeError: DataLoader worker (pid(s) 16596, 9376, 12756, 9844) exited unexpectedly
--upscale_factor 1 --cuda --threads 0
File "Z:\super_resolution\main.py", line 81, in <module>
train(epoch)
File "Z:\super_resolution\main.py", line 52, in train
loss = criterion(model(input), target)
File "E:\Python38\lib\site-packages\torch\nn\modules\module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "Z:\super_resolution\model.py", line 21, in forward
x = self.relu(self.conv2(x))
File "E:\Python38\lib\site-packages\torch\nn\modules\module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "E:\Python38\lib\site-packages\torch\nn\modules\conv.py", line 345, in forward
return self.conv2d_forward(input, self.weight)
File "E:\Python38\lib\site-packages\torch\nn\modules\conv.py", line 341, in conv2d_forward
return F.conv2d(input, weight, self.bias, self.stride,
RuntimeError: CUDA out of memory. Tried to allocate 1024.00 MiB (GPU 0; 4.00 GiB total capacity; 2.03 GiB already allocated; 954.35 MiB free; 2.03 GiB reserved in total by PyTorch)
Are you sure that RuntimeError: DataLoader worker (pid 30141) exited unexpectedly with exit code 1. is the only error you get? You should be getting an error from the worker process as well as an error from the main loader process (the one you posted).
Looks like torch is trying to call cudaGetDevice on a forked child process (data loader worker) in pytorch/torch/csrc/autograd/profiler_cuda.cpp. The worker throws a runtime error and brings down the process tree. One simple solution that may work is to have the profiler check its pid and stop if the pid changes.
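The real fix would live in the C++ profiler code, but the idea can be sketched in Python. This is purely illustrative (the class and method names are made up, not PyTorch's actual profiler API): remember the pid that created the profiler and become a no-op in any other process.

    import os

    class PidGuardedProfiler:
        """Illustrative sketch only; not PyTorch's real profiler API."""

        def __init__(self):
            # Remember which process created (and owns) the profiler.
            self.owner_pid = os.getpid()

        def record_event(self, name):
            if os.getpid() != self.owner_pid:
                # We are inside a forked/spawned DataLoader worker: calling
                # CUDA APIs such as cudaGetDevice here would crash the
                # worker, so silently do nothing instead.
                return
            print(f"profiling event: {name}")  # stand-in for the real CUDA call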
There is no "complete" fix for GPU out-of-memory errors, but there are quite a few things you can do to reduce the memory demand. Also, make sure that you are not moving the training set and the test set to the GPU at the same time!
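For example, one common pattern is to keep the datasets on the CPU and move only the current mini-batch to the GPU, together with a smaller batch size. A minimal sketch with toy stand-ins for the dataset, model, and loss (not the tutorial's actual code):

    import torch
    import torch.nn as nn
    from torch.utils.data import DataLoader, TensorDataset

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # Toy stand-ins for the real dataset, model, and loss.
    train_set = TensorDataset(torch.randn(256, 1, 32, 32), torch.randn(256, 1, 32, 32))
    model = nn.Conv2d(1, 1, kernel_size=3, padding=1).to(device)
    criterion = nn.MSELoss()

    # A smaller batch_size directly lowers peak GPU memory usage.
    training_data_loader = DataLoader(train_set, batch_size=16, shuffle=True)

    for inputs, targets in training_data_loader:
        # Move only the current mini-batch to the GPU, never the whole dataset.
        inputs, targets = inputs.to(device), targets.to(device)
        loss = criterion(model(inputs), targets)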
Alternatively, you can try running on Google Colaboratory (12-hour usage limit on a K80 GPU) or Next Journal, both of which provide up to 12 GB for use, free of charge. Worst case, you might have to train on your CPU. Hope this helps!
This is the solution that worked for me; it may work for other Windows users. Just remove (or comment out) the num_workers argument to disable parallel loading, as sketched below.
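A minimal sketch of that change, assuming the loader is built roughly the way the tutorial's main.py builds it (argument names may differ). With num_workers omitted it defaults to 0, so everything is loaded in the main process and no worker can exit unexpectedly:

    import torch
    from torch.utils.data import DataLoader, TensorDataset

    train_set = TensorDataset(torch.randn(64, 1, 32, 32), torch.randn(64, 1, 32, 32))  # placeholder dataset

    # Before (parallel loading with worker processes, which is what crashes here):
    # training_data_loader = DataLoader(dataset=train_set, num_workers=4,
    #                                   batch_size=4, shuffle=True)

    # After: num_workers removed, so it defaults to 0 and data is loaded
    # in the main process.
    training_data_loader = DataLoader(dataset=train_set, batch_size=4, shuffle=True)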