Why do I get "CUDA out of memory" when running a PyTorch model [with enough GPU memory]?

I am asking this question because I can successfully train a segmentation network on my laptop's RTX 2070 with 8 GB of VRAM, yet exactly the same code, with exactly the same software libraries installed, throws an out-of-memory error on my desktop PC with a GTX 1080 Ti (11 GB of VRAM).

Why does this happen, considering that:

  1. The same Windows 10 + CUDA 10.1 + cuDNN 7.6.5.32 + NVIDIA driver 418.96 (which comes along with CUDA 10.1) is installed on both the laptop and the PC.

  2. Training with TensorFlow 2.3 runs smoothly on the PC's GPU, yet only PyTorch fails to allocate memory for training.

  3. PyTorch recognises the GPU (prints GTX 1080 Ti) via the command: print(torch.cuda.get_device_name(0))

  4. PyTorch allocates memory when running this command: torch.rand(20000, 20000).cuda() # allocated 1.5 GB of VRAM (both checks are combined in the snippet after this list).
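
For reference, here are items 3 and 4 combined into a single runnable sketch; the memory readout via torch.cuda.memory_allocated is an addition for illustration, not part of the original checks:

    import torch

    # Item 3: confirm that PyTorch sees the GPU (prints "GeForce GTX 1080 Ti" on the PC)
    print(torch.cuda.get_device_name(0))

    # Item 4: a test allocation succeeds; a 20000 x 20000 float32 tensor is
    # 20000 * 20000 * 4 bytes, i.e. roughly 1.5 GB of VRAM
    x = torch.rand(20000, 20000).cuda()
    print(torch.cuda.memory_allocated(0) / 1024**3, "GiB currently allocated")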

What is the solution to this?

Asked by Timbus Calin on Aug 17 '20

1 Answer

Most people (even in the thread linked below) jump to suggesting that decreasing the batch_size will solve this problem. In this case, it does not. For example, it would have been illogical for a network to train with 8 GB of VRAM and yet fail to train with 11 GB of VRAM, considering that no other applications were consuming video memory on the 11 GB system and that the exact same configuration was installed and used.

The reason this happened in my case was that, when using the DataLoader object, I had set a very high value (12) for the num_workers parameter. Decreasing it to 4 solved the problem.
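
As an illustration, here is a minimal sketch of the change, using a hypothetical placeholder dataset rather than the original segmentation data:

    import torch
    from torch.utils.data import DataLoader, TensorDataset

    # Hypothetical stand-in for the real segmentation dataset (images + masks)
    dataset = TensorDataset(torch.rand(100, 3, 256, 256),
                            torch.randint(0, 2, (100, 256, 256)))

    # num_workers=12 is what triggered the out-of-memory error on the 1080 Ti;
    # dropping it to 4 let training run normally
    loader = DataLoader(dataset, batch_size=8, shuffle=True, num_workers=4)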

In fact, although it sits at the bottom of the thread, the answer provided by Yurasyk at https://github.com/pytorch/pytorch/issues/16417#issuecomment-599137646 pointed me in the right direction.

Solution: Decrease the number of workers in the PyTorch DataLoader. Although I do not fully understand why this works, I assume it is related to the worker processes spawned behind the scenes for data fetching; it may be that on some machines this error appears when the worker count is too high.

Answered by Timbus Calin on Sep 22 '22