I want to run some experiments on my GPU device, but I get this error:
RuntimeError: CUDA out of memory. Tried to allocate 3.63 GiB (GPU 0; 15.90 GiB total capacity; 13.65 GiB already allocated; 1.57 GiB free; 13.68 GiB reserved in total by PyTorch)
I read about possible solutions here, and the common solution is this:
It is because the mini-batch of data does not fit into GPU memory. Just decrease the batch size. When I set batch size = 256 for the cifar10 dataset I got the same error; then I set batch size = 128, and it was solved.
But in my case, it is a research project, and I want to keep specific hyper-parameters, so I cannot reduce anything such as the batch size.
Does anyone have a solution for this?
In my model, the "cuda runtime error (2): out of memory" occurs because GPU memory is exhausted. Since PyTorch typically handles large amounts of data, small mistakes can quickly drain all available GPU memory and crash your program.
As long as a single sample can fit into GPU memory, you do not have to reduce the effective batch size: you can do gradient accumulation.
Instead of updating the weights after every iteration (based on gradients computed from a too-small mini-batch), you can accumulate the gradients over several mini-batches and update the weights only once enough examples have been seen.
This is nicely explained in this video.
Effectively, your training code would look something like this.
Suppose your large batch size is large_batch, but you can only fit small_batch into GPU memory, such that large_batch = small_batch * k. Then you want to update the weights every k iterations:
from torch.utils.data import DataLoader

train_data = DataLoader(train_set, batch_size=small_batch, ...)

opt.zero_grad()  # this signifies the start of a large_batch
for i, (x, y) in enumerate(train_data):
    pred = model(x)
    loss = criterion(pred, y) / k  # scale so the accumulated gradient matches the large-batch average
    loss.backward()  # gradients computed for small_batch and accumulated
    if (i + 1) % k == 0 or (i + 1) == len(train_data):
        opt.step()       # update the weights only after accumulating k small batches
        opt.zero_grad()  # reset gradients for the next large_batch
Shai's answer is a good one, but I want to offer another solution. Recently, I've been seeing excellent results from Nvidia AMP (Automatic Mixed Precision), which combines the memory savings of fp16 with the numerical stability of fp32. A positive side effect is that it also significantly speeds up training.
It's only a single line of code in TensorFlow: opt = tf.train.experimental.enable_mixed_precision_graph_rewrite(opt)
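Since the question uses PyTorch, here is a minimal sketch of the same idea with PyTorch's native torch.cuda.amp module. The model, criterion, opt, and train_data names are carried over from Shai's snippet above as assumptions; adapt it to your own training loop:

import torch
from torch.cuda.amp import GradScaler, autocast

scaler = GradScaler()  # scales the loss to avoid fp16 gradient underflow

for x, y in train_data:
    x, y = x.cuda(), y.cuda()  # assumes the model is already on the GPU
    opt.zero_grad()
    with autocast():  # run the forward pass in mixed precision
        pred = model(x)
        loss = criterion(pred, y)
    scaler.scale(loss).backward()  # backward pass on the scaled loss
    scaler.step(opt)               # unscales gradients, then calls opt.step()
    scaler.update()                # adjust the scale factor for the next iteration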
More details here
You can also stack AMP with Shai's solution.
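For completeness, a rough sketch of stacking both techniques, again reusing the assumed names from the snippets above. Note that scaler.step and scaler.update should only run at the accumulation boundary:

from torch.cuda.amp import GradScaler, autocast

scaler = GradScaler()

opt.zero_grad()
for i, (x, y) in enumerate(train_data):
    x, y = x.cuda(), y.cuda()
    with autocast():
        pred = model(x)
        loss = criterion(pred, y) / k  # scale the loss for accumulation
    scaler.scale(loss).backward()      # accumulate scaled gradients
    if (i + 1) % k == 0 or (i + 1) == len(train_data):
        scaler.step(opt)   # unscale gradients and update the weights
        scaler.update()
        opt.zero_grad()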