Why do we need to explicitly zero the gradients in PyTorch? Why can't gradients be zeroed when loss.backward()
is called? What scenario is served by keeping the gradients on the graph and asking the user to explicitly zero the gradients?
zero_grad() clears the gradients left over from the previous step when you use gradient descent to reduce the error (or loss). If you do not call zero_grad(), the gradients from every step accumulate, so the updates are wrong and the loss can increase instead of decrease as required.
This ensures that we aren't carrying over stale gradient information when we train our neural network. You can also call model.zero_grad() to make sure all gradients of the model's parameters are zero, e.g. if you have two or more optimizers for one model.
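For illustration, here is a minimal sketch of that two-optimizer situation, where a single model.zero_grad() clears every parameter's gradient before the next backward pass. The layer sizes, learning rates, and random data below are placeholders, not something taken from the answer itself.

```python
import torch
import torch.nn as nn

# Sketch: one model whose parameters are split across two optimizers.
model = nn.Sequential(nn.Linear(10, 20), nn.ReLU(), nn.Linear(20, 1))
opt_body = torch.optim.SGD(model[0].parameters(), lr=1e-2)
opt_head = torch.optim.Adam(model[2].parameters(), lr=1e-3)

x, y = torch.randn(32, 10), torch.randn(32, 1)
loss_fn = nn.MSELoss()

for _ in range(5):
    # One call clears .grad on every parameter of the model,
    # regardless of which optimizer owns it.
    model.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    opt_body.step()
    opt_head.step()
```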
Take a loss function such as MSELoss, which computes the mean-squared error between the input and the target. When we call loss.backward(), the whole graph is differentiated w.r.t. the loss, and every tensor in the graph with requires_grad=True has the gradient accumulated into its .grad attribute.
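The accumulation behaviour is easy to see directly. The following is a small sketch with an arbitrary tensor and target; the values are only illustrative.

```python
import torch
import torch.nn as nn

# Sketch: .grad accumulates across backward() calls.
w = torch.ones(3, requires_grad=True)
target = torch.zeros(3)
loss_fn = nn.MSELoss()

loss = loss_fn(w * 2.0, target)
loss.backward()
print(w.grad)      # gradient from the first backward pass

loss = loss_fn(w * 2.0, target)
loss.backward()
print(w.grad)      # twice as large: the new gradient was added to the old one

w.grad.zero_()     # what zero_grad() does for each parameter
print(w.grad)      # back to zeros
```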
We explicitly need to call zero_grad() because, after loss.backward() (when gradients are computed), we call optimizer.step() to perform the gradient-descent update. More specifically, the gradients are not zeroed automatically because these two operations, loss.backward() and optimizer.step(), are separate, and optimizer.step() requires the just-computed gradients.
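A minimal sketch of the usual ordering within one training step; the model, data, and hyperparameters are placeholders.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

x, y = torch.randn(64, 10), torch.randn(64, 1)

for _ in range(10):
    optimizer.zero_grad()        # clear gradients left over from the previous step
    loss = loss_fn(model(x), y)  # forward pass
    loss.backward()              # compute fresh gradients
    optimizer.step()             # update parameters using exactly those gradients
```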
In addition, sometimes we need to accumulate gradients over several batches; to do that, we can simply call backward() multiple times and call optimizer.step() once.
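A minimal sketch of that gradient-accumulation pattern follows. The accumulation_steps value, the model, and the data are illustrative placeholders, and the division of the loss by accumulation_steps is an optional normalization, not something the answer prescribes.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

accumulation_steps = 4
batches = [(torch.randn(16, 10), torch.randn(16, 1)) for _ in range(accumulation_steps)]

optimizer.zero_grad()
for i, (x, y) in enumerate(batches):
    loss = loss_fn(model(x), y) / accumulation_steps  # scale so the sum approximates one large batch
    loss.backward()                                   # gradients accumulate in .grad
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()       # one update from the accumulated gradients
        optimizer.zero_grad()  # start the next accumulation window from zero
```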