 

net.zero_grad() vs optim.zero_grad() in PyTorch

Here they mention the need to include optim.zero_grad() when training, to zero the parameter gradients. My question is: could I just as well call net.zero_grad(), and would that have the same effect? Or is it necessary to call optim.zero_grad()? Moreover, what happens if I do both? If I do neither, then the gradients get accumulated, but what does that mean exactly? Do they get added? In other words, what is the difference between calling optim.zero_grad() and net.zero_grad()?

I am asking because here, at line 115, they use net.zero_grad(), and it is the first time I have seen that. It is an implementation of a reinforcement learning algorithm, where one has to be especially careful with the gradients because there are multiple networks and gradients, so I suppose there is a reason for them to use net.zero_grad() as opposed to optim.zero_grad().

asked May 19 '20 by Schach21

People also ask

What is the difference between model.zero_grad() and optimizer.zero_grad()?

model.zero_grad() and optimizer.zero_grad() are the same if all of your model's parameters are in that optimizer. It is safer to call model.zero_grad() to make sure all gradients are zero, e.g. if you have two or more optimizers for one model.

Why do you need to call optimizer.zero_grad() when training a neural network in PyTorch?

Separating optimizer.zero_grad() and optimizer.step() provides more freedom in how gradients are accumulated and applied by the optimizer in the training loop. This is crucial when the model or input data is big and one actual training batch does not fit on the GPU.

What does optimizer.zero_grad() do?

Optimizer.zero_grad(set_to_none=False) sets the gradients of all optimized torch.Tensor s to zero.


1 Answer

net.zero_grad() sets the gradients of all its parameters (including parameters of submodules) to zero. If you call optim.zero_grad() that will do the same, but for all parameters that have been specified to be optimised. If you are using only net.parameters() in your optimiser, e.g. optim = Adam(net.parameters(), lr=1e-3), then both are equivalent, since they contain the exact same parameters.
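To make that equivalence concrete, here is a minimal sketch (the toy model and tensor sizes are made up for illustration) where the optimiser holds exactly net.parameters(), so either call clears the same gradients:

```python
import torch
import torch.nn as nn
from torch.optim import Adam

# Toy model and data, purely for illustration.
net = nn.Linear(4, 2)
optim = Adam(net.parameters(), lr=1e-3)

# One forward/backward pass to populate the gradients.
net(torch.randn(8, 4)).sum().backward()

# The optimiser holds exactly the parameters of net, so both calls clear
# the same .grad tensors (recent PyTorch versions set them to None by
# default instead of zeroing them in place).
net.zero_grad()  # equivalent here to: optim.zero_grad()

print(all(p.grad is None or torch.all(p.grad == 0) for p in net.parameters()))
# True
```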

You could have other parameters that are being optimised by the same optimiser which are not part of net, in which case you would either have to manually set their gradients to zero (and therefore keep track of all the parameters), or you can simply call optim.zero_grad() to ensure that all parameters being optimised have their gradients set to zero.
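A sketch of that situation (the extra parameter here is hypothetical, purely to illustrate the difference between the two calls):

```python
import torch
import torch.nn as nn
from torch.optim import Adam

net = nn.Linear(4, 2)
# A parameter that is optimised but is not part of net (made up for this example).
extra = nn.Parameter(torch.zeros(2))

# The optimiser is responsible for net's parameters *and* extra.
optim = Adam(list(net.parameters()) + [extra], lr=1e-3)

loss = (net(torch.randn(8, 4)) + extra).sum()
loss.backward()

net.zero_grad()    # clears only net's gradients; extra.grad is left untouched
optim.zero_grad()  # clears the gradients of every optimised parameter, including extra
```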

Moreover, what happens if I do both?

Nothing, the gradients would just be set to zero again, but since they were already zero, it makes absolutely no difference.

If I do none, then the gradients get accumulated, but what does that exactly mean? do they get added?

Yes, they are added to the existing gradients. In the backward pass the gradients with respect to every parameter are calculated, and then each gradient is added to the parameter's existing gradient (param.grad). That allows you to have multiple backward passes that affect the same parameters, which would not be possible if the gradients were overwritten instead of being added.
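A tiny sketch of that accumulation behaviour (the numbers are arbitrary):

```python
import torch

# A single leaf tensor, just to show how .grad accumulates.
w = torch.ones(3, requires_grad=True)

(w * 2).sum().backward()
print(w.grad)  # tensor([2., 2., 2.])

# Without zeroing in between, a second backward pass adds to the existing gradient.
(w * 3).sum().backward()
print(w.grad)  # tensor([5., 5., 5.])  -> 2 + 3, the gradients were added
```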

For example, you could accumulate the gradients over multiple batches if you need bigger batches for training stability but don't have enough memory to increase the batch size. This is trivial to achieve in PyTorch: you essentially leave out optim.zero_grad() and delay optim.step() until you have gathered enough batches, as shown in HuggingFace - Training Neural Nets on Larger Batches: Practical Tips for 1-GPU, Multi-GPU & Distributed setups.
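A hedged sketch of such an accumulation loop (the model, loss, dummy data loader and accumulation factor are all assumptions made for the sake of the example):

```python
import torch
import torch.nn as nn
from torch.optim import Adam

net = nn.Linear(10, 1)               # hypothetical model
optim = Adam(net.parameters(), lr=1e-3)
criterion = nn.MSELoss()
accumulation_steps = 4               # pretend each batch is 4x too small

# Dummy data just so the loop runs; in practice this is your DataLoader.
loader = [(torch.randn(8, 10), torch.randn(8, 1)) for _ in range(16)]

for step, (x, y) in enumerate(loader):
    loss = criterion(net(x), y)
    # Scale the loss so the accumulated gradient matches one big batch.
    (loss / accumulation_steps).backward()   # gradients accumulate in .grad

    if (step + 1) % accumulation_steps == 0:
        optim.step()        # apply the accumulated gradients
        optim.zero_grad()   # only now reset them for the next "big batch"
```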

That flexibility comes at the cost of having to manually set the gradients to zero. Frankly, one line is a very small cost to pay, even though many users won't make use of it and especially beginners might find it confusing.

answered Feb 10 '23 by Michael Jungo