What is the correct way to perform gradient clipping in pytorch?
I have an exploding gradients problem.
Gradient clipping prevents exploding gradients in neural networks by limiting the magnitude of the gradient. There are several ways to clip gradients, but a common one is to rescale them so that their norm is at most a chosen threshold.
Gradient clipping by norm: the idea is similar to clipping by value, except that we rescale the gradient by multiplying its unit vector by the threshold: g ← threshold · g / ‖g‖ whenever ‖g‖ > threshold, where the threshold is a hyperparameter, g is the gradient, and ‖g‖ is the norm of g. Because g/‖g‖ is a unit vector, the rescaled gradient keeps its direction and its norm becomes exactly the threshold.
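To make the formula concrete, here is a minimal plain-Python sketch of clipping by norm (a list of floats stands in for a gradient tensor; this illustrates the math, not the PyTorch API):

```python
import math

def clip_by_norm(grad, threshold):
    """Rescale a gradient vector so its L2 norm is at most `threshold`.

    `grad` is a plain list of floats standing in for a gradient tensor;
    illustrative sketch of the formula, not the PyTorch implementation.
    """
    norm = math.sqrt(sum(g * g for g in grad))
    if norm > threshold:
        # g <- threshold * g / ||g||: direction is kept, norm becomes `threshold`
        return [threshold * g / norm for g in grad]
    return grad

clip_by_norm([3.0, 4.0], threshold=1.0)  # original norm is 5.0 -> [0.6, 0.8]
```

Gradients below the threshold pass through untouched; only the overly large ones are scaled down.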
For comparison, applying gradient clipping in TensorFlow/Keras models is quite straightforward: you only need to pass the parameter to the optimizer constructor. Keras optimizers accept `clipnorm` and `clipvalue` parameters that can be used to clip the gradients.
Vanishing gradients are a different problem: optimization stalls because the gradient is too small to make progress. Gradient clipping targets the opposite issue, exploding gradients, i.e. the overly large updates that mess up the parameters during training.
A more complete example:

```python
optimizer.zero_grad()                        # reset gradients from the previous step
loss, hidden = model(data, hidden, targets)
loss.backward()                              # compute gradients
torch.nn.utils.clip_grad_norm_(model.parameters(), args.clip)  # clip before the update
optimizer.step()                             # apply the (clipped) gradients
```
`clip_grad_norm` (which is actually deprecated in favor of `clip_grad_norm_`, following the more consistent syntax of a trailing `_` when in-place modification is performed) clips the norm of the overall gradient by concatenating all parameters passed to the function, as can be seen from the documentation:

> The norm is computed over all gradients together, as if they were concatenated into a single vector. Gradients are modified in-place.
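To make the "single concatenated vector" behavior concrete, here is a plain-Python sketch of a global-norm clip over several parameter gradients (illustrative names; not the actual PyTorch internals):

```python
import math

def clip_grads_global_norm(grads, max_norm):
    """Clip a list of gradient lists by their combined L2 norm, in place.

    Mirrors the documented behavior of `clip_grad_norm_`: the norm is
    computed over all gradients together, as if they were concatenated
    into a single vector. Plain-Python sketch for illustration.
    """
    total_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        for grad in grads:
            for i in range(len(grad)):
                grad[i] *= scale     # every parameter is scaled by the same factor
    return total_norm                # return the pre-clip norm, like clip_grad_norm_

grads = [[3.0], [4.0]]               # two "parameters", combined norm 5.0
clip_grads_global_norm(grads, max_norm=1.0)
```

Note that every parameter's gradient is scaled by the same factor, so the relative magnitudes between parameters are preserved.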
From your example it looks like you want `clip_grad_value_` instead, which has a similar syntax and also modifies the gradients in-place:

```python
torch.nn.utils.clip_grad_value_(model.parameters(), clip_value)
```
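Clipping by value clamps each gradient component independently; here is a plain-Python sketch of that element-wise operation (illustrative, not the `clip_grad_value_` implementation):

```python
def clip_by_value(grad, clip_value):
    """Clamp each gradient component to [-clip_value, clip_value].

    Plain-Python sketch of per-element value clipping; unlike clipping
    by norm, this can change the direction of the gradient.
    """
    return [max(-clip_value, min(clip_value, g)) for g in grad]

clip_by_value([0.5, -2.0, 3.0], clip_value=1.0)  # -> [0.5, -1.0, 1.0]
```

Components already inside the range are untouched, so the direction of the gradient vector as a whole can change, which is the main practical difference from norm clipping.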
Another option is to register a backward hook. The hook takes the current gradient as an input and may return a tensor, which will be used in place of the previous gradient, i.e. it modifies the gradient. The hook is called each time after a gradient has been computed, so there is no need for manual clipping once the hook has been registered:

```python
for p in model.parameters():
    p.register_hook(lambda grad: torch.clamp(grad, -clip_value, clip_value))
```
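The hook mechanism can be pictured as a small pipeline: whenever a gradient is produced, each registered function gets a chance to replace it. A plain-Python sketch of that idea, using the same clamp as the `torch.clamp` lambda above (illustrative names, not the PyTorch internals):

```python
def make_clamp_hook(clip_value):
    """Return a hook that clamps each gradient component (plain-Python sketch)."""
    def hook(grad):
        return [max(-clip_value, min(clip_value, g)) for g in grad]
    return hook

hooks = [make_clamp_hook(1.0)]       # stands in for register_hook on a parameter

def run_backward(raw_grad):
    """Run every registered hook over a freshly computed gradient."""
    grad = raw_grad
    for h in hooks:
        out = h(grad)
        if out is not None:          # a hook returning a value replaces the gradient
            grad = out
    return grad

run_backward([2.5, -0.3, -7.0])      # -> [1.0, -0.3, -1.0]
```

Because the clamp runs automatically on every backward pass, the training loop itself no longer contains any clipping code.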