I want to understand the difference between the roles of tf.clip_by_value and tf.clip_by_global_norm when implementing gradient clipping in TensorFlow. Which one is preferred, and how do I decide the max value to clip on?
TensorFlow is an open-source Python library developed by Google for building machine learning models and deep learning neural networks. clip_by_value() is used to clip a tensor's values to a specified min and max. Syntax: tensorflow.clip_by_value(t, clip_value_min, clip_value_max, name=None)
Applying gradient clipping in TensorFlow models is quite straightforward. The only thing you need to do is pass a parameter to the optimizer. All Keras optimizers accept `clipnorm` and `clipvalue` parameters that can be used to clip the gradients.
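For example, a minimal sketch of that Keras-level approach (the SGD optimizer and the thresholds here are arbitrary choices for illustration):
from tensorflow import keras

# clipnorm: each gradient tensor is rescaled so that its L2 norm is at most 1.0
opt_norm = keras.optimizers.SGD(learning_rate=0.01, clipnorm=1.0)

# clipvalue: each gradient element is clipped into [-0.5, 0.5]
opt_value = keras.optimizers.SGD(learning_rate=0.01, clipvalue=0.5)

# Recent TensorFlow versions also accept global_clipnorm, which clips all
# gradients together by their global norm (see tf.clip_by_global_norm below).
# opt_global = keras.optimizers.SGD(learning_rate=0.01, global_clipnorm=1.0)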
Gradient clipping is a method where the error derivative is changed or clipped to a threshold during backward propagation through the network, and the clipped gradients are then used to update the weights.
TL;DR: use tf.clip_by_global_norm for gradient clipping.
tf.clip_by_value clips each value inside one tensor, regardless of the other values in the tensor. For instance,
tf.clip_by_value([-1, 2, 10], 0, 3) -> [0, 2, 3] # Only the values below 0 or above 3 are changed
Consequently, it can change the direction of the tensor, so it should be used when the values in the tensor are decorrelated from one another (which is not the case for gradient clipping), or to avoid zero or infinite values in a tensor that could lead to NaN or infinite values elsewhere (for instance, by clipping with a minimum of epsilon=1e-8 and a very large max value).
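As a small, hypothetical illustration of that second use case (the probs values are made up):
import tensorflow as tf

probs = tf.constant([0.0, 0.3, 0.7])
# log(0) would give -inf; clipping into [1e-8, 1.0] keeps every value finite
safe_log = tf.math.log(tf.clip_by_value(probs, 1e-8, 1.0))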
tf.clip_by_norm rescales one tensor if necessary, so that its L2 norm does not exceed a certain threshold. It is typically useful to avoid exploding gradients on one tensor, because you keep the gradient direction. For instance:
tf.clip_by_norm([-2, 3, 6], 5) -> [-2, 3, 6]*5/7 # The original L2 norm is 7, which is >5, so the final one is 5
tf.clip_by_norm([-2, 3, 6], 9) -> [-2, 3, 6] # The original L2 norm is 7, which is <9, so it is left unchanged
However, clip_by_norm works on only one gradient, so if you use it on all your gradient tensors, you'll unbalance them (some will be rescaled, others not, and not all with the same scale).
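To make that imbalance concrete, here is a sketch reusing the values from the examples above:
import tensorflow as tf

grads = [tf.constant([-2., 3., 6.]), tf.constant([-4., 6., 12.])]  # L2 norms 7 and 14
# Per-tensor clipping with threshold 10: only the second tensor is rescaled,
# so the relative magnitude of the two gradients is no longer preserved.
clipped = [tf.clip_by_norm(g, 10.0) for g in grads]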
Note that the first two work on only one tensor, while the last one is used on a list of tensors.
tf.clip_by_global_norm rescales a list of tensors so that the total norm of the vector of all their norms does not exceed a threshold. The goal is the same as clip_by_norm (avoid exploding gradients, keep the gradient directions), but it works on all the gradients at once rather than on each one separately (that is, all of them are rescaled by the same factor if necessary, or none of them are rescaled). This is better, because the balance between the different gradients is maintained.
For instance:
tf.clip_by_global_norm([tf.constant([-2., 3., 6.]), tf.constant([-4., 6., 12.])], 14.5)
will rescale both tensors by a factor 14.5/sqrt(49 + 196), because the first tensor has an L2 norm of 7, the second one 14, and sqrt(7^2 + 14^2) > 14.5.
This (tf.clip_by_global_norm) is the one you should use for gradient clipping.
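In a custom training loop, this typically looks like the following sketch (the model, loss, data, and the threshold of 5.0 are placeholders chosen for illustration):
import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)
loss_fn = tf.keras.losses.MeanSquaredError()
x = tf.random.normal((32, 10))
y = tf.random.normal((32, 1))

with tf.GradientTape() as tape:
    loss = loss_fn(y, model(x))
grads = tape.gradient(loss, model.trainable_variables)
# Rescale all gradients together so that their global L2 norm is at most 5.0
clipped_grads, global_norm = tf.clip_by_global_norm(grads, 5.0)
optimizer.apply_gradients(zip(clipped_grads, model.trainable_variables))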
Choosing the max value is the hardest part. You should use the biggest value such that you don't have exploding gradients (whose symptoms can be NaNs or infinite values appearing in your tensors, or a constant loss/accuracy after a few training steps). The value should be bigger for tf.clip_by_global_norm than for the others, since the global L2 norm will be mechanically bigger than the per-tensor norms due to the number of tensors involved.
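One practical way to pick that threshold (a rough heuristic, not a rule) is to log the unclipped global norm during early training and choose a value a bit above what you typically observe:
import tensorflow as tf

grads = [tf.constant([-2., 3., 6.]), tf.constant([-4., 6., 12.])]  # e.g. from tape.gradient(...)
# tf.linalg.global_norm computes sqrt(7^2 + 14^2) ~= 15.65 here, without clipping anything;
# pick clip_norm slightly above the norms you see while training is still stable.
tf.print("global grad norm:", tf.linalg.global_norm(grads))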