
Difference between tf.clip_by_value and tf.clip_by_global_norm for RNN's and how to decide max value to clip on?

Want to understand the difference in roles of tf.clip_by_value and tf.clip_by_global_norm during the implementation of Gradient Clipping in TensorFlow. Which one is preferred and how to decide the max value to clip on?

asked Jun 28 '17 by Vishnu Sriram


People also ask

What is TF clip by value?

TensorFlow is an open-source Python library designed by Google for developing machine learning models and deep learning neural networks. clip_by_value() is used to clip tensor values to a specified min and max. Syntax: tensorflow.clip_by_value( t, clip_value_min, clip_value_max, name )

How do you implement gradient clipping in TensorFlow?

Applying gradient clipping in TensorFlow models is quite straightforward: you only need to pass the relevant parameter when constructing the optimizer. Keras optimizers have `clipnorm` and `clipvalue` parameters that can be used to clip the gradients.

What is gradient clipping?

Gradient clipping is a method where the error derivative is changed or clipped to a threshold during backward propagation through the network; the clipped gradients are then used to update the weights.


1 Answer

TL;DR: use tf.clip_by_global_norm for gradient clipping.

clip_by_value

tf.clip_by_value clips each value inside one tensor, regardless of the other values in the tensor. For instance,

tf.clip_by_value([-1, 2, 10], 0, 3)  -> [0, 2, 3]  # Only the values below 0 or above 3 are changed

Consequently, it can change the direction of the tensor. It should therefore be used when the values in the tensor are uncorrelated with one another (which is not the case for gradient clipping), or to avoid zero/infinite values in a tensor that could lead to NaN/infinite values elsewhere (for instance, by clipping with a minimum of epsilon=1e-8 and a very big max value).
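The element-wise behaviour can be reproduced with plain NumPy (a minimal sketch using np.clip as a stand-in for tf.clip_by_value, so it runs without TensorFlow):

```python
import numpy as np

# Element-wise clipping: each value is forced into [0, 3] independently
# of the other values in the tensor.
g = np.array([-1.0, 2.0, 10.0])
clipped = np.clip(g, 0.0, 3.0)
print(clipped)  # [0. 2. 3.]
```

Note that the clipped vector is no longer parallel to the original one, which is exactly why this is the wrong tool for gradient clipping.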

clip_by_norm

tf.clip_by_norm rescales a single tensor, if necessary, so that its L2 norm does not exceed a given threshold. It is typically useful to avoid an exploding gradient on one tensor, because the gradient direction is preserved. For instance:

tf.clip_by_norm([-2, 3, 6], 5)  -> [-2, 3, 6]*5/7  # The original L2 norm is 7, which is >5, so the final one is 5
tf.clip_by_norm([-2, 3, 6], 9)  -> [-2, 3, 6]  # The original L2 norm is 7, which is <9, so it is left unchanged

However, clip_by_norm works on only one gradient, so if you use it on all your gradient tensors, you'll unbalance them (some will be rescaled, others not, and not all with the same scale).
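The per-tensor rescaling above can be sketched in plain NumPy (an illustration of the arithmetic, not TensorFlow's implementation):

```python
import numpy as np

def clip_by_norm(t, clip_norm):
    """NumPy sketch of tf.clip_by_norm: rescale t only if its L2 norm exceeds clip_norm."""
    norm = np.linalg.norm(t)
    if norm > clip_norm:
        return t * (clip_norm / norm)
    return t

g = np.array([-2.0, 3.0, 6.0])   # L2 norm = sqrt(4 + 9 + 36) = 7
print(clip_by_norm(g, 5.0))      # rescaled by 5/7; direction preserved, norm now 5
print(clip_by_norm(g, 9.0))      # unchanged, since 7 < 9
```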

Note that the first two functions work on a single tensor, while the last one is used on a list of tensors.

clip_by_global_norm

tf.clip_by_global_norm rescales a list of tensors so that the total norm of the vector of all their norms does not exceed a threshold. The goal is the same as clip_by_norm (avoid exploding gradient, keep the gradient directions), but it works on all the gradients at once rather than on each one separately (that is, all of them are rescaled by the same factor if necessary, or none of them are rescaled). This is better, because the balance between the different gradients is maintained.

For instance:

tf.clip_by_global_norm([tf.constant([-2, 3, 6]),tf.constant([-4, 6, 12])] , 14.5)

will rescale both tensors by a factor 14.5/sqrt(49 + 196) ≈ 0.93, because the first tensor has an L2 norm of 7 and the second one 14, and sqrt(7^2 + 14^2) = sqrt(245) ≈ 15.65 > 14.5.

This (tf.clip_by_global_norm) is the one that you should use for gradient clipping.
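The global rescaling can also be sketched in plain NumPy. Like TensorFlow's version, this sketch returns both the clipped list and the global norm:

```python
import numpy as np

def clip_by_global_norm(tensors, clip_norm):
    """NumPy sketch of tf.clip_by_global_norm: one common scale for all tensors."""
    global_norm = np.sqrt(sum(np.sum(t ** 2) for t in tensors))
    # Scale is <= 1 and identical for every tensor, so their balance is kept.
    scale = clip_norm / max(global_norm, clip_norm)
    return [t * scale for t in tensors], global_norm

grads = [np.array([-2.0, 3.0, 6.0]), np.array([-4.0, 6.0, 12.0])]  # norms 7 and 14
clipped, gnorm = clip_by_global_norm(grads, 14.5)
print(gnorm)  # sqrt(49 + 196) = sqrt(245), about 15.65, above the 14.5 threshold
# Both tensors are rescaled by the same factor 14.5/15.65, so their
# relative magnitudes (1:2) are preserved.
```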

Choosing the value

Choosing the max value is the hardest part. You should use the biggest value such that you don't get exploding gradients (whose symptoms can be NaNs or infinite values appearing in your tensors, or a constant loss/accuracy after a few training steps). The value should be bigger for tf.clip_by_global_norm than for the others, since the global L2 norm is mechanically bigger than the per-tensor norms due to the number of tensors involved.
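One practical way to approach this (a heuristic sketch, not part of the original answer) is to log the global norm over a number of training steps and set the threshold slightly above its typical range, so that only genuinely exploding steps get rescaled:

```python
import numpy as np

def global_norm(tensors):
    """L2 norm of the concatenation of all tensors, as in tf.linalg.global_norm."""
    return np.sqrt(sum(np.sum(t ** 2) for t in tensors))

# Stand-in data: pretend these are the gradient lists observed over 20 steps.
rng = np.random.default_rng(0)
observed_norms = [global_norm([rng.normal(size=100), rng.normal(size=50)])
                  for _ in range(20)]

# Heuristic: clip a bit above the 90th percentile of observed norms.
clip_norm = 1.1 * np.percentile(observed_norms, 90)
print(clip_norm)
```

The 90th-percentile rule and the 1.1 margin are illustrative choices; the right threshold depends on the model and should be tuned by watching the loss.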

answered Oct 01 '22 by gdelab