I want to understand the difference between the roles of tf.clip_by_value and tf.clip_by_global_norm when implementing gradient clipping in TensorFlow. Which one is preferred, and how do I decide the max value to clip on?
TensorFlow is an open-source Python library developed by Google for building machine learning models and deep learning neural networks. clip_by_value() is used to clip a tensor's values to a specified min and max. Syntax: tensorflow.clip_by_value(t, clip_value_min, clip_value_max, name=None)
Applying gradient clipping in TensorFlow models is quite straightforward. The only thing you need to do is pass a parameter to the optimizer. All Keras optimizers accept `clipnorm` and `clipvalue` parameters that can be used to clip the gradients.
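For example, a minimal sketch of that Keras-level approach (the SGD optimizer and the thresholds here are arbitrary choices for illustration):
from tensorflow import keras

# clipnorm: each gradient tensor is rescaled so that its L2 norm is at most 1.0
opt_norm = keras.optimizers.SGD(learning_rate=0.01, clipnorm=1.0)

# clipvalue: each gradient element is clipped into [-0.5, 0.5]
opt_value = keras.optimizers.SGD(learning_rate=0.01, clipvalue=0.5)

# Recent TensorFlow versions also accept global_clipnorm, which clips all
# gradients together by their global norm (see tf.clip_by_global_norm below).
# opt_global = keras.optimizers.SGD(learning_rate=0.01, global_clipnorm=1.0)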
Gradient clipping is a method where the error derivative is changed or clipped to a threshold during backward propagation through the network, and the clipped gradients are then used to update the weights.
TL;DR: use tf.clip_by_global_norm for gradient clipping.
tf.clip_by_value clips each value inside one tensor, regardless of the other values in the tensor. For instance,
tf.clip_by_value([-1, 2, 10], 0, 3) -> [0, 2, 3] # Only the values below 0 or above 3 are changed
Consequently, it can change the direction of the tensor, so it should be used when the values in the tensor are decorrelated from one another (which is not the case for gradient clipping), or to avoid zero or infinite values in a tensor that could lead to NaN or infinite values elsewhere (for instance, by clipping with a minimum of epsilon=1e-8 and a very large max value).
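As a small, hypothetical illustration of that second use case (the probs values are made up):
import tensorflow as tf

probs = tf.constant([0.0, 0.3, 0.7])
# log(0) would give -inf; clipping into [1e-8, 1.0] keeps every value finite
safe_log = tf.math.log(tf.clip_by_value(probs, 1e-8, 1.0))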
tf.clip_by_norm rescales one tensor if necessary, so that its L2 norm does not exceed a certain threshold. It is typically useful to avoid exploding gradients on one tensor, because you keep the gradient direction. For instance:
tf.clip_by_norm([-2, 3, 6], 5) -> [-2, 3, 6]*5/7 # The original L2 norm is 7, which is >5, so the final one is 5
tf.clip_by_norm([-2, 3, 6], 9) -> [-2, 3, 6] # The original L2 norm is 7, which is <9, so it is left unchanged
However, clip_by_norm works on only one gradient, so if you use it on all your gradient tensors, you'll unbalance them (some will be rescaled, others not, and not all with the same scale).
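To make that imbalance concrete, here is a sketch reusing the values from the examples above:
import tensorflow as tf

grads = [tf.constant([-2., 3., 6.]), tf.constant([-4., 6., 12.])]  # L2 norms 7 and 14
# Per-tensor clipping with threshold 10: only the second tensor is rescaled,
# so the relative magnitude of the two gradients is no longer preserved.
clipped = [tf.clip_by_norm(g, 10.0) for g in grads]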
Note that the first two work on only one tensor, while the last one is used on a list of tensors.
tf.clip_by_global_norm rescales a list of tensors so that the total norm of the vector of all their norms does not exceed a threshold. The goal is the same as clip_by_norm (avoid exploding gradients, keep the gradient directions), but it works on all the gradients at once rather than on each one separately (that is, all of them are rescaled by the same factor if necessary, or none of them are rescaled). This is better, because the balance between the different gradients is maintained.
For instance:
tf.clip_by_global_norm([tf.constant([-2., 3., 6.]), tf.constant([-4., 6., 12.])], 14.5)
will rescale both tensors by a factor 14.5/sqrt(49 + 196), because the first tensor has an L2 norm of 7, the second one 14, and sqrt(7^2 + 14^2) > 14.5.
This (tf.clip_by_global_norm) is the one you should use for gradient clipping.
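In a custom training loop, this typically looks like the following sketch (the model, loss, data, and the threshold of 5.0 are placeholders chosen for illustration):
import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)
loss_fn = tf.keras.losses.MeanSquaredError()
x = tf.random.normal((32, 10))
y = tf.random.normal((32, 1))

with tf.GradientTape() as tape:
    loss = loss_fn(y, model(x))
grads = tape.gradient(loss, model.trainable_variables)
# Rescale all gradients together so that their global L2 norm is at most 5.0
clipped_grads, global_norm = tf.clip_by_global_norm(grads, 5.0)
optimizer.apply_gradients(zip(clipped_grads, model.trainable_variables))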
Choosing the max value is the hardest part. You should use the biggest value such that you don't have exploding gradients (whose symptoms can be NaNs or infinite values appearing in your tensors, or a constant loss/accuracy after a few training steps). The value should be bigger for tf.clip_by_global_norm than for the others, since the global L2 norm will be mechanically bigger than the per-tensor norms due to the number of tensors involved.
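One practical way to pick that threshold (a rough heuristic, not a rule) is to log the unclipped global norm during early training and choose a value a bit above what you typically observe:
import tensorflow as tf

grads = [tf.constant([-2., 3., 6.]), tf.constant([-4., 6., 12.])]  # e.g. from tape.gradient(...)
# tf.linalg.global_norm computes sqrt(7^2 + 14^2) ~= 15.65 here, without clipping anything;
# pick clip_norm slightly above the norms you see while training is still stable.
tf.print("global grad norm:", tf.linalg.global_norm(grads))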