I am following this tutorial on RNNs, where on line 177 the following code is executed.
max_grad_norm = 10
....
grads, _ = tf.clip_by_global_norm(tf.gradients(cost, tvars), max_grad_norm)
optimizer = tf.train.GradientDescentOptimizer(self.lr)
self._train_op = optimizer.apply_gradients(zip(grads, tvars),
                                           global_step=tf.contrib.framework.get_or_create_global_step())
Why do we do clip_by_global_norm? How is the value of max_grad_norm decided?
Gradient clipping ensures the gradient vector g has norm at most c. This helps gradient descent behave reasonably even when the loss landscape of the model is irregular. A typical example is a loss landscape with an extremely steep cliff, where a single unclipped gradient step can throw the parameters very far from where they started.
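For intuition, here is a minimal NumPy sketch of what clipping by global norm does (the helper name and the small epsilon are illustrative, not TensorFlow's actual implementation): every gradient in the list is rescaled by the same factor, so the update direction is preserved.

import numpy as np

def clip_by_global_norm_sketch(grads, max_norm):
    # Combined L2 norm over all gradient arrays.
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    # Rescale only if the norm exceeds the threshold; epsilon avoids division by zero.
    scale = min(1.0, max_norm / (global_norm + 1e-12))
    return [g * scale for g in grads], global_norm

grads = [np.array([3.0, 4.0]), np.array([12.0])]       # global norm = 13
clipped, norm = clip_by_global_norm_sketch(grads, max_norm=10.0)
print(norm)       # 13.0
print(clipped)    # same direction, total norm scaled down to 10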
Another popular technique to mitigate the exploding gradients problem is to clip the gradients during backpropagation so that they never exceed some threshold. This is called gradient clipping. For example, an optimizer configured to clip by value with a threshold of 1.0 will clip every component of the gradient vector to a value between -1.0 and 1.0.
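To see the difference between clipping by value and clipping by norm, here is a small NumPy comparison (the threshold of 1.0 is just an example): clamping each component separately changes the direction of the gradient, whereas rescaling the norm only shrinks its length.

import numpy as np

g = np.array([0.9, 100.0])                        # one very large component

# Per-value clipping: each entry is clamped independently, which distorts the direction.
by_value = np.clip(g, -1.0, 1.0)                  # -> [0.9, 1.0]

# Norm clipping: the whole vector is rescaled, which preserves the direction.
by_norm = g * min(1.0, 1.0 / np.linalg.norm(g))   # -> approx. [0.009, 1.0]

print(by_value, by_norm)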
In general, exploding gradients can be avoided by carefully configuring the network model, such as using a small learning rate, scaling the target variables, and using a standard loss function. However, in recurrent networks with a large number of input time steps, exploding gradients may still be an issue.
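As a very small illustration of two of those configuration choices (the numbers and variable names here are invented for the example), one can standardise the targets and keep the learning rate small:

import numpy as np
import tensorflow as tf   # TF 1.x style, as in the tutorial snippet above

# Scale the target variable so its values are not in the thousands.
y_raw = np.array([120.0, 3500.0, 47000.0])           # toy targets with a wide range
y_scaled = (y_raw - y_raw.mean()) / y_raw.std()       # standardised targets

# Use a deliberately small learning rate.
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.001)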
In PyTorch, nn.utils.clip_grad_norm_ performs gradient clipping. It is used to mitigate the problem of exploding gradients, which is of particular concern for recurrent networks (of which LSTMs are a type).
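For example, a minimal PyTorch training step (the model, data, and max_norm value are made up for illustration) places the clipping call between backward() and step():

import torch
import torch.nn as nn

model = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)   # toy recurrent model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
x = torch.randn(4, 20, 8)                  # (batch, time steps, features)

output, _ = model(x)
loss = output.pow(2).mean()                # dummy loss, just to get gradients
optimizer.zero_grad()
loss.backward()
# Clip after backward() and before step(); 5.0 is an arbitrary example threshold.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
optimizer.step()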
The reason for clipping the norm is that otherwise the gradient may explode:
There are two widely known issues with properly training recurrent neural networks, the vanishing and the exploding gradient problems detailed in Bengio et al. (1994). In this paper we attempt to improve the understanding of the underlying issues by exploring these problems from an analytical, a geometric and a dynamical systems perspective. Our analysis is used to justify a simple yet effective solution. We propose a gradient norm clipping strategy to deal with exploding gradients
The above is taken from this paper.
In terms of how to set max_grad_norm, you could play with it a bit and see how it affects your results. It is usually set to a fairly small number (I have seen 5 used in several cases). Note that TensorFlow does not force you to compute the global norm yourself: if you don't pass use_norm, tf.clip_by_global_norm computes it for you (as explained in the documentation); the clipping threshold max_grad_norm, however, is something you have to choose.
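One practical way to pick the value (a sketch in the TF 1.x style of the tutorial; the toy variable and cost are invented for illustration) is to log the unclipped global norm during training with tf.global_norm and set max_grad_norm somewhat below the occasional spikes:

import tensorflow as tf   # TF 1.x, as in the tutorial snippet above

w = tf.Variable([2.0, 3.0])
cost = tf.reduce_sum(w ** 4)                     # toy cost with large gradients
tvars = tf.trainable_variables()

grads = tf.gradients(cost, tvars)
grad_norm = tf.global_norm(grads)                # unclipped norm, worth logging during training
clipped, _ = tf.clip_by_global_norm(grads, 5.0)  # max_grad_norm = 5 as an example value

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run([grad_norm, tf.global_norm(clipped)]))   # roughly [112.6, 5.0]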
The reason that exploding/vanishing gradients are common in RNNs is that during backpropagation (called backpropagation through time) we need to multiply the gradient matrices all the way back to t=0. That is, if we are currently at t=100, say the 100th character in a sentence, we need to multiply roughly 100 matrices. Here is the equation for t=3, written for a hidden state s_t, prediction ŷ_t, loss E_3 and recurrent weights W:

$$\frac{\partial E_3}{\partial W} = \sum_{k=0}^{3} \frac{\partial E_3}{\partial \hat{y}_3}\,\frac{\partial \hat{y}_3}{\partial s_3}\left(\prod_{j=k+1}^{3} \frac{\partial s_j}{\partial s_{j-1}}\right)\frac{\partial s_k}{\partial W}$$

(this equation is taken from here)
If the norm of these matrices is bigger than 1, the product will eventually explode; if it is smaller than 1, it will eventually vanish. This can happen in ordinary neural networks as well if they have a lot of hidden layers. However, feed-forward networks usually don't have that many hidden layers, while the input sequences to an RNN can easily have many characters (time steps).
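A tiny NumPy experiment makes this concrete (the 4x4 diagonal matrix stands in for the recurrent Jacobian, so the norm of the product behaves exactly like scale ** T):

import numpy as np

T = 100                                    # number of time steps, as in the t=100 example above
for scale in (1.1, 0.9):                   # Jacobian norm slightly above / below 1
    W = scale * np.eye(4)                  # toy Jacobian of s_t with respect to s_{t-1}
    prod = np.linalg.matrix_power(W, T)    # product of 100 identical Jacobians
    print(scale, np.linalg.norm(prod))     # 1.1 -> huge, 0.9 -> practically zero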