I am following this tutorial on RNNs, where on line 177 the following code is executed.
max_grad_norm = 10
....
grads, _ = tf.clip_by_global_norm(tf.gradients(cost, tvars), max_grad_norm)
optimizer = tf.train.GradientDescentOptimizer(self.lr)
self._train_op = optimizer.apply_gradients(zip(grads, tvars),
                                           global_step=tf.contrib.framework.get_or_create_global_step())
Why do we do clip_by_global_norm? How is the value of max_grad_norm decided?
Gradient clipping ensures the gradient vector g has norm at most c. This helps gradient descent behave reasonably even when the loss landscape of the model is irregular. A typical example is a loss landscape with an extremely steep cliff, where a single unclipped gradient step can throw the parameters very far from where they started.
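For intuition, here is a minimal NumPy sketch of what clipping by global norm does (the helper name and the small epsilon are illustrative, not TensorFlow's actual implementation): every gradient in the list is rescaled by the same factor, so the update direction is preserved.

import numpy as np

def clip_by_global_norm_sketch(grads, max_norm):
    # Combined L2 norm over all gradient arrays.
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    # Rescale only if the norm exceeds the threshold; epsilon avoids division by zero.
    scale = min(1.0, max_norm / (global_norm + 1e-12))
    return [g * scale for g in grads], global_norm

grads = [np.array([3.0, 4.0]), np.array([12.0])]       # global norm = 13
clipped, norm = clip_by_global_norm_sketch(grads, max_norm=10.0)
print(norm)       # 13.0
print(clipped)    # same direction, total norm scaled down to 10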
Another popular technique to mitigate the exploding gradients problem is to clip the gradients during backpropagation so that they never exceed some threshold. This is called gradient clipping. For example, an optimizer configured to clip by value with a threshold of 1.0 will clip every component of the gradient vector to a value between -1.0 and 1.0.
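To see the difference between clipping by value and clipping by norm, here is a small NumPy comparison (the threshold of 1.0 is just an example): clamping each component separately changes the direction of the gradient, whereas rescaling the norm only shrinks its length.

import numpy as np

g = np.array([0.9, 100.0])                        # one very large component

# Per-value clipping: each entry is clamped independently, which distorts the direction.
by_value = np.clip(g, -1.0, 1.0)                  # -> [0.9, 1.0]

# Norm clipping: the whole vector is rescaled, which preserves the direction.
by_norm = g * min(1.0, 1.0 / np.linalg.norm(g))   # -> approx. [0.009, 1.0]

print(by_value, by_norm)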
In general, exploding gradients can be avoided by carefully configuring the network model, such as using a small learning rate, scaling the target variables, and using a standard loss function. However, in recurrent networks with a large number of input time steps, exploding gradients may still be an issue.
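As a very small illustration of two of those configuration choices (the numbers and variable names here are invented for the example), one can standardise the targets and keep the learning rate small:

import numpy as np
import tensorflow as tf   # TF 1.x style, as in the tutorial snippet above

# Scale the target variable so its values are not in the thousands.
y_raw = np.array([120.0, 3500.0, 47000.0])           # toy targets with a wide range
y_scaled = (y_raw - y_raw.mean()) / y_raw.std()       # standardised targets

# Use a deliberately small learning rate.
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.001)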
In PyTorch, nn.utils.clip_grad_norm_ performs gradient clipping. It is used to mitigate the problem of exploding gradients, which is of particular concern for recurrent networks (of which LSTMs are a type).
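For example, a minimal PyTorch training step (the model, data, and max_norm value are made up for illustration) places the clipping call between backward() and step():

import torch
import torch.nn as nn

model = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)   # toy recurrent model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
x = torch.randn(4, 20, 8)                  # (batch, time steps, features)

output, _ = model(x)
loss = output.pow(2).mean()                # dummy loss, just to get gradients
optimizer.zero_grad()
loss.backward()
# Clip after backward() and before step(); 5.0 is an arbitrary example threshold.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
optimizer.step()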
The reason for clipping the norm is that otherwise the gradient may explode:
There are two widely known issues with properly training recurrent neural networks, the vanishing and the exploding gradient problems detailed in Bengio et al. (1994). In this paper we attempt to improve the understanding of the underlying issues by exploring these problems from an analytical, a geometric and a dynamical systems perspective. Our analysis is used to justify a simple yet effective solution. We propose a gradient norm clipping strategy to deal with exploding gradients
The above is taken from this paper.
In terms of how to set max_grad_norm, you could play with it a bit and see how it affects your results. It is usually set to a fairly small number (I have seen 5 used in several cases). Note that TensorFlow does not force you to compute the global norm yourself: if you don't pass use_norm, tf.clip_by_global_norm computes it for you (as explained in the documentation); the clipping threshold max_grad_norm, however, is something you have to choose.
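One practical way to pick the value (a sketch in the TF 1.x style of the tutorial; the toy variable and cost are invented for illustration) is to log the unclipped global norm during training with tf.global_norm and set max_grad_norm somewhat below the occasional spikes:

import tensorflow as tf   # TF 1.x, as in the tutorial snippet above

w = tf.Variable([2.0, 3.0])
cost = tf.reduce_sum(w ** 4)                     # toy cost with large gradients
tvars = tf.trainable_variables()

grads = tf.gradients(cost, tvars)
grad_norm = tf.global_norm(grads)                # unclipped norm, worth logging during training
clipped, _ = tf.clip_by_global_norm(grads, 5.0)  # max_grad_norm = 5 as an example value

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run([grad_norm, tf.global_norm(clipped)]))   # roughly [112.6, 5.0]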
The reason that exploding/vanishing gradients are common in RNNs is that during backpropagation (called backpropagation through time) we need to multiply the gradient matrices all the way back to t=0. That is, if we are currently at t=100, say the 100th character in a sentence, we need to multiply roughly 100 matrices. Here is the equation for t=3, written for a hidden state s_t, prediction ŷ_t, loss E_3 and recurrent weights W:

$$\frac{\partial E_3}{\partial W} = \sum_{k=0}^{3} \frac{\partial E_3}{\partial \hat{y}_3}\,\frac{\partial \hat{y}_3}{\partial s_3}\left(\prod_{j=k+1}^{3} \frac{\partial s_j}{\partial s_{j-1}}\right)\frac{\partial s_k}{\partial W}$$

(this equation is taken from here)
If the norm of these matrices is bigger than 1, the product will eventually explode; if it is smaller than 1, it will eventually vanish. This can happen in ordinary neural networks as well if they have a lot of hidden layers. However, feed-forward networks usually don't have that many hidden layers, while the input sequences to an RNN can easily have many characters (time steps).
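A tiny NumPy experiment makes this concrete (the 4x4 diagonal matrix stands in for the recurrent Jacobian, so the norm of the product behaves exactly like scale ** T):

import numpy as np

T = 100                                    # number of time steps, as in the t=100 example above
for scale in (1.1, 0.9):                   # Jacobian norm slightly above / below 1
    W = scale * np.eye(4)                  # toy Jacobian of s_t with respect to s_{t-1}
    prod = np.linalg.matrix_power(W, T)    # product of 100 identical Jacobians
    print(scale, np.linalg.norm(prod))     # 1.1 -> huge, 0.9 -> practically zero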