 

Why do we apply clip_by_global_norm to the gradients when training an RNN?

Tags:

tensorflow

I am following this RNN tutorial, where on line 177 the following code is executed.

max_grad_norm = 10
....
# clip all gradients jointly so that their global L2 norm is at most max_grad_norm
grads, _ = tf.clip_by_global_norm(tf.gradients(cost, tvars), max_grad_norm)
optimizer = tf.train.GradientDescentOptimizer(self.lr)
# apply the clipped gradients rather than the raw ones
self._train_op = optimizer.apply_gradients(zip(grads, tvars),
   global_step=tf.contrib.framework.get_or_create_global_step())

Why do we do clip_by_global_norm? How is the value of max_grad_norm decided?

asked Apr 22 '17 by suku

People also ask

Why do we need gradient clipping?

Gradient clipping ensures the gradient vector g has norm at most c. This helps gradient descent behave reasonably even when the loss landscape of the model is irregular, for example near an extremely steep cliff.
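As a rough sketch of the idea in plain NumPy (not any particular framework's API): if the gradient's norm exceeds the threshold c, the gradient is rescaled so its norm is exactly c; otherwise it is left untouched.

import numpy as np

def clip_by_norm(g, c):
    # rescale gradient g so that its L2 norm is at most c
    norm = np.linalg.norm(g)
    if norm > c:
        g = g * (c / norm)   # direction preserved, only the length shrinks
    return g

g = np.array([3.0, 4.0])          # norm = 5
print(clip_by_norm(g, 1.0))       # -> [0.6 0.8], norm = 1
print(clip_by_norm(g, 10.0))      # unchanged, norm already below the threshold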

What approaches are taken to deal with the problem of exploding gradient in RNN?

Another popular technique to mitigate the exploding gradients problem is to clip the gradients during backpropagation so that they never exceed some threshold; this is called gradient clipping. One variant, clipping by value, forces every component of the gradient vector to lie between -1.0 and 1.0.
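Clipping by value is even simpler than norm clipping and, unlike it, can change the direction of the gradient, since each component is squashed independently. A minimal NumPy sketch (TensorFlow's analogous op is tf.clip_by_value):

import numpy as np

g = np.array([0.5, -3.0, 10.0])
clipped = np.clip(g, -1.0, 1.0)   # every component forced into [-1.0, 1.0]
print(clipped)                    # -> [ 0.5 -1.  1. ]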

How do we know that gradients are exploding How do we prevent it?

In general, exploding gradients can be avoided by carefully configuring the network model, such as using a small learning rate, scaling the target variables, and using a standard loss function. However, in recurrent networks with a large number of input time steps, exploding gradients may still be an issue.
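One practical way to notice exploding gradients is simply to track the global gradient norm during training and watch for sudden spikes or non-finite values. A rough NumPy sketch (the helper names and the threshold are illustrative, not from any library):

import numpy as np

def global_norm(grads):
    # L2 norm of all gradients taken together (the same quantity clip_by_global_norm uses)
    return np.sqrt(sum(np.sum(g ** 2) for g in grads))

def looks_exploded(grads, threshold=1e3):
    # flag a step whose global norm is non-finite or suspiciously large
    n = global_norm(grads)
    return not np.isfinite(n) or n > threshold

# hypothetical gradients from one training step
grads = [50 * np.random.randn(100, 100), 50 * np.random.randn(100)]
print(global_norm(grads), looks_exploded(grads))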

Why do we need to use torch nn utils Clip_grad_norm_ in training?

torch.nn.utils.clip_grad_norm_ performs gradient clipping. It is used to mitigate the problem of exploding gradients, which is of particular concern for recurrent networks (of which LSTMs are a kind).
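A typical usage pattern in a PyTorch training loop (a sketch with a toy LSTM and dummy data; the max_norm=5.0 value is just an example): clip_grad_norm_ is called between loss.backward() and optimizer.step(), so the already-computed gradients are rescaled in place before the update.

import torch
import torch.nn as nn

model = nn.LSTM(input_size=10, hidden_size=20, batch_first=True)   # toy recurrent model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(4, 30, 10)        # (batch, time, features) dummy batch
out, _ = model(x)
loss = out.pow(2).mean()          # dummy loss, just to produce gradients

optimizer.zero_grad()
loss.backward()                                                   # gradients computed here
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)  # then clipped in place
optimizer.step()                                                  # clipped gradients applied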


1 Answer

The reason for clipping the norm is that otherwise it may explode:

There are two widely known issues with properly training recurrent neural networks, the vanishing and the exploding gradient problems detailed in Bengio et al. (1994). In this paper we attempt to improve the understanding of the underlying issues by exploring these problems from an analytical, a geometric and a dynamical systems perspective. Our analysis is used to justify a simple yet effective solution. We propose a gradient norm clipping strategy to deal with exploding gradients

The quote above is taken from this paper (Pascanu, Mikolov and Bengio, On the difficulty of training recurrent neural networks).

In terms of how to set max_grad_norm, you could experiment with it a bit to see how it affects your results. It is usually set to a fairly small number (I have seen 5 in several cases). Note that clip_norm itself must be supplied; the only optional argument of tf.clip_by_global_norm is use_norm, the precomputed global norm, which TensorFlow computes for you if you don't pass it (as explained in the documentation).
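The snippet in the question is TensorFlow 1.x graph code. For reference, a roughly equivalent sketch in TensorFlow 2.x eager style (the Dense layer and random data are placeholders for a real RNN model and batch) also shows that tf.clip_by_global_norm returns the pre-clipping global norm, which you can log to get a feel for a sensible max_grad_norm:

import tensorflow as tf

model = tf.keras.layers.Dense(1)                        # stand-in for a real RNN model
optimizer = tf.keras.optimizers.SGD(learning_rate=0.1)
max_grad_norm = 5.0                                     # try a few values and watch the logged norms

x = tf.random.normal([32, 8])                           # dummy batch
y = tf.random.normal([32, 1])

with tf.GradientTape() as tape:
    loss = tf.reduce_mean(tf.square(model(x) - y))

grads = tape.gradient(loss, model.trainable_variables)
clipped, global_norm = tf.clip_by_global_norm(grads, max_grad_norm)
tf.print("global norm before clipping:", global_norm)   # helps judge a sensible max_grad_norm
optimizer.apply_gradients(zip(clipped, model.trainable_variables))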

The reason that exploding/vanishing gradients are common in RNNs is that during backpropagation (called backpropagation through time), we need to multiply the gradient matrices all the way back to t=0 (that is, if we are currently at t=100, say the 100th character in a sentence, we need to multiply 100 matrices). Here is the equation for t=3:

$$\frac{\partial E_3}{\partial W} = \sum_{k=0}^{3} \frac{\partial E_3}{\partial \hat{y}_3}\,\frac{\partial \hat{y}_3}{\partial s_3}\left(\prod_{j=k+1}^{3}\frac{\partial s_j}{\partial s_{j-1}}\right)\frac{\partial s_k}{\partial W}$$

(this equation is taken from here)

If the norm of those matrices is bigger than 1, the product will eventually explode; if it is smaller than 1, it will eventually vanish. This may happen in ordinary neural networks as well if they have a lot of hidden layers, but feed-forward networks usually don't have that many hidden layers, while the input sequences to an RNN can easily span many characters (time steps).
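A quick numerical illustration of why the repeated product explodes or vanishes (plain NumPy; the diagonal 2x2 matrices stand in for the recurrent Jacobians being multiplied at each time step):

import numpy as np

def norm_after_t_steps(jacobian, t):
    # spectral norm of the product of t copies of the same recurrent Jacobian
    return np.linalg.norm(np.linalg.matrix_power(jacobian, t), ord=2)

expanding = np.array([[1.2, 0.0], [0.0, 1.1]])   # norm > 1
shrinking = np.array([[0.9, 0.0], [0.0, 0.8]])   # norm < 1

for t in (1, 10, 100):
    print(t, norm_after_t_steps(expanding, t), norm_after_t_steps(shrinking, t))
# the first column of norms explodes as t grows, the second vanishes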

answered Dec 28 '22 by Miriam Farber