
Why do tensorflow and keras SimpleRNN layers have a default activation of tanh

I want to use a relu activation for my simple RNN in a tensorflow model I am building. It sits on top of a deep convolutional network. I am trying to classify a sequence of images. I noticed that the default activation in both the keras and tensorflow source code is tanh for simple RNNs. Is there a reason for this? Is there anything wrong with using relu? It seems like relu would help more with the vanishing gradient problem.

nn = tf.nn.rnn_cell.BasicRNNCell(1024, activation=tf.nn.relu)
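
For context, the keras layer I am referring to would look something like this (a minimal sketch; the unit count just mirrors the tensorflow cell above):

from keras.layers import SimpleRNN

rnn = SimpleRNN(1024, activation='relu')  # the default would be activation='tanh'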

asked Aug 27 '16 by chasep255


1 Answer

RNNs can suffer from both exploding gradient and vanishing gradient problems. When the sequence to learn is long, this is a very delicate balance that can tip into one problem or the other quite easily. Both problems are caused by repeated multiplication: at each time step the gradient is multiplied by the recurrent weight matrix and by the derivative of the activation, so if the magnitude of either is consistently different from 1.0, the gradient will tend to either explode or vanish.
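
As a crude one-dimensional illustration (not part of the original answer; the per-step factors are arbitrary), repeating that multiplication over many time steps shrinks or blows up the gradient depending on whether the factor is below or above 1.0:

grad = 1.0
for t in range(50):          # backpropagate through 50 time steps
    grad *= 0.9              # 0.9 stands in for weight * activation derivative < 1
print(grad)                  # roughly 0.005: the gradient has all but vanished
# With a per-step factor of 1.1 instead, the same loop gives roughly 117: exploding gradient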

ReLUs do not help with exploding gradient problems. In fact, because their output is unbounded, they can be worse than naturally bounded activation functions such as sigmoid or tanh when the weights are large.

ReLUs do help with vanishing gradient problems. However, the designs of LSTM and GRU cells are also intended to address the same problem (of dealing with learning from potentially weak signals many time steps away), and do so very effectively.
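
If vanishing gradients over longer sequences turn out to be the real problem, a sketch of swapping in a gated cell using the same TF 1.x API as the question (the unit count just mirrors the question's cell):

nn = tf.nn.rnn_cell.BasicLSTMCell(1024)   # gated cell designed to carry gradients across many steps
# or: nn = tf.nn.rnn_cell.GRUCell(1024)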

For a simple RNN with short time series, there should be nothing wrong with using a ReLU activation. To address the possibility of exploding gradients during training, you could look at gradient clipping (treating gradients outside of an allowed range as being the min or max of that range), as sketched below.
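
A minimal TF 1.x sketch of clipping by value (the loss tensor and the learning rate here are placeholders for whatever your model uses):

optimizer = tf.train.AdamOptimizer(1e-4)
grads_and_vars = optimizer.compute_gradients(loss)
# Clamp each gradient element into [-1, 1] before applying the update
clipped = [(tf.clip_by_value(g, -1.0, 1.0), v) for g, v in grads_and_vars if g is not None]
train_op = optimizer.apply_gradients(clipped)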

answered Nov 15 '22 by Neil Slater