I want to use a ReLU activation for my simple RNN in a TensorFlow model I am building. It sits on top of a deep convolutional network, and I am trying to classify a sequence of images. I noticed that the default activation in both the Keras and TensorFlow source code is tanh for simple RNNs. Is there a reason for this? Is there anything wrong with using ReLU? It seems like ReLU would help more with vanishing gradients.
nn = tf.nn.rnn_cell.BasicRNNCell(1024, activation=tf.nn.relu)
activation: Activation function to use. Default: hyperbolic tangent (tanh).
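For reference, overriding that default in tf.keras looks like this, a minimal sketch assuming TensorFlow 2.x, with the 1024-unit size simply mirroring the cell above:

import tensorflow as tf

# SimpleRNN defaults to activation='tanh'; passing 'relu' overrides it.
rnn_layer = tf.keras.layers.SimpleRNN(1024, activation='relu')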
The output from tanh can be positive or negative, allowing for both increases and decreases in the state. That's why tanh is used to compute the candidate values that get added to the LSTM's internal state. The GRU, the LSTM's cousin, doesn't have a second tanh, so in a sense the second one is not necessary.
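To make that concrete, here is a small illustrative sketch (NumPy only, with made-up gate values) of one LSTM-style state update, c_t = f * c_prev + i * candidate. A tanh candidate can move the state in either direction, while a ReLU candidate can only add non-negative values:

import numpy as np

c_prev = np.array([0.5, -0.2, 0.8])           # previous cell state (made up)
f = np.array([0.9, 0.9, 0.9])                 # forget gate values (made up)
i = np.array([0.5, 0.5, 0.5])                 # input gate values (made up)
pre_activation = np.array([-1.0, 0.3, -2.0])  # pre-activation of the candidate

tanh_candidate = np.tanh(pre_activation)          # values in (-1, 1)
relu_candidate = np.maximum(pre_activation, 0.0)  # values in [0, inf)

print(f * c_prev + i * tanh_candidate)  # state can decrease as well as increase
print(f * c_prev + i * relu_candidate)  # additions are never negative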
Activation: this parameter sets the element-wise activation function used in the Dense layer. By default it is set to None, which means the layer applies a linear (identity) activation.
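For example (a minimal tf.keras sketch; the layer size of 10 is arbitrary):

import tensorflow as tf

linear_head = tf.keras.layers.Dense(10)                   # activation=None, i.e. linear
relu_head = tf.keras.layers.Dense(10, activation='relu')  # explicit non-linearity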
RNNs can suffer from both exploding gradient and vanishing gradient problems. When the sequence to learn is long, this becomes a very delicate balance that can tip into one or the other quite easily. Both problems are caused by exponentiation: at each time step the gradient is multiplied by the recurrent weight matrix and by the derivative of the activation, so if the magnitude of either factor is different from 1.0, there will be a tendency towards exploding or vanishing.
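A quick back-of-the-envelope illustration of that exponentiation (pure Python, with made-up per-step factors):

# Repeated multiplication by a per-step factor over 100 time steps.
# A factor slightly below 1.0 vanishes, slightly above 1.0 explodes.
steps = 100
for factor in (0.9, 1.0, 1.1):
    print(factor, factor ** steps)
# 0.9 -> ~2.7e-05 (vanishing), 1.0 -> 1.0, 1.1 -> ~1.4e+04 (exploding)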
ReLUs do not help with exploding gradient problems. In fact they can be worse than saturating activation functions such as sigmoid or tanh, whose outputs are naturally bounded when the weights are large.
ReLUs do help with vanishing gradient problems. However, the designs of the LSTM and GRU cells are also intended to address the same problem (learning from potentially weak signals many time steps away), and they do so very effectively.
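If you do go the gated-cell route, the swap in tf.keras is a drop-in replacement (the 1024-unit size here is just a placeholder matching the question, not a recommendation):

import tensorflow as tf

simple = tf.keras.layers.SimpleRNN(1024, activation='relu')
lstm = tf.keras.layers.LSTM(1024)  # gated cell, keeps its default tanh/sigmoid
gru = tf.keras.layers.GRU(1024)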
For a simple RNN with short time series, there should be nothing wrong with using a ReLU activation. To address the possibility of exploding gradients during training, you could look at gradient clipping (treating gradient components outside an allowed range as the min or max of that range).
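In tf.keras, gradient clipping can be requested directly on the optimizer; a minimal sketch (the 1.0 thresholds are arbitrary):

import tensorflow as tf

# clipvalue clips each gradient component to [-1.0, 1.0];
# clipnorm rescales the whole gradient if its norm exceeds the threshold.
opt_by_value = tf.keras.optimizers.Adam(clipvalue=1.0)
opt_by_norm = tf.keras.optimizers.Adam(clipnorm=1.0)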