In an LSTM network (Understanding LSTMs), why do the input gate and output gate use tanh?
What is the intuition behind this?
Is it just a nonlinear transformation? If it is, can I change both to another activation function (e.g., ReLU)?
The combination is effectively tanh(x) * sigmoid(y), where the inputs to the two activation functions can be radically different. The intuition is that the LSTM can learn relatively "hard" switches, with the sigmoid pushed close to 0 or 1 (depending on the gate function and the input data) to decide how much of the tanh candidate passes through.
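As a rough sketch of that product (the pre-activation values below are made up; in a real LSTM they come from separate weight matrices applied to the previous hidden state and the input):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical pre-activations for the candidate and for the gate.
candidate_preact = np.array([-2.0, 0.5, 3.0])
gate_preact = np.array([-6.0, 0.0, 6.0])

candidate = np.tanh(candidate_preact)   # values in (-1, 1)
gate = sigmoid(gate_preact)             # values in (0, 1), nearly "hard" 0/1 at the extremes

gated = candidate * gate                # tanh(x) * sigmoid(y)
print(gate)   # ~[0.002, 0.5, 0.998] -- the gate acts like a soft switch
print(gated)  # first component almost fully blocked, last almost fully passed
```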
Most of the time, the tanh function is used in the hidden layers of a neural network because its values lie between -1 and 1, so the mean of the hidden-layer activations comes out at or very close to 0. Tanh therefore helps center the data by bringing the mean close to 0, which makes learning easier for the next layer.
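A quick numerical check of that centering claim, assuming zero-mean pre-activations going into the nonlinearity:

```python
import numpy as np

# Feed zero-mean pre-activations through both nonlinearities
# and compare the means of the resulting activations.
rng = np.random.default_rng(0)
z = rng.normal(loc=0.0, scale=1.0, size=100_000)

tanh_out = np.tanh(z)
sigmoid_out = 1.0 / (1.0 + np.exp(-z))

print(tanh_out.mean())     # close to 0
print(sigmoid_out.mean())  # close to 0.5
```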
Traditionally, LSTMs use the tanh activation function for the activation of the cell state and the sigmoid activation function for the node output. Given their careful design, ReLU was thought not to be appropriate by default for recurrent neural networks (RNNs) such as the Long Short-Term Memory network (LSTM).
The hyperbolic tangent activation function is also referred to simply as the Tanh (also “tanh” and “TanH“) function. It is very similar to the sigmoid activation function and even has the same S-shape. The function takes any real value as input and outputs values in the range -1 to 1.
Sigmoid, specifically, is used as the gating function for the three gates (input, output, and forget) in the LSTM, since it outputs a value between 0 and 1 and can therefore allow either no flow or complete flow of information through the gates.
On the other hand, to overcome the vanishing gradient problem, we need a function whose second derivative can sustain for a long range before going to zero, and tanh is a good function with that property.
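To make those roles concrete, here is a minimal NumPy sketch of one LSTM step (the standard equations with hypothetical weight shapes, not any particular library's API): the three gates go through sigmoid, while the candidate values and the exposed state go through tanh.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM step. W maps the concatenated [h_prev, x] to the four
    stacked pre-activations (input, forget, output gates and candidate)."""
    z = W @ np.concatenate([h_prev, x]) + b
    i, f, o, g = np.split(z, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)  # gates: values in (0, 1)
    g = np.tanh(g)                                # candidate: values in (-1, 1)
    c = f * c_prev + i * g                        # cell state can move up or down
    h = o * np.tanh(c)                            # squash the state before exposing it
    return h, c

# Tiny usage example with random parameters (hidden size 3, input size 2).
hidden, inputs = 3, 2
rng = np.random.default_rng(0)
W = rng.normal(size=(4 * hidden, hidden + inputs))
b = np.zeros(4 * hidden)
h, c = lstm_step(rng.normal(size=inputs), np.zeros(hidden), np.zeros(hidden), W, b)
print(h, c)
```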
A good neuron unit should be bounded, easily differentiable, monotonic (good for convex optimization), and easy to handle. If you consider these qualities, then I believe you can use ReLU in place of the tanh function, since the two are very good alternatives for each other.
But before choosing an activation function, you should know what the advantages and disadvantages of your choice over the others are. Below I briefly describe some of the activation functions and their advantages (a small NumPy sketch of all of them follows the list).
Sigmoid
Mathematical expression: sigmoid(z) = 1 / (1 + exp(-z))
First-order derivative: sigmoid'(z) = exp(-z) / (1 + exp(-z))^2 = sigmoid(z) * (1 - sigmoid(z))
Advantages:
(1) The sigmoid function has all the fundamental properties of a good activation function: it is bounded, easily differentiable, and monotonic.
Tanh
Mathematical expression: tanh(z) = [exp(z) - exp(-z)] / [exp(z) + exp(-z)]
First-order derivative: tanh'(z) = 1 - ([exp(z) - exp(-z)] / [exp(z) + exp(-z)])^2 = 1 - tanh^2(z)
Advantages:
(1) Often found to converge faster in practice than sigmoid (2) Gradient computation is less expensive
Hard Tanh
Mathematical expression: hardtanh(z) = -1 if z < -1; z if -1 <= z <= 1; 1 if z > 1
First-order derivative: hardtanh'(z) = 1 if -1 <= z <= 1; 0 otherwise
Advantages:
(1) Computationally cheaper than tanh (2) Saturates for magnitudes of z greater than 1
ReLU
Mathematical expression: relu(z) = max(z, 0)
First-order derivative: relu'(z) = 1 if z > 0; 0 otherwise
Advantages:
(1) Does not saturate even for large values of z (2) Found much success in computer vision applications
Leaky ReLU
Mathematical expression: leaky(z) = max(z, k*z), where 0 < k < 1
First-order derivative: leaky'(z) = 1 if z > 0; k otherwise
Advantages:
(1) Allows propagation of error for non-positive z, which ReLU doesn't
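For reference, the expressions and derivatives above translate directly into NumPy. This is only a sketch, and the leak factor k = 0.01 is an arbitrary example value:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def d_sigmoid(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def tanh(z):
    return np.tanh(z)

def d_tanh(z):
    return 1.0 - np.tanh(z) ** 2

def hardtanh(z):
    return np.clip(z, -1.0, 1.0)

def d_hardtanh(z):
    return ((z >= -1.0) & (z <= 1.0)).astype(float)

def relu(z):
    return np.maximum(z, 0.0)

def d_relu(z):
    return (z > 0).astype(float)

def leaky_relu(z, k=0.01):
    return np.maximum(z, k * z)

def d_leaky_relu(z, k=0.01):
    return np.where(z > 0, 1.0, k)

# Evaluate each activation on a small grid to compare their ranges.
z = np.linspace(-3, 3, 7)
for f in (sigmoid, tanh, hardtanh, relu, leaky_relu):
    print(f.__name__, f(z).round(3))
```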
This paper explains some fun activation functions; you may consider reading it.
LSTMs manage an internal state vector whose values should be able to increase or decrease when we add the output of some function. Sigmoid output is always non-negative; values in the state would only increase. The output from tanh can be positive or negative, allowing for increases and decreases in the state.
That's why tanh is used to determine candidate values to get added to the internal state. The GRU cousin of the LSTM doesn't have a second tanh, so in a sense the second one is not necessary. Check out the diagrams and explanations in Chris Olah's Understanding LSTM Networks for more.
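A toy illustration of that point (made-up candidate pre-activations, with the forget gate held at 1 and the input gate fully open, so only the additions matter): a sigmoid candidate can only push the state up, while a tanh candidate lets it move down as well.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
c_sig = c_tanh = 0.0
preacts = rng.normal(size=20)   # hypothetical candidate pre-activations over 20 steps

for z in preacts:
    c_sig += sigmoid(z)         # sigmoid candidate: every update is >= 0, state only grows
    c_tanh += np.tanh(z)        # tanh candidate: updates can be negative, state can shrink

print(c_sig)   # keeps growing, ends up large and positive
print(c_tanh)  # updates partly cancel, magnitude stays much smaller
```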
The related question, "Why are sigmoids used in LSTMs where they are?" is also answered based on the possible outputs of the function: "gating" is achieved by multiplying by a number between zero and one, and that's what sigmoids output.
There aren't really meaningful differences between the derivatives of sigmoid and tanh; tanh is just a rescaled and shifted sigmoid: see Richard Socher's Neural Tips and Tricks. If second derivatives are relevant, I'd like to know how.
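That rescaling is easy to verify numerically, since tanh(z) = 2*sigmoid(2z) - 1, so the two derivatives differ only by constant factors:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# tanh is a rescaled, shifted sigmoid.
z = np.linspace(-5, 5, 11)
print(np.allclose(np.tanh(z), 2.0 * sigmoid(2.0 * z) - 1.0))  # True
```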