
What is the intuition of using tanh in LSTM? [closed]

In an LSTM network (Understanding LSTMs), why do the input gate and output gate use tanh?

What is the intuition behind this?

Is it just a nonlinear transformation? If so, can I change both to another activation function (e.g., ReLU)?

DNK asked Nov 23 '16 10:11




2 Answers

Sigmoid is used as the gating function for the three gates (input, output, and forget) in an LSTM because it outputs a value between 0 and 1, so a gate can block the flow of information entirely, let it through completely, or do anything in between.

On the other hand, to overcome the vanishing gradient problem, we need a function whose second derivative stays nonzero over a long range before going to zero. Tanh is a good function with this property.

A good neuron unit should be bounded, easily differentiable, monotonic (good for convex optimization) and easy to handle. If you consider these qualities, then I believe you can use ReLU in place of the tanh function, since they are good alternatives to each other. Keep in mind, though, that ReLU is not bounded above, so it does not meet the first criterion.

But before choosing an activation function, you should know its advantages and disadvantages relative to the others. Below I briefly describe some activation functions and their advantages.

Sigmoid

Mathematical expression: sigmoid(z) = 1 / (1 + exp(-z))

First-order derivative: sigmoid'(z) = exp(-z) / (1 + exp(-z))^2 = sigmoid(z) * (1 - sigmoid(z))

Advantages:

(1) The sigmoid function has all the fundamental properties of a good activation function. 

Tanh

Mathematical expression: tanh(z) = [exp(z) - exp(-z)] / [exp(z) + exp(-z)]

First-order derivative: tanh'(z) = 1 - ([exp(z) - exp(-z)] / [exp(z) + exp(-z)])^2 = 1 - tanh^2(z)

Advantages:

(1) Often found to converge faster in practice (2) Gradient computation is less expensive 
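As a rough sanity check on the sigmoid and tanh derivatives, here is a minimal Python sketch that compares the closed forms sigmoid(z) * (1 - sigmoid(z)) and 1 - tanh^2(z) against a numerical central difference. The helper names (`sigmoid_grad`, `tanh_grad`, `central_difference`) and the step size `h` are my own choices for illustration, not part of any library.

```python
import math


def sigmoid(z):
    """Logistic sigmoid: 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_grad(z):
    """Closed-form derivative: sigmoid(z) * (1 - sigmoid(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)


def tanh_grad(z):
    """Closed-form derivative: 1 - tanh(z)^2."""
    return 1.0 - math.tanh(z) ** 2


def central_difference(f, z, h=1e-5):
    """Numerical derivative, used only to check the closed forms."""
    return (f(z + h) - f(z - h)) / (2.0 * h)


# The closed forms should agree with the numerical estimate at every test point.
for z in (-2.0, 0.0, 0.5, 3.0):
    assert abs(sigmoid_grad(z) - central_difference(sigmoid, z)) < 1e-8
    assert abs(tanh_grad(z) - central_difference(math.tanh, z)) < 1e-8
```

Both derivatives peak at z = 0 and shrink toward zero as |z| grows, which is the saturation behaviour that matters when gradients are multiplied across many time steps.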

Hard Tanh

Mathematical expression: hardtanh(z) = -1 if z < -1; z if -1 <= z <= 1; 1 if z > 1

First-order derivative: hardtanh'(z) = 1 if -1 <= z <= 1; 0 otherwise

Advantages:

(1) Computationally cheaper than Tanh (2) Saturates for magnitudes of z greater than 1

ReLU

Mathematical expression: relu(z) = max(z, 0)

First-order derivative: relu'(z) = 1 if z > 0; 0 otherwise

Advantages:

(1) Does not saturate even for large values of z (2) Found much success in computer vision applications 

Leaky ReLU

Mathematical expression: leaky(z) = max(z, k * z) where 0 < k < 1

First-order derivative: leaky'(z) = 1 if z > 0; k otherwise

Advantages:

(1) Allows propagation of error for non-positive z which ReLU doesn't 
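The three piecewise-linear functions above (Hard Tanh, ReLU, Leaky ReLU) can be written in a few lines of plain Python. This is only a sketch of the formulas as stated: the function names are mine, and the default slope `k=0.01` is an assumed value, since the text only requires 0 < k < 1.

```python
def hardtanh(z):
    """Clamp z to the interval [-1, 1]."""
    return max(-1.0, min(1.0, z))


def hardtanh_grad(z):
    """1 inside [-1, 1], 0 in the saturated regions."""
    return 1.0 if -1.0 <= z <= 1.0 else 0.0


def relu(z):
    """max(z, 0): passes positive inputs, zeroes the rest."""
    return max(z, 0.0)


def relu_grad(z):
    """1 for positive z, 0 otherwise (no error flows for z <= 0)."""
    return 1.0 if z > 0 else 0.0


def leaky_relu(z, k=0.01):
    """max(z, k * z) with 0 < k < 1; k=0.01 is an assumed default."""
    return max(z, k * z)


def leaky_relu_grad(z, k=0.01):
    """1 for positive z, k otherwise, so some error still propagates."""
    return 1.0 if z > 0 else k


# Compare how each function treats a negative and a large positive input.
for z in (-3.0, 0.5, 5.0):
    print(z, hardtanh(z), relu(z), leaky_relu(z),
          hardtanh_grad(z), relu_grad(z), leaky_relu_grad(z))
```

For a negative input, the ReLU gradient is zero while the Leaky ReLU gradient is `k`, which is the difference described in the Leaky ReLU advantage above.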

This paper describes some fun activation functions. You may want to read it.

Wasi Ahmad answered Oct 11 '22 06:10


LSTMs manage an internal state vector whose values should be able to increase or decrease when we add the output of some function. Sigmoid output is always non-negative; values in the state would only increase. The output from tanh can be positive or negative, allowing for increases and decreases in the state.

That's why tanh is used to determine candidate values to get added to the internal state. The GRU cousin of the LSTM doesn't have a second tanh, so in a sense the second one is not necessary. Check out the diagrams and explanations in Chris Olah's Understanding LSTM Networks for more.
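To make the roles of the two functions concrete, here is a minimal sketch of one step of a single-unit (scalar) LSTM, following the usual formulation described in Understanding LSTM Networks. The function `lstm_step`, the `params` layout and the weight values are all hypothetical, chosen only so the behaviour is easy to see; a real layer works on vectors with learned weight matrices.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, params):
    """One step of a single-unit LSTM.

    params maps a gate name to a hypothetical (w_x, w_h, b) triple.
    Returns the new hidden value, the new cell state, and the amount
    that was added to the state in this step.
    """
    def pre(name):
        w_x, w_h, b = params[name]
        return w_x * x + w_h * h_prev + b

    f = sigmoid(pre("forget"))             # in (0, 1): share of old state kept
    i = sigmoid(pre("input"))              # in (0, 1): share of candidate admitted
    o = sigmoid(pre("output"))             # in (0, 1): share of state exposed
    c_tilde = math.tanh(pre("candidate"))  # in (-1, 1): signed candidate value

    update = i * c_tilde                   # can be positive or negative
    c = f * c_prev + update
    h = o * math.tanh(c)                   # the "second" tanh
    return h, c, update


# Hypothetical weights: gates held mostly open by their biases,
# candidate driven directly by the input x.
params = {
    "forget": (0.0, 0.0, 4.0),
    "input": (0.0, 0.0, 4.0),
    "output": (0.0, 0.0, 4.0),
    "candidate": (2.0, 0.0, 0.0),
}

h, c = 0.0, 0.0
for x in (1.0, 1.0, -1.0, -1.0, -1.0):
    h, c, update = lstm_step(x, h, c, params)
    print(f"x={x:+.0f}  update={update:+.3f}  c={c:+.3f}  h={h:+.3f}")
```

With these weights the update has the same sign as `x`, so the cell state rises on positive inputs and falls on negative ones. Swapping `math.tanh` for `sigmoid` in the `c_tilde` line would make every update positive, which is the limitation described above.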

The related question, "Why are sigmoids used in LSTMs where they are?" is also answered based on the possible outputs of the function: "gating" is achieved by multiplying by a number between zero and one, and that's what sigmoids output.

There aren't really meaningful differences between the derivatives of sigmoid and tanh; tanh is just a rescaled and shifted sigmoid: see Richard Socher's Neural Tips and Tricks. If second derivatives are relevant, I'd like to know how.
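The "rescaled and shifted sigmoid" relationship can be written out explicitly. The short derivation below uses σ for the sigmoid, σ(z) = 1 / (1 + e^{-z}), and follows directly from the two definitions; it is standard algebra rather than anything specific to LSTMs.

```latex
\begin{align*}
\tanh(z) &= \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}
          = \frac{1 - e^{-2z}}{1 + e^{-2z}}
          = \frac{2}{1 + e^{-2z}} - 1
          = 2\,\sigma(2z) - 1, \\
\tanh'(z) &= 4\,\sigma'(2z) = 4\,\sigma(2z)\bigl(1 - \sigma(2z)\bigr).
\end{align*}
```

So the derivative of tanh is the sigmoid derivative with the input scaled by 2 and the output scaled by 4: its peak is 1 at z = 0, compared with 1/4 for the sigmoid, but the shape is otherwise the same.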

Aaron Schumacher answered Oct 11 '22 08:10