
Why use tanh for activation function of MLP?


I'm studying the theory of neural networks on my own and have some questions.

In many books and references, the hyperbolic tangent (tanh) function is used as the activation function for the hidden layers.

The books give a fairly simple reason: linear combinations of tanh functions can approximate nearly any function to within a given error.

But this raised some questions:

  1. Is this the real reason why the tanh function is used?
  2. If so, is it the only reason why the tanh function is used?
  3. If so, is the tanh function the only function that can do that?
  4. If not, what is the real reason?

I'm stuck here, going around in circles... please help me out of this mental trap!
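
To make the books' claim concrete, here is a minimal sketch (my own illustration, not from the books: it uses NumPy, an arbitrary target function sin(2x), and random hidden-unit parameters) showing that a linear combination of shifted and scaled tanh units can approximate a smooth function:

    import numpy as np

    # Approximate a 1-D function with a linear combination of tanh units,
    # i.e. f(x) ~= sum_j w_j * tanh(a_j * x + b_j).
    rng = np.random.default_rng(0)

    x = np.linspace(-3, 3, 200)[:, None]   # inputs, shape (200, 1)
    y = np.sin(2 * x)                      # target function (chosen arbitrarily)

    n_hidden = 50
    a = rng.normal(size=(1, n_hidden))     # random slopes of the tanh units
    b = rng.normal(size=n_hidden)          # random shifts of the tanh units
    H = np.tanh(x @ a + b)                 # hidden activations, shape (200, 50)

    # Fit only the output weights by least squares; even this restricted fit
    # is usually enough to drive the approximation error down.
    w, *_ = np.linalg.lstsq(H, y, rcond=None)
    print("max abs error:", np.abs(H @ w - y).max())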

forsythia asked Jun 18 '14 09:06


People also ask

Why do LSTMs use tanh?

In an LSTM network, the tanh activation function is used to compute the candidate cell state (internal state) values ( \tilde{C}_{t} ) and to update the hidden state ( h_{t} ).
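
For reference, the standard LSTM update equations behind that sentence, in the usual notation where f_t, i_t, o_t are the forget, input, and output gates, are commonly written as:

    \tilde{C}_t = \tanh(W_C \, [h_{t-1}, x_t] + b_C)
    C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t
    h_t = o_t \odot \tanh(C_t)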

Which activation function is used for MLP?

Multilayer Perceptron (MLP): ReLU activation function. Convolutional Neural Network (CNN): ReLU activation function.

Why is tanh used in neural networks?

Hyperbolic Tangent Function (Tanh): The biggest advantage of the tanh function is that it produces a zero-centered output, thereby supporting the backpropagation process. The tanh function has mostly been used in recurrent neural networks for natural language processing and speech recognition tasks.

Why do we use non-linear activation functions such as ReLU and tanh in neural networks?

Why do we need non-linear activation functions: a neural network without an activation function is essentially just a linear regression model. The activation function performs a non-linear transformation of the input, making the network capable of learning and performing more complex tasks.
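
A quick way to see the "just a linear regression model" point: without a non-linearity, stacking layers collapses into a single linear map. A minimal sketch (my own illustration with NumPy and arbitrary random weights):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=(4, 1))            # one input vector
    W1 = rng.normal(size=(5, 4))           # "hidden" layer weights
    W2 = rng.normal(size=(3, 5))           # output layer weights

    # Two stacked layers with no activation ...
    two_linear_layers = W2 @ (W1 @ x)
    # ... are exactly one linear layer with weights W2 @ W1.
    single_linear_layer = (W2 @ W1) @ x
    print(np.allclose(two_linear_layers, single_linear_layer))  # True

    # With a non-linearity in between, this collapse no longer happens.
    with_tanh = W2 @ np.tanh(W1 @ x)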


2 Answers

Most of the time, tanh converges faster than the sigmoid/logistic function and gives better accuracy [1]. However, the rectified linear unit (ReLU), proposed by Hinton [2], has been shown to train about six times faster than tanh [3] to reach the same training error. You can refer to [4] to see what benefits ReLU provides.
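
One way to see why tanh tends to converge faster than the logistic sigmoid is to compare their outputs and gradients. A small sketch (my own illustration, not from the cited papers):

    import numpy as np

    x = np.linspace(-4, 4, 9)

    sigmoid = 1.0 / (1.0 + np.exp(-x))
    tanh = np.tanh(x)

    # Derivatives: sigmoid'(x) = s(1 - s) peaks at 0.25, while
    # tanh'(x) = 1 - tanh(x)^2 peaks at 1.0, so gradients are larger and
    # the outputs are centered around 0 instead of 0.5.
    d_sigmoid = sigmoid * (1.0 - sigmoid)
    d_tanh = 1.0 - tanh ** 2
    print("max sigmoid gradient:", d_sigmoid.max())  # ~0.25 (near x = 0)
    print("max tanh gradient:   ", d_tanh.max())     # ~1.0  (near x = 0)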


Based on about two years of machine learning experience, I want to share some strategies that most papers use, along with my own experience in computer vision.

Normalizing input is very important

Normalizing well leads to better performance and faster convergence. Most of the time we subtract the mean so that the input has zero mean, which prevents the weights from all changing in the same direction and converging slowly [5]. Recently Google also pointed out this phenomenon, calling it internal covariate shift when training deep networks, and proposed batch normalization [6] to normalize each activation vector to zero mean and unit variance.
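
A minimal sketch of the input-normalization step described above (an assumption of mine: X is an (n_samples, n_features) NumPy design matrix):

    import numpy as np

    def normalize(X, eps=1e-8):
        """Zero-center each feature and scale it to unit variance,
        using statistics computed on the training set only."""
        mean = X.mean(axis=0)
        std = X.std(axis=0)
        return (X - mean) / (std + eps), mean, std

    # Usage sketch: reuse the training statistics for validation/test data.
    X_train = np.random.default_rng(0).normal(5.0, 3.0, size=(100, 4))
    X_train_norm, mean, std = normalize(X_train)
    # X_test_norm = (X_test - mean) / (std + 1e-8)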

More data, more accuracy

More training data covers the feature space better and helps prevent overfitting. In computer vision, if the training data is not enough, the most commonly used techniques to enlarge the training set are data augmentation and synthesizing training data.
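
As a rough illustration of data augmentation (a hypothetical sketch, assuming images arrive as a NumPy array of shape (N, H, W, C)):

    import numpy as np

    def augment(images, rng):
        """Very simple augmentation: random horizontal flips plus a small
        random translation. Real pipelines add crops, color jitter, etc."""
        out = images.copy()
        flip = rng.random(len(out)) < 0.5
        out[flip] = out[flip, :, ::-1]           # mirror selected images left-right
        shift = rng.integers(-2, 3)              # shift the batch by -2..2 pixels
        out = np.roll(out, shift, axis=2)        # translate along the width axis
        return out

    rng = np.random.default_rng(0)
    batch = rng.random(size=(8, 32, 32, 3))      # fake image batch
    augmented = augment(batch, rng)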

Choosing a good activation function allows the network to train better and more efficiently.

The ReLU nonlinearity works better and has produced state-of-the-art results in deep learning and MLPs. Moreover, it has some benefits, e.g. it is simple to implement and cheaper to compute in back-propagation, so deeper networks can be trained efficiently. However, ReLU has zero gradient and does not train when a unit is inactive (its input is negative). Hence some modified ReLUs have been proposed, e.g. Leaky ReLU and Noisy ReLU; the most popular is PReLU [7], proposed by Microsoft, which generalizes the traditional rectified unit.
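
A compact sketch of the variants mentioned above (forward pass only, NumPy; the 0.01 slope and the alpha value are illustrative, not prescribed):

    import numpy as np

    def relu(x):
        return np.maximum(0.0, x)                # zero output and zero gradient for x < 0

    def leaky_relu(x, slope=0.01):
        return np.where(x > 0, x, slope * x)     # small fixed negative slope

    def prelu(x, alpha):
        # Same form as Leaky ReLU, but alpha is a learned parameter
        # (per channel in the original PReLU formulation).
        return np.where(x > 0, x, alpha * x)

    x = np.array([-2.0, -0.5, 0.0, 1.5])
    print(relu(x), leaky_relu(x), prelu(x, alpha=0.25))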

Others

  • Choose a large initial learning rate, as long as training does not oscillate or diverge, so as to find a better global minimum.
  • Shuffle the training data between epochs (see the sketch below).
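
A minimal sketch of per-epoch shuffling (my own illustration; X, y, and the epoch loop are placeholders):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 4))                # placeholder features
    y = rng.integers(0, 2, size=100)             # placeholder labels

    for epoch in range(3):
        perm = rng.permutation(len(X))           # new order every epoch
        X_shuffled, y_shuffled = X[perm], y[perm]
        # ... iterate over mini-batches of X_shuffled / y_shuffled ...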
RyanLiu answered Oct 14 '22 22:10


In truth, both the tanh and the logistic function can be used. The idea is that you can map any real number ([-Inf, Inf]) to a number in [-1, 1] or [0, 1] for tanh and the logistic function respectively. In this way, it can be shown that a combination of such functions can approximate any non-linear function. The preference for tanh over the logistic function is that the former is symmetric about 0 while the latter is not. This makes the logistic function more prone to saturating the later layers, which makes training more difficult.
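
The two functions are in fact rescaled versions of each other, tanh(x) = 2*sigmoid(2x) - 1, which makes the symmetry point easy to check numerically (a small sketch of my own):

    import numpy as np

    x = np.linspace(-3, 3, 7)
    sigmoid = 1.0 / (1.0 + np.exp(-x))

    # tanh is the logistic function rescaled to be symmetric about 0:
    # tanh(x) = 2 * sigmoid(2x) - 1, so its outputs are zero-centered
    # while the logistic outputs are centered around 0.5.
    print(np.allclose(np.tanh(x), 2.0 / (1.0 + np.exp(-2.0 * x)) - 1.0))  # True
    print("mean tanh output:   ", np.tanh(x).mean())       # ~0
    print("mean sigmoid output:", sigmoid.mean())           # ~0.5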

ASantosRibeiro answered Oct 14 '22 21:10