 

Activation function for multilayer perceptron

I have tried to train a simple backpropagation neural network on the XOR function. When I use tanh(x) as the activation function, with the derivative 1-tanh(x)^2, I get the right result after about 1,000 iterations. However, when I use g(x) = 1/(1+e^(-x)) as the activation function, with the derivative g(x)*(1-g(x)), I need about 50,000 iterations to get the right result. What can be the reason?

Thank you.

asked Jan 15 '23 by user1767774


1 Answer

Yes, what you observe is true. I have made similar observations when training neural networks with backpropagation. For the XOR problem, I set up a 2x20x2 network; with the logistic function it takes 3000+ episodes to get the result below:

[0, 0] -> [0.049170633762142486]
[0, 1] -> [0.947292007836417]
[1, 0] -> [0.9451808598939389]
[1, 1] -> [0.060643862846171494]

Using tanh as the activation function, here is the result after 800 episodes. tanh converges consistently faster than the logistic function.

[0, 0] -> [-0.0862215901296476]
[0, 1] -> [0.9777578145233919]
[1, 0] -> [0.9777632805205176]
[1, 1] -> [0.12637838259658932]
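
If you want to reproduce this kind of comparison yourself, here is a minimal sketch (not my original code) of a tiny XOR network trained with plain backpropagation, using the two activation/derivative pairs from the question. The 2-4-1 architecture, learning rate, epoch counts, and random seed are assumptions, and a given seed may need more epochs or hidden units to converge.

```python
# Minimal XOR backprop sketch: compare logistic vs. tanh activations.
# Architecture (2-4-1), lr, epochs, and seed are illustrative assumptions.
import numpy as np

def logistic(x):       return 1.0 / (1.0 + np.exp(-x))
def logistic_prime(y): return y * (1.0 - y)   # derivative in terms of the output g(x)
def tanh_prime(y):     return 1.0 - y ** 2    # derivative in terms of the output tanh(x)

def train_xor(act, act_prime, epochs, lr=0.5, seed=0, hidden=4):
    rng = np.random.default_rng(seed)
    X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
    t = np.array([[0.], [1.], [1.], [0.]])
    W1 = rng.normal(scale=0.5, size=(2, hidden)); b1 = np.zeros(hidden)
    W2 = rng.normal(scale=0.5, size=(hidden, 1)); b2 = np.zeros(1)
    for _ in range(epochs):
        # forward pass
        h = act(X @ W1 + b1)
        y = act(h @ W2 + b2)
        # backward pass for mean squared error
        dy = (y - t) * act_prime(y)
        dh = (dy @ W2.T) * act_prime(h)
        W2 -= lr * h.T @ dy; b2 -= lr * dy.sum(axis=0)
        W1 -= lr * X.T @ dh; b1 -= lr * dh.sum(axis=0)
    return y

print(train_xor(logistic, logistic_prime, epochs=5000))  # logistic: typically needs many more epochs
print(train_xor(np.tanh, tanh_prime, epochs=800))        # tanh: usually close to [0, 1, 1, 0] already
```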

The shapes of the two functions are shown below (credit: Efficient BackProp):

[Figure: the logistic and tanh activation functions]

  • The left is the standard logistic function: 1/(1+e^(-x)).
  • The right is the tanh function, also known as hyperbolic tangent.

It's easy to see that tanh is antisymmetric about the origin.

According to Efficient BackProp,

Symmetric sigmoids such as tanh often converge faster than standard logistic function.

Also, from the Wikipedia article on logistic regression:

Practitioners caution that sigmoidal functions which are antisymmetric about the origin (e.g. the hyperbolic tangent) lead to faster convergence when training networks with backpropagation.

See Efficient BackProp for more details on the intuition behind this.
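
To make the "antisymmetric about the origin" point concrete, here is a small numerical illustration (my own addition, not taken from either reference): logistic outputs are always positive and cluster around 0.5, so a hidden layer feeds a systematically biased signal to the next layer, while tanh outputs of zero-mean inputs stay roughly zero-mean.

```python
# Illustrative sketch of the bias-shift intuition from Efficient BackProp.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)              # zero-mean pre-activations

logistic_out = 1.0 / (1.0 + np.exp(-x))
tanh_out = np.tanh(x)

print("mean of logistic(x):", logistic_out.mean())  # ~0.5, never negative
print("mean of tanh(x):    ", tanh_out.mean())      # ~0.0, centered at the origin
```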

See the Elliott function for an alternative to tanh that is cheaper to compute. It's shown below as the black curve (the blue one is the original tanh).

[Figure: the Elliott function (black) compared with tanh (blue)]
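
For reference, the Elliott function is x / (1 + |x|). Here is a short sketch; the side-by-side comparison with tanh is my own illustration, not code from the linked source.

```python
# Elliott activation and its derivative (with respect to the input x).
import numpy as np

def elliott(x):
    return x / (1.0 + np.abs(x))

def elliott_prime(x):
    return 1.0 / (1.0 + np.abs(x)) ** 2

x = np.linspace(-4, 4, 9)
print(np.round(elliott(x), 3))   # squashes into (-1, 1) like tanh, but more gently
print(np.round(np.tanh(x), 3))
```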

[Chart: training iterations and training times for TANH vs. Elliott on an encoder task]

Two things should stand out from the above chart. First, TANH usually needed fewer iterations to train than Elliott, so purely in terms of iteration count Elliott does not do as well for an encoder. However, notice the training times: even with the extra iterations it had to do, Elliott completed the entire task in half the time of TANH, while delivering the same final training error. Each iteration is so much cheaper that, in this case, Elliott literally cuts your training time in half.
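
If you want to check the speed claim on your own machine, a rough micro-benchmark sketch is below; the absolute numbers depend entirely on your platform and array sizes, so treat this as an illustration rather than the chart's results.

```python
# Rough timing comparison: tanh vs. Elliott over a large array.
import timeit
import numpy as np

x = np.random.default_rng(0).normal(size=1_000_000)

t_tanh = timeit.timeit(lambda: np.tanh(x), number=100)
t_elliott = timeit.timeit(lambda: x / (1.0 + np.abs(x)), number=100)

print(f"tanh:    {t_tanh:.3f} s")
print(f"elliott: {t_elliott:.3f} s")  # usually cheaper, since it avoids exp()
```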

answered Feb 05 '23 by greeness