I have tried to train a simple backpropagation neural network on the XOR function. When I use tanh(x) as the activation function, with the derivative 1 - tanh(x)^2, I get the right result after about 1000 iterations. However, when I use g(x) = 1/(1 + e^(-x)) as the activation function, with the derivative g(x)*(1 - g(x)), I need about 50000 iterations to get the right result. What can be the reason?
Thank you.
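For reference, here is a minimal sketch (in Python/NumPy, not the original poster's code) of the two activations and the derivatives described above; the function names are illustrative:

import numpy as np

# Logistic (sigmoid) activation; its derivative is g(x) * (1 - g(x))
def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def logistic_deriv(x):
    g = logistic(x)
    return g * (1.0 - g)

# Hyperbolic tangent activation; its derivative is 1 - tanh(x)^2
def tanh_deriv(x):
    return 1.0 - np.tanh(x) ** 2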
Yes, what you observe is true. I have made similar observations when training neural networks with backpropagation. For the XOR problem I set up a 2x20x2 network; the logistic function takes 3000+ epochs to get the result below:
[0, 0] -> [0.049170633762142486]
[0, 1] -> [0.947292007836417]
[1, 0] -> [0.9451808598939389]
[1, 1] -> [0.060643862846171494]
Using tanh as the activation function, here is the result after 800 epochs. tanh converges consistently faster than the logistic function.
[0, 0] -> [-0.0862215901296476]
[0, 1] -> [0.9777578145233919]
[1, 0] -> [0.9777632805205176]
[1, 1] -> [0.12637838259658932]
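As a rough illustration of this kind of experiment, here is a minimal sketch (not the code used for the results above; the simplified 2-20-1 architecture, learning rate, and epoch count are assumptions for the example) that lets you swap the hidden-layer activation:

import numpy as np

rng = np.random.default_rng(0)

# XOR inputs and targets
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_xor(act, act_deriv, hidden=20, lr=0.5, epochs=3000):
    # 2 -> hidden -> 1 network trained with plain batch gradient descent.
    # Hyperparameters are illustrative and may need tuning.
    W1 = rng.normal(scale=0.5, size=(2, hidden))
    b1 = np.zeros(hidden)
    W2 = rng.normal(scale=0.5, size=(hidden, 1))
    b2 = np.zeros(1)
    for _ in range(epochs):
        # forward pass
        z1 = X @ W1 + b1
        a1 = act(z1)
        out = logistic(a1 @ W2 + b2)          # logistic output so predictions stay in [0, 1]
        # backward pass for a squared-error loss
        d_out = (out - y) * out * (1.0 - out)
        d_hid = (d_out @ W2.T) * act_deriv(z1)
        W2 -= lr * (a1.T @ d_out)
        b2 -= lr * d_out.sum(axis=0)
        W1 -= lr * (X.T @ d_hid)
        b1 -= lr * d_hid.sum(axis=0)
    return out

# tanh hidden layer vs. logistic hidden layer
print(train_xor(np.tanh, lambda z: 1.0 - np.tanh(z) ** 2))
print(train_xor(logistic, lambda z: logistic(z) * (1.0 - logistic(z))))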
The two functions' shapes look like the plots below (credit: Efficient BackProp):
[Plot: the logistic function 1/(1 + e^(-x))]
[Plot: the tanh function, also known as the hyperbolic tangent]
It is easy to see that tanh is antisymmetric about the origin.
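One way to see the connection (a standard identity, not stated in the original answer): tanh is just the logistic function rescaled to the range (-1, 1) and centered at zero, since tanh(x) = 2*g(2x) - 1. A quick numeric check:

import numpy as np

def g(x):
    # logistic function
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-5, 5, 101)
print(np.allclose(np.tanh(x), 2 * g(2 * x) - 1))  # True: tanh is a shifted, scaled logistic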
According to Efficient BackProp:
Symmetric sigmoids such as tanh often converge faster than the standard logistic function.
Also, from the Wikipedia article on logistic regression:
Practitioners caution that sigmoidal functions which are antisymmetric about the origin (e.g. the hyperbolic tangent) lead to faster convergence when training networks with backpropagation.
See Efficient BackProp for more details on the intuition behind this.
See the Elliott activation for an alternative to tanh that is cheaper to compute. It is shown below as the black curve (the blue one is the original tanh).
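For reference, the symmetric Elliott activation is commonly defined as x / (1 + |x|), with derivative 1 / (1 + |x|)^2; treat the sketch below as an illustration of that common form rather than the exact definition used in the linked material:

import numpy as np

def elliott(x):
    # Same S-shape and (-1, 1) range as tanh, but only needs an
    # absolute value and a division instead of an exponential.
    return x / (1.0 + np.abs(x))

def elliott_deriv(x):
    return 1.0 / (1.0 + np.abs(x)) ** 2

x = np.linspace(-5, 5, 5)
print(elliott(x))
print(np.tanh(x))  # compare: similar shape, cheaper per evaluation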
Two things should stand out from the above chart. First, TANH usually needed fewer iterations to train than Elliott, so for an encoder the training accuracy after a given number of iterations is not as good with Elliott. However, notice the training times: Elliott completed its entire task, even with the extra iterations it had to do, in half the time of TANH. This is a huge improvement; in this case Elliott cuts the training time in half and delivers the same final training error. Although it takes more training iterations to get there, each iteration is so much faster that the total training time is still halved.