Neural Activation Functions - Difference between Logistic / Tanh / etc

I'm writing some basic neural network methods - specifically the activation functions - and have hit the limits of my rubbish knowledge of math. I understand the respective ranges (-1/1) (0/1) etc, but the varying descriptions and implementations have me confused.

Specifically sigmoid, logistic, bipolar sigmoid, tanh, etc.

Does sigmoid simply describe the shape of the function irrespective of range? If so, then is tanh a 'sigmoid function'?

I have seen 'bipolar sigmoid' compared against 'tanh' in a paper, however I have seen both functions implemented (in various libraries) with the same code:

(( 2/ (1 + Exp(-2 * n))) - 1). Are they exactly the same thing?

Likewise, I have seen logistic and sigmoid activations implemented with the same code:

( 1/ (1 + Exp(-1 * n))). Are these also equivalent?

Lastly, does it even matter that much in practice? I see on wiki a plot of very similar sigmoid functions - could any of these be used? Some look like they may be considerably faster to compute than others.

Satellite asked Aug 07 '12 13:08



2 Answers

Logistic function: e^x / (e^x + e^c)

Special ("standard") case of the logistic function: 1 / (1 + e^(-x))

Bipolar sigmoid: never heard of it.

Tanh: (e^x - e^(-x)) / (e^x + e^(-x))

Sigmoid usually refers to the shape (and limits), so yes, tanh is a sigmoid function. But in some contexts it refers specifically to the standard logistic function, so you have to be careful. And yes, you could use any sigmoid function and probably do just fine.

(( 2/ (1 + Exp(-2 * x))) - 1) is equivalent to tanh(x).
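The equivalence follows from multiplying the numerator and denominator of tanh by e^(-x), which gives (1 - e^(-2x)) / (1 + e^(-2x)) = 2 / (1 + e^(-2x)) - 1. A quick numerical check, as a minimal sketch (the function names `logistic` and `bipolar_sigmoid` are my own labels, not taken from any particular library):

```python
import math

def logistic(x):
    """Standard logistic function 1 / (1 + e^(-x)); outputs lie in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def bipolar_sigmoid(x):
    """The expression from the question: 2 / (1 + e^(-2x)) - 1; outputs lie in (-1, 1)."""
    return 2.0 / (1.0 + math.exp(-2.0 * x)) - 1.0

for x in [-3.0, -0.5, 0.0, 0.5, 3.0]:
    # Same value as the library tanh, up to floating-point rounding.
    assert math.isclose(bipolar_sigmoid(x), math.tanh(x), abs_tol=1e-12)
    # Equivalently, tanh is the standard logistic rescaled: tanh(x) = 2 * logistic(2x) - 1.
    assert math.isclose(math.tanh(x), 2.0 * logistic(2.0 * x) - 1.0, abs_tol=1e-12)
```

So the expression shared by the "bipolar sigmoid" and "tanh" implementations is one function, and it differs from the standard logistic only by a shift and scale of input and output.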

Beta answered Oct 13 '22 04:10


Generally the most important differences are: (a) smooth and continuously differentiable (like tanh and logistic) vs. step or truncated; (b) competitive vs. transfer; (c) sigmoid vs. radial; (d) symmetric (-1,+1) vs. asymmetric (0,1).

Generally the differentiable requirement is needed for hidden layers, and tanh is often recommended as being more balanced. For tanh, an output of 0 sits at the steepest point (highest gradient or gain) and is not a trap, while for logistic an output of 0 is the lowest point and a trap for anything pushing deeper into negative territory. Radial (basis) functions are about distance from a typical prototype and are good for convex circular regions around a neuron, while sigmoid functions are about separating linearly and are good for half-spaces. Many sigmoids are needed for a good approximation to a convex region, with circular/spherical regions being worst for sigmoids and best for radials.
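To make the gradient point concrete, here is a minimal sketch that prints each function's output and derivative at a few inputs. It uses the standard identities tanh'(x) = 1 - tanh(x)^2 and logistic'(x) = logistic(x) * (1 - logistic(x)); the helper names are hypothetical.

```python
import math

def logistic(x):
    """Standard logistic function, outputs in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def logistic_grad(x):
    """Derivative of the logistic: s * (1 - s), where s = logistic(x)."""
    s = logistic(x)
    return s * (1.0 - s)

def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2."""
    return 1.0 - math.tanh(x) ** 2

for x in [-6.0, -2.0, 0.0, 2.0]:
    print(f"x={x:+.1f}  tanh={math.tanh(x):+.4f} (grad {tanh_grad(x):.4f})  "
          f"logistic={logistic(x):.4f} (grad {logistic_grad(x):.4f})")
```

At x = 0 tanh outputs 0 with its maximum gradient of 1, whereas the logistic only approaches an output of 0 for very negative inputs, where its gradient is also close to 0.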

Generally, the recommendation is for tanh on the intermediate layers for +/- balance, and to suit the output layer to the task: boolean/dichotomous class decisions with threshold, logistic or competitive outputs (e.g. softmax, a self-normalizing multiclass generalization of logistic); regression tasks can even be linear. The output layer doesn't need to be continuously differentiable. The input layer should be normalized in some way, either to [0,1] or, better still, by standardization or normalization with demeaning to [-1,+1]. If you include a dummy input of 1 and then normalize so that ||x||_p = 1, you are dividing by a sum or length, and this magnitude information is retained in the dummy bias input rather than being lost. If you normalize over examples, this is technically interfering with your test data if you look at them, or they may be out of range if you don't. But with ||x||_2 normalization such variations or errors should approach the normal distribution if they are effects of natural distribution or error. This means that with high probability they won't exceed the original range (probably around 2 standard deviations) by more than a small factor (viz. such overrange values are regarded as outliers and not significant).
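One common way to standardize inputs without looking at the test data is to compute the mean and standard deviation on the training examples only and reuse them for the test examples. The sketch below is an illustration under that assumption, not the exact procedure described above; the function names and the toy data are made up.

```python
import statistics

def fit_standardizer(train_column):
    """Return (mean, std) computed from training values only."""
    mean = statistics.fmean(train_column)
    std = statistics.pstdev(train_column)
    # Guard against a constant feature, which would give std == 0.
    return mean, (std if std > 0 else 1.0)

def standardize(column, mean, std):
    """Demean and scale values with previously fitted statistics."""
    return [(v - mean) / std for v in column]

train = [2.0, 4.0, 6.0, 8.0]   # toy training feature
test = [5.0, 11.0]             # toy test feature, never used for fitting

mean, std = fit_standardizer(train)
train_z = standardize(train, mean, std)
test_z = standardize(test, mean, std)
print(train_z)
print(test_z)
```

The second test value lands a little beyond 2 standard deviations, which is the kind of mild overrange value the paragraph above treats as an outlier rather than a problem.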

So I recommend unbiased instance normalization or biased pattern standardization or both on the input layer (possibly with data reduction with SVD), tanh on the hidden layers, and a threshold function, logistic function or competitive function on the output for classification, but linear with unnormalized targets or perhaps logsig with normalized targets for regression.
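To illustrate why softmax counts as a multiclass generalization of the logistic function, this sketch checks that a two-class softmax over scores [x, c] reproduces the general logistic form e^x / (e^x + e^c), and the standard logistic when c = 0. The names are hypothetical, and the max-subtraction is a common numerical-stability trick I have added, not something from the answer.

```python
import math

def logistic(x):
    """Standard logistic function 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def softmax(scores):
    """Softmax over a list of scores; subtracting the max avoids overflow."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

for x in [-2.0, 0.0, 1.5]:
    p = softmax([x, 0.0])
    # With the second score fixed at 0, the first probability is logistic(x).
    assert math.isclose(p[0], logistic(x), abs_tol=1e-12)
    # Outputs are self-normalizing: they sum to 1.
    assert math.isclose(sum(p), 1.0, abs_tol=1e-12)

# General two-class case: softmax([x, c])[0] == e^x / (e^x + e^c).
x, c = 0.7, -1.2
assert math.isclose(softmax([x, c])[0],
                    math.exp(x) / (math.exp(x) + math.exp(c)),
                    abs_tol=1e-12)
```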

David M W Powers answered Oct 13 '22 04:10