While reviewing the sigmoid function used in neural nets, I found this equation at https://en.wikipedia.org/wiki/Softmax_function#Softmax_Normalization:
$$x'_i = \frac{1}{1 + e^{-(x_i - \mu)/\sigma}}$$
It differs from the standard sigmoid equation:
$$S(x) = \frac{1}{1 + e^{-x}}$$
The first equation involves the mean $\mu$ and standard deviation $\sigma$ (I hope I didn't misread the symbols), whereas the second appears to fold the mean subtraction and the division by the standard deviation into a constant, since they are the same for every term in a vector/matrix/tensor.
So when implementing the equations, I get different results.
With the 2nd equation (standard sigmoid function):
    import numpy as np
    def sigmoid(x):
        return 1. / (1 + np.exp(-x))
I get this output:
    >>> x = np.array([1, 2, 3])
    >>> print(sigmoid(x))
    [ 0.73105858 0.88079708 0.95257413]
I would have expected the first function to give similar results, but the gap between the first and second elements widens by quite a bit (though the ranking of the elements stays the same):
    def get_statistics(x):
        n = float(len(x))
        m = x.sum() / n                        # mean
        s2 = ((x - m) ** 2).sum() / (n - 1.)   # sample variance
        s = s2 ** 0.5                          # sample standard deviation
        return m, s2, s
    # Unpack in the same order as the return statement: mean, variance, std.
    m, s2, s = get_statistics(x)
    sigmoid_x1 = 1 / (1 + np.exp(-(x[0] - m) / s))
    sigmoid_x2 = 1 / (1 + np.exp(-(x[1] - m) / s))
    sigmoid_x3 = 1 / (1 + np.exp(-(x[2] - m) / s))
    sigmoid_x1, sigmoid_x2, sigmoid_x3
[out]:
    (0.2689414213699951, 0.5, 0.7310585786300049)
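The per-element computation above can be collapsed into a single vectorized call. This is only a minimal sketch assuming NumPy; the name `normalized_sigmoid` and its `ddof` parameter are illustrative choices, not something taken from the Wikipedia article. Passing `ddof=1` reproduces the `n - 1` denominator used in `get_statistics`.

```python
import numpy as np

def normalized_sigmoid(x, ddof=1):
    """Sigmoid of the z-scores of x (hypothetical helper name)."""
    x = np.asarray(x, dtype=float)
    # Centre on the mean and scale by the standard deviation.
    z = (x - x.mean()) / x.std(ddof=ddof)
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([1, 2, 3])
result = normalized_sigmoid(x)
print(result)
```

For `[1, 2, 3]` the z-scores are `[-1, 0, 1]`, so the result is just the plain sigmoid of `[-1, 0, 1]`. That is why the middle element comes out as exactly 0.5 and why the values are spread differently than `sigmoid([1, 2, 3])`.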
Possibly this has to do with the fact that the first equation involves some sort of softmax normalization, but if it were a generic softmax, the elements would need to sum to one, like this:
    def softmax(x):
        exp_x = np.exp(x)
        return exp_x / exp_x.sum()
[out]:
    >>> x = np.array([1, 2, 3])
    >>> print(softmax(x))
    [ 0.09003057 0.24472847 0.66524096]
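As a side note, the `softmax` above exponentiates the raw inputs, which overflows for large values. A common variant subtracts the maximum first, which leaves the result mathematically unchanged. The sketch below assumes NumPy, and the name `stable_softmax` is made up for illustration.

```python
import numpy as np

def stable_softmax(x):
    """Softmax that shifts by the maximum before exponentiating."""
    x = np.asarray(x, dtype=float)
    # Subtracting a constant from every element does not change the ratios
    # exp(x_i) / sum(exp(x_j)), but keeps every exponent <= 0.
    shifted = x - x.max()
    exp_x = np.exp(shifted)
    return exp_x / exp_x.sum()

probs = stable_softmax([1, 2, 3])
big = stable_softmax([1.0, 50.0, 5000.0])
print(probs, probs.sum())
print(big)
```

On `[1, 2, 3]` this should agree with the plain version, and it stays finite on inputs as large as 5000, where `np.exp(5000)` would overflow.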
But the outputs of the first equation don't sum to one, and they aren't the same as (or even similar to) those of the standard sigmoid equation. So the question is: why do the two equations give different results?
You have implemented the equations correctly. Your problem is that you are mixing up the definitions of softmax and sigmoid functions.
A softmax function is a way to normalize your data while making outliers "less interesting". It also "squashes" your input vector so that its elements sum to 1.
For your example:
> np.sum([ 0.09003057, 0.24472847, 0.66524096])
> 1.0
Softmax is simply a generalization of the logistic function to a whole vector, with the additional "constraint" that every element lies in the interval (0, 1) and that the elements sum to 1.0.
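One way to see the link between the two: a softmax over the two-element vector `[t, 0]` reduces to the logistic function of `t`. The check below is a small sketch assuming NumPy; the sample values in `ts` are arbitrary.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    exp_x = np.exp(x)
    return exp_x / exp_x.sum()

# softmax([t, 0])[0] = e^t / (e^t + e^0) = 1 / (1 + e^-t) = sigmoid(t)
ts = np.array([-2.0, 0.0, 1.0, 3.0])
two_class = np.array([softmax(np.array([t, 0.0]))[0] for t in ts])

print(two_class)
print(sigmoid(ts))
```

Both printed arrays should match up to floating-point rounding.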
The sigmoid function is a special case of logistic functions. It is just a real-valued, differentiable function with an S shape (it is its derivative that is bell-shaped). It is interesting for neural networks because it is rather easy to compute, non-linear, and bounded below and above (by 0 and 1), so your activation cannot diverge but runs into saturation if the input gets "too high" or "too low".
However, a sigmoid function does not ensure that the outputs for an input vector sum to 1.0.
In neural networks, sigmoid functions are frequently used as the activation function for single neurons, while a normalization function that looks at the whole vector (softmax, or the normalized sigmoid from your first equation) is rather applied to an entire layer, typically the output layer. Of these, only softmax guarantees that the layer adds up to 1. You just mixed up the sigmoid function (for single neurons) with the sigmoid/softmax normalization functions (for a whole layer).
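To make the distinction concrete, here is a quick check of what each function's outputs sum to on the question's vector `[1, 2, 3]`. It is a sketch assuming NumPy; `normalized_sigmoid` is an illustrative name, and it uses NumPy's default population standard deviation (`ddof=0`).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def normalized_sigmoid(x):
    # Sigmoid of the z-scores; np.std defaults to the population formula.
    return sigmoid((x - np.mean(x)) / np.std(x))

def softmax(x):
    exp_x = np.exp(x)
    return exp_x / exp_x.sum()

x = np.array([1.0, 2.0, 3.0])
sums = {
    "sigmoid": sigmoid(x).sum(),
    "normalized_sigmoid": normalized_sigmoid(x).sum(),
    "softmax": softmax(x).sum(),
}
for name, total in sums.items():
    print(f"{name}: {total:.4f}")
```

The plain sigmoid outputs sum to about 2.56 and the normalized sigmoid outputs to 1.5 (they are symmetric about 0.5 for this input). Only the softmax outputs sum to 1.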
EDIT: To clarify this, here is a simple example with outliers that demonstrates the behaviour of the two different functions.
Let's implement a sigmoid function:
    import numpy as np
    def s(x):
        return 1.0 / (1.0 + np.exp(-x))
And the normalized version (in little steps, making it easier to read):
    def sn(x):
        numerator = x - np.mean(x)
        denominator = np.std(x)
        fraction = numerator / denominator
        return 1.0 / (1.0 + np.exp(-fraction))
Now we define some measurements of something with huge outliers:
    measure = np.array([0.01, 0.2, 0.5, 0.6, 0.7, 1.0, 2.5, 5.0, 50.0, 5000.0])
Now we take a look at the results that s (sigmoid) and sn (normalized sigmoid) give:
> s(measure)
> array([ 0.50249998, 0.549834 , 0.62245933, 0.64565631, 0.66818777,
0.73105858, 0.92414182, 0.99330715, 1. , 1. ])
> sn(measure)
> array([ 0.41634425, 0.41637507, 0.41642373, 0.41643996, 0.41645618,
0.41650485, 0.41674821, 0.41715391, 0.42447515, 0.9525677 ])
As you can see, s only translates the values "one-by-one" via a logistic function, so the outliers are fully saturated at 0.993, 1.0 and 1.0. The distance between the other values varies.
When we look at sn, we see that the function actually normalized our values. Everything is now nearly identical, except for 0.95, which was the 5000.0.
What is this good for, and how should you interpret it?
Think of an output layer in a neural network: an activation of 5000.0 for one class on an output layer (compared to our other small values) means that the network is really sure that this is the "right" class for your given input. If you had used s there, you would end up with 0.99, 1.0 and 1.0 and would not be able to distinguish which class is the correct guess for your input.
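The saturation argument can be checked directly. The sketch below repeats the `s`, `sn` and `measure` definitions from the example so that it runs on its own; the variable names `plain`, `normalized` and `tied` are illustrative.

```python
import numpy as np

def s(x):
    return 1.0 / (1.0 + np.exp(-x))

def sn(x):
    return 1.0 / (1.0 + np.exp(-(x - np.mean(x)) / np.std(x)))

measure = np.array([0.01, 0.2, 0.5, 0.6, 0.7, 1.0, 2.5, 5.0, 50.0, 5000.0])

plain = s(measure)
normalized = sn(measure)

# With the plain sigmoid, 50.0 and 5000.0 both round to exactly 1.0 in
# double precision, so they can no longer be told apart.
tied = plain[-2] == plain[-1]
print("plain sigmoid ties the two largest values:", tied)

# The normalized version keeps 5000.0 clearly ahead of everything else.
print("largest normalized value at index:", np.argmax(normalized))
print("gap to the runner-up:", normalized[-1] - normalized[-2])
```

Because two entries of `plain` are identical, picking the maximum among them is arbitrary, whereas `normalized` still has a single, clearly separated maximum.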