I am using a Softmax activation function in the last layer of a neural network. But I have problems with a safe implementation of this function.
A naive implementation would be this one:
Vector y = mlp(x); // output of the neural network without softmax activation function
for(int f = 0; f < y.rows(); f++)
y(f) = exp(y(f));
y /= y.sum();
This does not work very well for > 100 hidden nodes because the y will be NaN
in many cases (if y(f) > 709, exp(y(f)) will return inf). I came up with this version:
Vector y = mlp(x); // output of the neural network without softmax activation function
for(int f = 0; f < y.rows(); f++)
y(f) = safeExp(y(f), y.rows());
y /= y.sum();
where safeExp
is defined as
double safeExp(double x, int div)
{
static const double maxX = std::log(std::numeric_limits<double>::max());
const double max = maxX / (double) div;
if(x > max)
x = max;
return std::exp(x);
}
This function limits the input of exp. In most of the cases this works but not in all cases and I did not really manage to find out in which cases it does not work. When I have 800 hidden neurons in the previous layer it does not work at all.
However, even if this worked I somehow "distort" the result of the ANN. Can you think of any other way to calculate the correct solution? Are there any C++ libraries or tricks that I can use to calculate the exact output of this ANN?
edit: The solution provided by Itamar Katz is:
Vector y = mlp(x); // output of the neural network without softmax activation function
double ymax = maximal component of y
for(int f = 0; f < y.rows(); f++)
y(f) = exp(y(f) - ymax);
y /= y.sum();
And it really is mathematically the same. In practice however, some small values become 0 because of the floating point precision. I wonder why nobody ever writes these implementation details down in textbooks.
The softmax function is used as the activation function in the output layer of neural network models that predict a multinomial probability distribution. That is, softmax is used as the activation function for multi-class classification problems where class membership is required on more than two class labels.
Softmax is an activation function that scales numbers/logits into probabilities. The output of a Softmax is a vector (say v ) with probabilities of each possible outcome. The probabilities in vector v sums to one for all possible outcomes or classes.
Calculates the softmax function. The softmax function is used in the activation function of the neural network. Softmax function σ(z)jσ(z)j=ezjK∑k=1ezkfor j=1,⋯,K. S o f t m a x f u n c t i o n σ ( z ) j σ ( z ) j = e z j ∑ k = 1 K e z k f o r j = 1 , ⋯ , K .
The common problem which can occur while applying softmax is the numeric stability problem, which means that the ∑j e^(z_j) may become very large due to the exponential and overflow error that may occur. This overflow error can be solved by subtracting each value of the array with its max value.
First go to log scale, i.e calculate log(y)
instead of y
. The log of the numerator is trivial. In order to calculate the log of the denominator, you can use the following 'trick': http://lingpipe-blog.com/2009/06/25/log-sum-of-exponentials/
I know it's already answered but I'll post here a step-by-step anyway.
put on log:
zj = wj . x + bj
oj = exp(zj)/sum_i{ exp(zi) }
log oj = zj - log sum_i{ exp(zi) }
Let m be the max_i { zi } use the log-sum-exp trick:
log oj = zj - log {sum_i { exp(zi + m - m)}}
= zj - log {sum_i { exp(m) exp(zi - m) }},
= zj - log {exp(m) sum_i {exp(zi - m)}}
= zj - m - log {sum_i { exp(zi - m)}}
the term exp(zi-m) can suffer underflow if m is much greater than other z_i, but that's ok since this means z_i is irrelevant on the softmax output after normalization. final results is:
oj = exp (zj - m - log{sum_i{exp(zi-m)}})
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With