Implementation of a softmax activation function for neural networks

I am using a Softmax activation function in the last layer of a neural network, but I am having trouble with a numerically safe implementation of this function.

A naive implementation would be this one:

Vector y = mlp(x); // output of the neural network without softmax activation function
for(int f = 0; f < y.rows(); f++)
  y(f) = exp(y(f));
y /= y.sum();

This does not work well with more than about 100 hidden nodes, because y will be NaN in many cases (if y(f) > 709, exp(y(f)) returns inf, the sum becomes inf, and inf/inf gives NaN during normalization). I came up with this version:

Vector y = mlp(x); // output of the neural network without softmax activation function
for(int f = 0; f < y.rows(); f++)
  y(f) = safeExp(y(f), y.rows());
y /= y.sum();

where safeExp is defined as

double safeExp(double x, int div)
{
  // log(DBL_MAX), approximately 709.78
  static const double maxX = std::log(std::numeric_limits<double>::max());
  // clamp the argument so that exp(x) is at most DBL_MAX^(1/div)
  const double max = maxX / (double) div;
  if(x > max)
    x = max;
  return std::exp(x);
}

This function limits the input of exp. In most cases this works, but not in all, and I could not really figure out in which cases it fails. With 800 hidden neurons in the previous layer it does not work at all.

However, even when this works, I somehow "distort" the result of the ANN. Can you think of any other way to calculate the correct solution? Are there any C++ libraries or tricks I can use to calculate the exact output of this ANN?

edit: The solution provided by Itamar Katz is:

Vector y = mlp(x); // output of the neural network without softmax activation function
double ymax = y.maxCoeff(); // maximal component of y (maxCoeff() assuming an Eigen-style vector type)
for(int f = 0; f < y.rows(); f++)
  y(f) = exp(y(f) - ymax);
y /= y.sum();

And it really is mathematically the same. In practice, however, some small values become 0 because of floating-point precision. I wonder why nobody ever writes these implementation details down in textbooks.
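
To illustrate the precision point, here is a minimal standalone sketch showing roughly where std::exp underflows to zero in double precision (the exact threshold depends on how the platform handles denormals):

#include <cmath>
#include <cstdio>
#include <limits>

int main()
{
  // The smallest positive double is about 4.9e-324, so exp underflows to 0
  // once its argument drops below log(denorm_min), roughly -744.
  std::printf("log(denorm_min) = %g\n",
              std::log(std::numeric_limits<double>::denorm_min()));
  std::printf("exp(-700) = %g\n", std::exp(-700.0)); // ~9.9e-305, still representable
  std::printf("exp(-750) = %g\n", std::exp(-750.0)); // 0: underflow
  return 0;
}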

asked Mar 28 '12 by alfa

People also ask

What is softmax function in neural network?

The softmax function is used as the activation function in the output layer of neural network models that predict a multinomial probability distribution. That is, softmax is used as the activation function for multi-class classification problems where class membership is required across more than two class labels.

What does softmax activation do?

Softmax is an activation function that scales numbers/logits into probabilities. The output of a Softmax is a vector (say v) with probabilities of each possible outcome. The probabilities in vector v sum to one over all possible outcomes or classes.

How is Softmax activation function calculated?

The softmax function is used as the activation function of the neural network and is calculated as σ(z)_j = exp(z_j) / sum_{k=1..K} exp(z_k), for j = 1, ⋯, K.

What is the problem of implementing the softmax function?

The common problem that can occur while applying softmax is numeric stability: the sum ∑_j e^(z_j) may become very large due to the exponentials, causing an overflow error. This overflow can be avoided by subtracting the maximum value of the array from each element before exponentiating.


2 Answers

First go to log scale, i.e. calculate log(y) instead of y. The log of the numerator is trivial. To calculate the log of the denominator, you can use the following 'trick': http://lingpipe-blog.com/2009/06/25/log-sum-of-exponentials/
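
A minimal sketch of that trick, assuming the pre-activations are available as a plain std::vector<double> (logSumExp is just an illustrative helper name): factoring out the maximum keeps every exponent non-positive, so nothing overflows.

#include <algorithm>
#include <cmath>
#include <vector>

// Computes log(sum_i exp(z_i)) without overflow by factoring out the maximum.
double logSumExp(const std::vector<double>& z)
{
  const double m = *std::max_element(z.begin(), z.end());
  double sum = 0.0;
  for(double zi : z)
    sum += std::exp(zi - m); // each term lies in (0, 1], so the sum cannot overflow
  return m + std::log(sum);
}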

answered Sep 21 '22 by Itamar Katz

I know it's already answered, but I'll post a step-by-step derivation here anyway.

Take the log:

zj = wj . x + bj
oj = exp(zj)/sum_i{ exp(zi) }
log oj = zj - log sum_i{ exp(zi) }

Let m = max_i { zi } and use the log-sum-exp trick:

log oj = zj - log {sum_i { exp(zi + m - m)}}
   = zj - log {sum_i { exp(m) exp(zi - m) }},
   = zj - log {exp(m) sum_i {exp(zi - m)}}
   = zj - m - log {sum_i { exp(zi - m)}}

The term exp(zi - m) can suffer underflow if m is much greater than some other z_i, but that's OK, since it means that z_i contributes almost nothing to the softmax output after normalization. The final result is:

oj = exp (zj - m - log{sum_i{exp(zi-m)}})
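
Putting the final formula into code, a minimal sketch (again assuming a plain std::vector<double> of pre-activations z; the softmax helper below is illustrative, not the asker's Vector type):

#include <algorithm>
#include <cmath>
#include <vector>

// Numerically stable softmax: o_j = exp(z_j - m - log(sum_i exp(z_i - m)))
std::vector<double> softmax(const std::vector<double>& z)
{
  const double m = *std::max_element(z.begin(), z.end());
  double sum = 0.0;
  for(double zi : z)
    sum += std::exp(zi - m);
  const double logSum = std::log(sum);

  std::vector<double> o(z.size());
  for(std::size_t j = 0; j < z.size(); j++)
    o[j] = std::exp(z[j] - m - logSum); // never overflows; may underflow to 0 for very small z[j]
  return o;
}
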
answered Sep 18 '22 by Khanis Rok