Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Avoiding numerical overflow when calculating the value AND gradient of the Logistic loss function

I am currently trying to implement a machine learning algorithm that involves the logistic loss function in MATLAB. Unfortunately, I am having some trouble due to numerical overflow.

In general, for a given an input s, the value of the logistic function is:

 log(1 + exp(s))

and the slope of the logistic loss function is:

 exp(s)./(1 + exp(s)) = 1./(1 + exp(-s))

In my algorithm, the value of s = X*beta. Here X is a matrix with N data points and P features per data point (i.e. size(X)=[N,P]) and beta is a vector of P coefficients for each feature such that size(beta)=[P 1].

I am specifically interested in calculating the average value and gradient of the Logistic function for given value of beta.

The average value of the Logistic function w.r.t to a value of beta is:

 L = 1/N * sum(log(1+exp(X*beta)),1)

The average value of the slope of the Logistic function w.r.t. to a value of b is:

 dL = 1/N * sum((exp(X*beta)./(1+exp(X*beta))' X, 1)'

Note that size(dL) = [P 1].

My issue is that these expressions keep producing numerical overflows. The problem effectively comes from the fact that exp(s)=Inf when s>1000 and exp(s)=0 when s<-1000.

I am looking for a solution such that s can take on any value in floating point arithmetic. Ideally, I would also really appreciate a solution that allows me to evaluate the value and gradient in a vectorized / efficient way.

like image 964
Berk U. Avatar asked Nov 20 '13 01:11

Berk U.


2 Answers

How about the following approximations:

– For computing L, if s is large, then exp(s) will be much larger than 1:

1 + exp(s) ≅ exp(s)

and consequently

log(1 + exp(s)) ≅ log(exp(s)) = s.

If s is small, then using the Taylor series of exp()

exp(s) ≅ 1 + s

and using the Taylor series of log()

log(1 + exp(s)) ≅ log(2 + s) ≅ log(2) + s / 2.

– For computing dL, for large s

exp(s) ./ (1 + exp(s)) ≅ 1

and for small s

exp(s) ./ (1 + exp(s)) ≅ 1/2 + s / 4.

– The code to compute L could look for example like this:

s = X*beta;
l = log(1+exp(s));
ind = isinf(l);
l(ind) = s(ind);
ind = (l == 0);
l(ind) = log(2) + s(ind) / 2;
L = 1/N * sum(l,1)
like image 79
A. Donda Avatar answered Nov 11 '22 03:11

A. Donda


I found a good article about this problem.

Cutting through a lot of words, we can simplify the argument to stating that the original expression

log(1 + exp(s)) 

can be rewritten as

log(exp(s)*(exp(-s) + 1))
= log(exp(s)) + log(exp(-s) + 1)
= s + log(exp(-s) + 1)

This stops overflow from occurring - it doesn't prevent underflow, but by the time that occurs, you have your answer (namely, s). You can't just use this instead of the original, since it will still give you problems. However, we now have the basis for a function that can be written that will be accurate and won't produce over/underflow:

function LL = logistic(s)
if s<0
  LL = log(1 + exp(s));
else
  LL = s + logistic(-s);

I think this maintains reasonably good accuracy.

EDIT now to the meat of your question - making this vectorized, and allowing the calculation of the slope as well. Let's take these one at a time:

function LL = logisticVec(s)
  LL = zeros(size(s));
  LL(s<0) = log(1 + exp(s(s<0)));
  LL(s>=0) = s(s>=0) + log(1 + exp(-s(s>=0)));

To obtain the average you wanted:

L = logisticVec(X*beta) / N;

The slope is a little bit trickier; note I believe you may have a typo in your expression (missing a multiplication sign).

dL/dbeta = sum(X * exp(X*beta) ./ (1 + exp(X*beta))) / N;

If we divide top and bottom by exp(X*beta) we get

dL = sum(X ./ (exp(-X*beta) + 1)) / N;

Once again, the overflow has gone away and we are left with underflow - but since the underflowed value has 1 added to it, the error this creates is insignificant.

like image 5
Floris Avatar answered Nov 11 '22 04:11

Floris