To keep the case simple and intuitive, I will use binary (0 and 1) classification for illustration.
Loss function
loss = np.multiply(np.log(predY), Y) + np.multiply((1 - Y), np.log(1 - predY))  # binary cross entropy
cost = -np.sum(loss) / m  # m = number of examples in the batch
Probability of Y
predY is computed by applying the sigmoid to the logits; the logits can be thought of as the output of the neural network before it reaches the classification step:
predY = sigmoid(logits)  # binary case

def sigmoid(X):
    return 1 / (1 + np.exp(-X))
Problem
Suppose we are running a feed-forward net.
Inputs: [3, 5], where 3 is the number of examples and 5 is the feature size (fabricated data)
Num of hidden units: 100 (only 1 hidden layer)
Iterations: 10000
Such an arrangement is set up to overfit. When it is overfitting, we can perfectly predict the probability for the training examples; in other words, the sigmoid outputs exactly 1 or 0, because the exponential inside it overflows or underflows in floating point. If this is the case, we would have np.log(0), which is undefined. How do you usually handle this issue?
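A minimal sketch of the failure mode (the logit value is fabricated; an overfit net easily produces scores this large):

import numpy as np

predY = 1 / (1 + np.exp(-40.0))  # sigmoid of a fabricated, extreme logit from an overfit net
print(predY)                     # exactly 1.0: 1 + exp(-40) rounds to 1.0 in float64
print(np.log(1 - predY))         # -inf, with a RuntimeWarning: divide by zero encountered in log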
It turns out that the formulation of cross-entropy between two probability distributions coincides with the negative log-likelihood. However, as implemented in PyTorch, CrossEntropyLoss expects raw logits, while NLLLoss expects log-probabilities.
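A small sketch of that distinction (assuming PyTorch is available; the tensor values are fabricated): the two losses agree once the logits are passed through log-softmax.

import torch
import torch.nn as nn
import torch.nn.functional as F

logits = torch.tensor([[2.0, -1.0], [0.5, 1.5]])  # raw, unnormalized scores
target = torch.tensor([0, 1])                     # true class indices

ce = nn.CrossEntropyLoss()(logits, target)                 # expects raw logits
nll = nn.NLLLoss()(F.log_softmax(logits, dim=1), target)   # expects log-probabilities
print(torch.allclose(ce, nll))                             # True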
Cross-entropy loss, or log loss, measures the performance of a classification model whose output is a probability value between 0 and 1. Cross-entropy loss increases as the predicted probability diverges from the actual label.
Equation 1: Definition of entropy, H(p) = -Σ_x p(x) log₂ p(x). Note the log is calculated to base 2. The reason for the negative sign: log(p(x)) < 0 for all p(x) in (0, 1). p(x) is a probability distribution, and therefore its values must range between 0 and 1.
Binary cross entropy compares each of the predicted probabilities to the actual class output, which can be either 0 or 1.
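A quick worked example of that behaviour (fabricated probabilities for a true label of 1):

import numpy as np

y = 1                        # actual class output
for p in [0.9, 0.5, 0.01]:   # predicted probability of class 1
    bce = -(y * np.log(p) + (1 - y) * np.log(1 - p))
    print(p, round(bce, 3))  # 0.105, 0.693, 4.605: loss grows as p diverges from y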
How do you usually handle this issue?
Add a small number (something like 1e-15) to predY
- such a tiny offset doesn't noticeably change the predictions, and it solves the log(0) issue. Since the loss also contains log(1 - predY), the prediction should be kept away from 1 as well, e.g. by clipping predY to [eps, 1 - eps], as in the sketch below.
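A minimal sketch of that fix; clipping (rather than only adding eps) keeps both log(predY) and log(1 - predY) finite. The helper name and the eps value are illustrative choices, not a fixed convention:

import numpy as np

def safe_cross_entropy(predY, Y, eps=1e-15):
    predY = np.clip(predY, eps, 1 - eps)  # predY == 0 breaks log(predY); predY == 1 breaks log(1 - predY)
    loss = np.multiply(np.log(predY), Y) + np.multiply((1 - Y), np.log(1 - predY))
    return -np.sum(loss) / Y.shape[0]

Y = np.array([1.0, 0.0, 1.0])
predY = np.array([1.0, 0.0, 0.9])    # saturated predictions from an overfit net
print(safe_cross_entropy(predY, Y))  # finite cost instead of nan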
By the way, if your algorithm outputs exact zeros and ones, it might be useful to check the histogram of the returned probabilities (a rough sketch follows below) - when an algorithm is that sure that something is happening, it can be a sign of overfitting.
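A rough sketch of that check (fabricated probabilities standing in for a model's output):

import numpy as np

probs = np.array([0.999, 1.0, 0.998, 0.001, 0.0, 0.997, 0.002, 0.61])  # hypothetical outputs
hist, edges = np.histogram(probs, bins=10, range=(0.0, 1.0))
print(hist)  # nearly all mass in the first and last bins: the model is very (over)confident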