I understand that binary cross-entropy is the same as categorical cross-entropy in case of two classes.
Further, it is clear to me what softmax is.
Therefore, I see that categorical cross-entropy just penalizes the one component (probability) that should be 1.
But why can't or shouldn't I use binary cross-entropy on a one-hot vector?
Normal case for single-label, mutually exclusive multiclass classification:
################
pred = [0.1 0.3 0.2 0.4]
label (one hot) = [0 1 0 0]
cost function: categorical cross-entropy
= sum(label * -log(pred)) // only the component labelled 1 contributes
= 0.523 (log base 10; 1.204 with the natural log)
Why not that?
################
pred = [0.1 0.3 0.2 0.4]
label (one hot) = [0 1 0 0]
cost function: binary cross-entropy
= sum(- label * log(pred) - (1 - label) * log(1 - pred))
= -log(0.3) - log(1-0.1) - log(1-0.2) - log(1-0.4)
= 0.887 (log base 10; 2.043 with the natural log)
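The two figures above, 0.523 and 0.887, come out only with base-10 logarithms; most libraries use the natural log, which gives larger values. The following is a minimal sketch in plain Python that reproduces both computations in both bases. The function names `categorical_ce` and `elementwise_binary_ce` are made up for this illustration and are not from any library.

```python
import math

pred = [0.1, 0.3, 0.2, 0.4]
label = [0, 1, 0, 0]

def categorical_ce(label, pred, log=math.log):
    # Only the component whose label is 1 contributes to the sum.
    return sum(-y * log(p) for y, p in zip(label, pred))

def elementwise_binary_ce(label, pred, log=math.log):
    # Every component contributes: -log(p) where the label is 1,
    # -log(1 - p) where the label is 0.
    return sum(-y * log(p) - (1 - y) * log(1 - p) for y, p in zip(label, pred))

cce_10 = categorical_ce(label, pred, log=math.log10)
bce_10 = elementwise_binary_ce(label, pred, log=math.log10)
cce_e = categorical_ce(label, pred)          # natural log
bce_e = elementwise_binary_ce(label, pred)   # natural log

print(round(cce_10, 3), round(bce_10, 3))  # 0.523 0.887
print(round(cce_e, 3), round(bce_e, 3))    # 1.204 2.043
```

The difference between the two losses is exactly the three `-log(1 - p)` terms for the components labelled 0.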
I see that in binary cross-entropy, zero is itself a target class and corresponds to the following one-hot encoding:
target class zero 0 -> [1 0]
target class one 1 -> [0 1]
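The two-class equivalence stated at the top can be checked numerically. Below is a small sketch, assuming a scalar prediction `p` for class one and the encoding listed above; `binary_ce` and `categorical_ce` are illustrative helper names, not library functions.

```python
import math

def binary_ce(y, p):
    # y is the scalar target (0 or 1), p is the predicted probability of class one.
    return -y * math.log(p) - (1 - y) * math.log(1 - p)

def categorical_ce(label, pred):
    # Sum over classes; only the component labelled 1 is non-zero.
    return sum(-t * math.log(q) for t, q in zip(label, pred))

p = 0.3                       # hypothetical predicted probability of class one
two_class_pred = [1 - p, p]   # the same prediction as a 2-component distribution

results = []
for y in (0, 1):
    one_hot = [1 - y, y]      # 0 -> [1, 0], 1 -> [0, 1]
    results.append((binary_ce(y, p), categorical_ce(one_hot, two_class_pred)))

for bce, cce in results:
    print(math.isclose(bce, cce))  # True for both targets
```

The match depends on the two predicted components summing to one, which is what ties a single scalar `p` to a two-class distribution.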
In summary: why do we only sum the negative log-likelihood for the class that should be 1? Why don't we also penalize the probabilities assigned to the should-be-zero (not-that-class) classes?
If one applies binary cross-entropy to a one-hot vector, the probabilities assigned to the expected-zero labels would be penalized too.
See my answer on a similar question. In short, the binary cross-entropy formula doesn't make sense for a one-hot vector. You can either apply softmax cross-entropy for two or more classes, or use a vector of (independent) probabilities in label, depending on the task.
But why can't or shouldn't I use binary cross-entropy on a one-hot vector?
What you compute is binary cross-entropy of 4 independent features:
pred = [0.1 0.3 0.2 0.4]
label = [0 1 0 0]
The model predicts that the first feature is on with 10% probability, the second feature is on with 30% probability, and so on. The target label is interpreted this way: all features are off except the second one. Note that [1, 1, 1, 1] is a perfectly valid label as well, i.e. the label need not be a one-hot vector, and pred=[0.5, 0.8, 0.7, 0.1] is a valid prediction, i.e. the sum doesn't have to equal one.
In other words, your computation is valid, but for a completely different problem: multi-label non-exclusive binary classification.
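To make the distinction concrete, here is a rough sketch in plain Python. It assumes the usual pairing of a sigmoid per output for the non-exclusive case and a softmax over all outputs for the exclusive case. The `logits` values are hypothetical, and the helper names are chosen for this example only.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(zs):
    m = max(zs)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

def multi_label_bce(label, pred):
    # One independent binary cross-entropy term per feature.
    return sum(-y * math.log(p) - (1 - y) * math.log(1 - p)
               for y, p in zip(label, pred))

logits = [2.0, -1.0, 0.5, 0.0]  # hypothetical raw model outputs

independent = [sigmoid(z) for z in logits]  # each in (0, 1), no constraint on the sum
exclusive = softmax(logits)                 # forms a distribution over the classes

sum_independent = sum(independent)
sum_exclusive = sum(exclusive)

# The label and prediction from the text: neither is one-hot or sums to one,
# yet the multi-label loss is well defined.
loss = multi_label_bce([1, 1, 1, 1], [0.5, 0.8, 0.7, 0.1])

print(sum_exclusive, sum_independent, loss)
```

With softmax, raising one class probability necessarily lowers the others, so penalizing only the true class already pushes the rest toward zero. With independent sigmoids there is no such coupling, which is why each output needs its own term.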
See also the difference between softmax and sigmoid cross-entropy loss functions in TensorFlow.