When computing cross-entropy with a sigmoid activation function, there is a difference between
loss1 = -tf.reduce_sum(p*tf.log(q), 1)
loss2 = tf.reduce_sum(tf.nn.sigmoid_cross_entropy_with_logits(labels=p, logits=logit_q),1)
But they are the same when using the softmax activation function.
Following is the sample code:
    import tensorflow as tf

    # The session must be named `sess`, since `sess.run` is called below.
    sess = tf.InteractiveSession()
    p = tf.placeholder(tf.float32, shape=[None, 5])
    logit_q = tf.placeholder(tf.float32, shape=[None, 5])
    q = tf.nn.sigmoid(logit_q)
    sess.run(tf.global_variables_initializer())

    feed_dict = {p: [[0, 0, 0, 1, 0], [1, 0, 0, 0, 0]],
                 logit_q: [[0.2, 0.2, 0.2, 0.2, 0.2], [0.3, 0.3, 0.2, 0.1, 0.1]]}
    loss1 = -tf.reduce_sum(p * tf.log(q), 1).eval(feed_dict)
    loss2 = tf.reduce_sum(tf.nn.sigmoid_cross_entropy_with_logits(labels=p, logits=logit_q), 1).eval(feed_dict)

    print(p.eval(feed_dict), "\n", q.eval(feed_dict))
    print("\n", loss1, "\n", loss2)
While sigmoid_cross_entropy_with_logits works for soft binary labels (probabilities between 0 and 1), it can also be used for binary classification where the labels are hard.
Cross-entropy is commonly used in machine learning as a loss function. Cross-entropy is a measure from the field of information theory, building upon entropy and generally calculating the difference between two probability distributions.
Softmax is used for multi-class classification in logistic regression, whereas sigmoid is used for binary classification.
You're confusing the cross-entropy for binary and multi-class problems.
The formula that you use is correct, and it directly corresponds to tf.nn.softmax_cross_entropy_with_logits:
    -tf.reduce_sum(p * tf.log(q), axis=1)
p and q are expected to be probability distributions over N classes. In particular, N can be 2, as in the following example:
    p = tf.placeholder(tf.float32, shape=[None, 2])
    logit_q = tf.placeholder(tf.float32, shape=[None, 2])
    q = tf.nn.softmax(logit_q)

    feed_dict = {
        p: [[0, 1], [1, 0], [1, 0]],
        logit_q: [[0.2, 0.8], [0.7, 0.3], [0.5, 0.5]]
    }

    prob1 = -tf.reduce_sum(p * tf.log(q), axis=1)
    prob2 = tf.nn.softmax_cross_entropy_with_logits(labels=p, logits=logit_q)
    print(prob1.eval(feed_dict))  # [ 0.43748799 0.51301527 0.69314718]
    print(prob2.eval(feed_dict))  # [ 0.43748799 0.51301527 0.69314718]
Note that q is computed with tf.nn.softmax, i.e. it is a probability distribution. So this is still the multi-class cross-entropy formula, only for N = 2.
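The numbers in the example above can be checked without TensorFlow. The following is a minimal sketch in plain Python; the helper names softmax and multiclass_cross_entropy are made up for this illustration and are not TensorFlow APIs.

```python
import math

def softmax(logits):
    """Turn a list of logits into a probability distribution."""
    m = max(logits)  # subtract the max so exp() cannot overflow
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def multiclass_cross_entropy(p, logits):
    """-sum_i p_i * log(q_i), where q = softmax(logits)."""
    q = softmax(logits)
    return -sum(p_i * math.log(q_i) for p_i, q_i in zip(p, q))

# Same labels and logits as in the two-class example above.
labels = [[0, 1], [1, 0], [1, 0]]
logits = [[0.2, 0.8], [0.7, 0.3], [0.5, 0.5]]

losses = [multiclass_cross_entropy(p, z) for p, z in zip(labels, logits)]
print([round(v, 6) for v in losses])  # [0.437488, 0.513015, 0.693147]
```

Each row yields a single loss value, because the whole row is treated as one distribution over the classes.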
For binary cross-entropy, where each output is an independent yes/no decision, the correct formula is
p * -tf.log(q) + (1 - p) * -tf.log(1 - q)
Though mathematically it's a special case of the multi-class formula, the meaning of p and q is different. In the simplest case, p and q are each a single number: the probability of class A.
Important: don't get confused by the common p * -tf.log(q) part and the sum. Previously p was a one-hot vector; now it's a number, zero or one. The same goes for q: it was a probability distribution, and now it's a number (a probability).
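To see why the binary formula is a special case of the multi-class one, treat the single numbers p and q as the two-class distributions [p, 1 - p] and [q, 1 - q]. A small sketch in plain Python follows; the function names and the values p = 1, q = 0.7 are chosen only for illustration and do not come from the question.

```python
import math

def binary_cross_entropy(p, q):
    """p is a label (0 or 1, or a soft label in [0, 1]); q is a probability in (0, 1)."""
    return -p * math.log(q) - (1 - p) * math.log(1 - q)

def multiclass_cross_entropy(p_dist, q_dist):
    """-sum_i p_i * log(q_i) over two probability distributions."""
    return -sum(p_i * math.log(q_i) for p_i, q_i in zip(p_dist, q_dist))

p, q = 1, 0.7  # illustrative values only

binary = binary_cross_entropy(p, q)
two_class = multiclass_cross_entropy([p, 1 - p], [q, 1 - q])

# Both formulas give the same number for a single yes/no decision.
print(abs(binary - two_class) < 1e-12)  # True
```

The two agree only because the second class is filled in explicitly as 1 - p and 1 - q. Summing just -p * log(q) over independent sigmoid outputs drops the (1 - p) term.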
If p is a vector, each individual component is considered an independent binary classification. See this answer that outlines the difference between softmax and sigmoid functions in tensorflow. So the definition p = [0, 0, 0, 1, 0] doesn't mean a one-hot vector, but 5 different features, 4 of which are off and 1 is on. Likewise, q = [0.2, 0.2, 0.2, 0.2, 0.2] would mean that each of the 5 features is on with 20% probability (in the question these numbers are logits, so the actual q is sigmoid(0.2), about 0.55).
This explains the use of the sigmoid function before the cross-entropy: its goal is to squash the logit into the [0, 1] interval.
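As a quick illustration of the squashing, here is a minimal plain-Python sigmoid. The helper is written for this example only (it is not the TensorFlow implementation), and the sample logits other than 0.2 are arbitrary.

```python
import math

def sigmoid(x):
    """Logistic function: maps a real-valued logit to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))

# Negative logits land below 0.5, positive logits above it.
for logit in [-4.0, 0.0, 0.2, 4.0]:
    print(logit, round(sigmoid(logit), 4))
```

Note that a logit of 0.2 maps to roughly 0.55, so a logit and the probability it produces should not be read as the same number.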
The formula above still holds for multiple independent features, and that's exactly what tf.nn.sigmoid_cross_entropy_with_logits computes:
    p = tf.placeholder(tf.float32, shape=[None, 5])
    logit_q = tf.placeholder(tf.float32, shape=[None, 5])
    q = tf.nn.sigmoid(logit_q)

    feed_dict = {
        p: [[0, 0, 0, 1, 0], [1, 0, 0, 0, 0]],
        logit_q: [[0.2, 0.2, 0.2, 0.2, 0.2], [0.3, 0.3, 0.2, 0.1, 0.1]]
    }

    prob1 = -p * tf.log(q)
    prob2 = p * -tf.log(q) + (1 - p) * -tf.log(1 - q)
    prob3 = p * -tf.log(tf.sigmoid(logit_q)) + (1 - p) * -tf.log(1 - tf.sigmoid(logit_q))
    prob4 = tf.nn.sigmoid_cross_entropy_with_logits(labels=p, logits=logit_q)

    print(prob1.eval(feed_dict))
    print(prob2.eval(feed_dict))
    print(prob3.eval(feed_dict))
    print(prob4.eval(feed_dict))
You should see that the last three tensors are equal, while prob1 is only a part of the cross-entropy, so it contains the correct value only where p is 1:
    [[ 0.          0.          0.          0.59813893  0.        ]
     [ 0.55435514  0.          0.          0.          0.        ]]
    [[ 0.79813886  0.79813886  0.79813886  0.59813887  0.79813886]
     [ 0.5543552   0.85435522  0.79813886  0.74439669  0.74439669]]
    [[ 0.7981388   0.7981388   0.7981388   0.59813893  0.7981388 ]
     [ 0.55435514  0.85435534  0.7981388   0.74439663  0.74439663]]
    [[ 0.7981388   0.7981388   0.7981388   0.59813893  0.7981388 ]
     [ 0.55435514  0.85435534  0.7981388   0.74439663  0.74439663]]
Now it should be clear that taking a sum of -p * tf.log(q) along axis=1 doesn't make sense in this setting, though it'd be a valid formula in the multi-class case.
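The gap between the two losses in the question can be reproduced in plain Python. The sketch below is an illustration only: the helper names are hypothetical, and the second helper uses the rearranged expression max(x, 0) - x * z + log(1 + exp(-|x|)) that the TensorFlow documentation gives for sigmoid_cross_entropy_with_logits (x is the logit, z the label); the real implementation may differ in detail.

```python
import math

def sigmoid_ce_naive(label, logit):
    """p * -log(q) + (1 - p) * -log(1 - q), with q = sigmoid(logit)."""
    q = 1.0 / (1.0 + math.exp(-logit))
    return -label * math.log(q) - (1 - label) * math.log(1 - q)

def sigmoid_ce_stable(label, logit):
    """Equivalent form that avoids exp() overflow: max(x, 0) - x*z + log(1 + exp(-|x|))."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))

# Same labels and logits as in the question.
labels = [[0, 0, 0, 1, 0], [1, 0, 0, 0, 0]]
logits = [[0.2, 0.2, 0.2, 0.2, 0.2], [0.3, 0.3, 0.2, 0.1, 0.1]]

results = []
for p_row, x_row in zip(labels, logits):
    # Sum of -p * log(q) only: counts just the components where p is 1.
    partial = sum(p * sigmoid_ce_naive(1, x) for p, x in zip(p_row, x_row))
    # Sum of the full binary cross-entropy over all five components.
    full = sum(sigmoid_ce_stable(p, x) for p, x in zip(p_row, x_row))
    results.append((partial, full))
    print(round(partial, 4), round(full, 4))
```

The first number in each row corresponds to loss1 and the second to loss2 from the question. They differ because loss1 ignores the four components where p is 0.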