What is the difference between a sigmoid followed by the cross entropy and sigmoid_cross_entropy_with_logits in TensorFlow?

When computing cross-entropy with a sigmoid activation function, the following two losses give different results:

loss1 = -tf.reduce_sum(p*tf.log(q), 1)
loss2 = tf.reduce_sum(tf.nn.sigmoid_cross_entropy_with_logits(labels=p, logits=logit_q),1)

But they are the same with a softmax activation function.

Here is the sample code:

import tensorflow as tf  # TensorFlow 1.x API (placeholders and sessions)

sess = tf.InteractiveSession()
p = tf.placeholder(tf.float32, shape=[None, 5])
logit_q = tf.placeholder(tf.float32, shape=[None, 5])
q = tf.nn.sigmoid(logit_q)
sess.run(tf.global_variables_initializer())

feed_dict = {p: [[0, 0, 0, 1, 0], [1, 0, 0, 0, 0]],
             logit_q: [[0.2, 0.2, 0.2, 0.2, 0.2], [0.3, 0.3, 0.2, 0.1, 0.1]]}
loss1 = -tf.reduce_sum(p * tf.log(q), 1).eval(feed_dict)
loss2 = tf.reduce_sum(tf.nn.sigmoid_cross_entropy_with_logits(labels=p, logits=logit_q), 1).eval(feed_dict)

print(p.eval(feed_dict), "\n", q.eval(feed_dict))
print("\n", loss1, "\n", loss2)

asked Sep 19 '17 03:09

D.S.H.J

1 Answer

You're confusing the cross-entropy for binary and multi-class problems.

Multi-class cross-entropy

The formula that you use is correct and it directly corresponds to tf.nn.softmax_cross_entropy_with_logits:

-tf.reduce_sum(p * tf.log(q), axis=1)

p and q are expected to be probability distributions over N classes. In particular, N can be 2, as in the following example:

p = tf.placeholder(tf.float32, shape=[None, 2])
logit_q = tf.placeholder(tf.float32, shape=[None, 2])
q = tf.nn.softmax(logit_q)

feed_dict = {
  p: [[0, 1],
      [1, 0],
      [1, 0]],
  logit_q: [[0.2, 0.8],
            [0.7, 0.3],
            [0.5, 0.5]]
}

prob1 = -tf.reduce_sum(p * tf.log(q), axis=1)
prob2 = tf.nn.softmax_cross_entropy_with_logits(labels=p, logits=logit_q)
print(prob1.eval(feed_dict))  # [ 0.43748799  0.51301527  0.69314718]
print(prob2.eval(feed_dict))  # [ 0.43748799  0.51301527  0.69314718]

Note that q is the output of tf.nn.softmax, i.e. a probability distribution. So this is still the multi-class cross-entropy formula, only with N = 2.
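To make the multi-class formula concrete without a TensorFlow session, here is a minimal pure-Python sketch of the same computation. The helper names softmax and multiclass_cross_entropy are mine, not TensorFlow functions, and the data is the N = 2 example from this answer.

```python
import math

def softmax(logits):
    """Turn a list of logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def multiclass_cross_entropy(p, logits):
    """-sum(p * log(q)) with q = softmax(logits), for a single example."""
    q = softmax(logits)
    return -sum(p_i * math.log(q_i) for p_i, q_i in zip(p, q))

labels = [[0, 1], [1, 0], [1, 0]]
logits = [[0.2, 0.8], [0.7, 0.3], [0.5, 0.5]]

losses = [multiclass_cross_entropy(p, x) for p, x in zip(labels, logits)]
print(losses)  # should agree with the prob1 / prob2 values printed by TensorFlow
```

Because p is one-hot, only the term for the true class survives the sum, which is why a single -log(q) per row is enough here.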

Binary cross-entropy

This time the correct formula is

p * -tf.log(q) + (1 - p) * -tf.log(1 - q)

Though mathematically it's a special case of the multi-class formula, the meaning of p and q is different. In the simplest case, p and q are each a single number, corresponding to the probability of class A.

Important: Don't get confused by the common p * -tf.log(q) part and the sum. Previously p was a one-hot vector; now it's a number, zero or one. The same goes for q: it was a probability distribution, now it's a number (a probability).
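A small sketch of the scalar case, assuming p is a 0/1 label and q a probability strictly between 0 and 1 (binary_cross_entropy is a hypothetical helper name, not a TensorFlow function):

```python
import math

def binary_cross_entropy(p, q):
    """p is the label (0 or 1); q is the predicted probability that the label is 1."""
    return p * -math.log(q) + (1 - p) * -math.log(1 - q)

# Same prediction q = 0.8, two different labels.
loss_on = binary_cross_entropy(1, 0.8)   # label is 1: only the -log(q) term is non-zero
loss_off = binary_cross_entropy(0, 0.8)  # label is 0: only the -log(1 - q) term is non-zero

print(loss_on, loss_off)
```

The second call is the case that -p * log(q) alone silently scores as zero: a confident prediction of "on" for a label that is "off" should be penalized, and only the (1 - p) term does that.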

If p is a vector, each individual component is considered an independent binary classification. See this answer that outlines the difference between softmax and sigmoid functions in TensorFlow. So the definition p = [0, 0, 0, 1, 0] doesn't mean a one-hot vector, but 5 different features, 4 of which are off and 1 is on. The definition q = [0.2, 0.2, 0.2, 0.2, 0.2] means that each of the 5 features is on with 20% probability.

This explains the use of the sigmoid function before the cross-entropy: its goal is to squash the logit into the [0, 1] interval.
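The contrast with softmax can be seen on the first row of logits from the question. This is only an illustrative sketch in plain Python; the variable names q_sigmoid and q_softmax are mine.

```python
import math

def sigmoid(x):
    """Squash a single logit into the (0, 1) interval."""
    return 1.0 / (1.0 + math.exp(-x))

logits = [0.2, 0.2, 0.2, 0.2, 0.2]

# Sigmoid treats every component on its own: five separate probabilities.
q_sigmoid = [sigmoid(x) for x in logits]

# Softmax couples the components: one distribution over five classes.
exps = [math.exp(x) for x in logits]
q_softmax = [e / sum(exps) for e in exps]

print(q_sigmoid, sum(q_sigmoid))  # each value is in (0, 1), but the sum is not 1
print(q_softmax, sum(q_softmax))  # the values sum to 1 (up to rounding)
```

Since the sigmoid outputs do not form a distribution, there is no reason for a formula that assumes one (the multi-class sum) to give a meaningful number on them.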

The formula above still holds for multiple independent features, and that's exactly what tf.nn.sigmoid_cross_entropy_with_logits computes:

p = tf.placeholder(tf.float32, shape=[None, 5])
logit_q = tf.placeholder(tf.float32, shape=[None, 5])
q = tf.nn.sigmoid(logit_q)

feed_dict = {
  p: [[0, 0, 0, 1, 0],
      [1, 0, 0, 0, 0]],
  logit_q: [[0.2, 0.2, 0.2, 0.2, 0.2],
            [0.3, 0.3, 0.2, 0.1, 0.1]]
}

prob1 = -p * tf.log(q)  # only the "label is on" half of the formula
prob2 = p * -tf.log(q) + (1 - p) * -tf.log(1 - q)
prob3 = p * -tf.log(tf.sigmoid(logit_q)) + (1-p) * -tf.log(1-tf.sigmoid(logit_q))
prob4 = tf.nn.sigmoid_cross_entropy_with_logits(labels=p, logits=logit_q)
print(prob1.eval(feed_dict))
print(prob2.eval(feed_dict))
print(prob3.eval(feed_dict))
print(prob4.eval(feed_dict))

You should see that the last three tensors are equal, while prob1 is only one part of the cross-entropy, so it contains the correct value only where p is 1:

[[ 0.          0.          0.          0.59813893  0.        ]
 [ 0.55435514  0.          0.          0.          0.        ]]
[[ 0.79813886  0.79813886  0.79813886  0.59813887  0.79813886]
 [ 0.5543552   0.85435522  0.79813886  0.74439669  0.74439669]]
[[ 0.7981388   0.7981388   0.7981388   0.59813893  0.7981388 ]
 [ 0.55435514  0.85435534  0.7981388   0.74439663  0.74439663]]
[[ 0.7981388   0.7981388   0.7981388   0.59813893  0.7981388 ]
 [ 0.55435514  0.85435534  0.7981388   0.74439663  0.74439663]]
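As a side note on why the built-in op takes logits rather than probabilities: as far as I can tell from the TensorFlow documentation, tf.nn.sigmoid_cross_entropy_with_logits evaluates the rearranged expression max(x, 0) - x * z + log(1 + exp(-abs(x))) for logits x and labels z, which avoids taking the log of a value that has rounded to zero. The plain-Python sketch below compares that expression with the sigmoid-then-log version; naive_loss and stable_loss are hypothetical names, and the code only mirrors the documented formula, not the actual TensorFlow implementation.

```python
import math

def naive_loss(z, x):
    """Sigmoid first, then the binary cross-entropy formula."""
    q = 1.0 / (1.0 + math.exp(-x))
    return z * -math.log(q) + (1 - z) * -math.log(1 - q)

def stable_loss(z, x):
    """Rearranged form that works on the logit directly."""
    return max(x, 0.0) - x * z + math.log1p(math.exp(-abs(x)))

# For moderate logits the two versions agree.
for z, x in [(0, 0.2), (1, 0.2), (1, 0.3), (0, 0.3), (0, 0.1)]:
    print(z, x, naive_loss(z, x), stable_loss(z, x))

# For a large logit, sigmoid(x) rounds to exactly 1.0 in floating point,
# so the naive version ends up calling log(0) and fails.
try:
    naive_loss(0, 40.0)
    naive_failed = False
except ValueError:
    naive_failed = True

print(naive_failed, stable_loss(0, 40.0))
```

So the two approaches are the same quantity mathematically; the fused op is simply the safer way to compute it.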

Now it should be clear that taking a sum of -p * tf.log(q) along axis=1 doesn't make sense in this setting, though it would be a valid formula in the multi-class case.
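To tie this back to the two losses in the question, here is a plain-Python sketch that recomputes both row sums on the question's data. The names loss1 and loss2 follow the question; everything else is illustrative.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

p_rows = [[0, 0, 0, 1, 0], [1, 0, 0, 0, 0]]
logit_rows = [[0.2, 0.2, 0.2, 0.2, 0.2], [0.3, 0.3, 0.2, 0.1, 0.1]]

loss1, loss2 = [], []
for p, logits in zip(p_rows, logit_rows):
    q = [sigmoid(x) for x in logits]
    # loss1: multi-class style sum, keeps only the components where p is 1
    loss1.append(sum(-p_i * math.log(q_i) for p_i, q_i in zip(p, q)))
    # loss2: sum of the full binary cross-entropy over all five components
    loss2.append(sum(p_i * -math.log(q_i) + (1 - p_i) * -math.log(1 - q_i)
                     for p_i, q_i in zip(p, q)))

print(loss1)  # roughly [0.5981, 0.5544]
print(loss2)  # roughly [3.7907, 3.6956]
```

The gap between the two is exactly the contribution of the four "off" components per row, which loss1 ignores.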

answered Oct 04 '22 02:10

Maxim

Tags: machine-learning, tensorflow, classification, cross-entropy, sigmoid