How to choose cross-entropy loss in TensorFlow?

Q: Is higher or lower cross-entropy better?

Lower is better. Cross-entropy is built from the negative logarithms of predicted probabilities, so a high value indicates a poor model and a low value indicates a good one.

Q: What is cross-entropy loss function most suitable for?

Cross-entropy is widely used as a loss function when optimizing classification models. Two examples that you may encounter include the logistic regression algorithm (a linear classification algorithm), and artificial neural networks that can be used for classification tasks.

Q: What is cross-entropy in Tensorflow?

Cross entropy can be used to define a loss function (cost function) in machine learning and optimization. It is defined on probability distributions, not single values. It works for classification because classifier output is (often) a probability distribution over class labels.
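To make "defined on probability distributions" concrete, here is a minimal plain-Python sketch of the cross-entropy between a target distribution and a predicted one. The helper name cross_entropy and the example numbers are illustrative assumptions, not part of TensorFlow.

```python
import math

def cross_entropy(p_true, q_pred):
    """H(p, q) = -sum_i p_i * log(q_i); terms with p_i == 0 contribute nothing."""
    return -sum(p * math.log(q) for p, q in zip(p_true, q_pred) if p > 0)

target = [0.0, 1.0, 0.0]               # the true class is index 1
confident_right = [0.05, 0.90, 0.05]   # hypothetical model output
confident_wrong = [0.90, 0.05, 0.05]   # hypothetical model output

good = cross_entropy(target, confident_right)  # equals -log(0.90)
bad = cross_entropy(target, confident_wrong)   # equals -log(0.05)
print(good, bad)
```

The prediction that puts most of its mass on the true class gets the smaller value, which is why lower cross-entropy is better.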

Tags: python, neural-network, tensorflow, cross-entropy, logistic-regression

Classification problems, such as logistic regression or multinomial logistic regression, optimize a cross-entropy loss. Normally, the cross-entropy layer follows the softmax layer, which produces a probability distribution.

In tensorflow, there are at least a dozen different cross-entropy loss functions:

tf.losses.softmax_cross_entropy
tf.losses.sparse_softmax_cross_entropy
tf.losses.sigmoid_cross_entropy
tf.contrib.losses.softmax_cross_entropy
tf.contrib.losses.sigmoid_cross_entropy
tf.nn.softmax_cross_entropy_with_logits
tf.nn.sigmoid_cross_entropy_with_logits
...

Which ones work only for binary classification and which are suitable for multi-class problems? When should you use sigmoid instead of softmax? How are the sparse functions different from the others, and why do they exist only for softmax?

Related (more math-oriented) discussion: What are the differences between all these cross-entropy losses in Keras and TensorFlow?

asked Oct 31 '17 11:10

Maxim

1 Answer

Preliminary facts

Functionally, the sigmoid is a special case of the softmax function, when the number of classes equals 2. Both of them do the same operation: transform the logits (see below) to probabilities.

In simple binary classification, there's no big difference between the two. In multinomial classification, however, sigmoid can handle non-exclusive labels (a.k.a. multi-labels), while softmax deals with exclusive classes (see below).

A logit (also called a score) is a raw unscaled value associated with a class, before computing the probability. In terms of neural network architecture, this means that a logit is an output of a dense (fully-connected) layer.
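The claim that sigmoid is the two-class case of softmax can be checked numerically. The sketch below is plain Python; the helper names sigmoid and softmax are my own, and the second class is assumed to have a fixed logit of 0.

```python
import math

def sigmoid(x):
    """Logistic function: maps one logit to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

def softmax(logits):
    """Maps a list of logits to a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

logit = 1.7
two_class = softmax([logit, 0.0])  # second class pinned at logit 0

# Both numbers are the probability of the first class.
print(sigmoid(logit), two_class[0])
```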

Tensorflow naming is a bit strange: all of the functions below accept logits, not probabilities, and apply the transformation themselves (which is simply more efficient).

Sigmoid functions family

tf.nn.sigmoid_cross_entropy_with_logits
tf.nn.weighted_cross_entropy_with_logits
tf.losses.sigmoid_cross_entropy
tf.contrib.losses.sigmoid_cross_entropy (DEPRECATED)

As stated earlier, the sigmoid loss function is for binary classification. But the tensorflow functions are more general and also support multi-label classification, when the classes are independent. In other words, tf.nn.sigmoid_cross_entropy_with_logits solves N binary classifications at once.

The labels must be 0 or 1 for each class (so a row may contain several ones in the multi-label case), or they can contain soft class probabilities.
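As a rough illustration of "N binary classifications at once", here is a plain-Python sketch of the per-class loss computed directly from logits. It uses the numerically stable form max(x, 0) - x * z + log(1 + exp(-|x|)) given in the TensorFlow documentation for the sigmoid loss; the function name and the example values are assumptions for illustration.

```python
import math

def sigmoid_xent_with_logits(labels, logits):
    """Element-wise binary cross-entropy from logits (stable form).

    Equivalent to -z*log(sigmoid(x)) - (1-z)*log(1-sigmoid(x)),
    but does not overflow for logits of large magnitude.
    """
    return [
        max(x, 0.0) - x * z + math.log1p(math.exp(-abs(x)))
        for z, x in zip(labels, logits)
    ]

# One example with three independent (non-exclusive) classes.
logits = [2.0, -1.0, 0.5]
labels = [1.0, 0.0, 1.0]   # multi-label: classes 0 and 2 are both "on"

losses = sigmoid_xent_with_logits(labels, logits)
print(losses)  # one loss value per class
```

Each class gets its own loss term, so nothing forces the predicted probabilities to sum to 1 across classes.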

tf.losses.sigmoid_cross_entropy additionally lets you set in-batch weights, i.e. make some examples more important than others. tf.nn.weighted_cross_entropy_with_logits lets you set class weights (remember, the classification is binary), i.e. make errors on positive examples cost more than errors on negative examples. This is useful when the training data is unbalanced.
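The two kinds of weighting can be sketched in plain Python as follows. The positive-class term is scaled by pos_weight, as in the formula documented for tf.nn.weighted_cross_entropy_with_logits; the in-batch weights simply scale each example's loss. All names and numbers here are illustrative assumptions, and the final reduction over the batch is deliberately left out.

```python
import math

def weighted_sigmoid_xent(label, logit, pos_weight):
    """Binary cross-entropy where the positive-label term is scaled by pos_weight."""
    p = 1.0 / (1.0 + math.exp(-logit))
    return -pos_weight * label * math.log(p) - (1.0 - label) * math.log(1.0 - p)

# Class weighting: a missed positive costs 3x more than with pos_weight = 1.
plain = weighted_sigmoid_xent(1.0, -0.5, pos_weight=1.0)
boosted = weighted_sigmoid_xent(1.0, -0.5, pos_weight=3.0)

# In-batch weighting: scale each example's loss by its own weight.
batch_labels = [1.0, 0.0, 1.0]
batch_logits = [0.3, -1.2, 2.0]
example_weights = [1.0, 0.5, 2.0]   # hypothetical per-example importance
weighted_losses = [
    w * weighted_sigmoid_xent(z, x, pos_weight=1.0)
    for w, z, x in zip(example_weights, batch_labels, batch_logits)
]
print(plain, boosted, weighted_losses)
```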

Softmax functions family

tf.nn.softmax_cross_entropy_with_logits (DEPRECATED IN 1.5)
tf.nn.softmax_cross_entropy_with_logits_v2
tf.losses.softmax_cross_entropy
tf.contrib.losses.softmax_cross_entropy (DEPRECATED)

These loss functions should be used for multinomial mutually exclusive classification, i.e. pick one out of N classes. Also applicable when N = 2.

The labels must be one-hot encoded or can contain soft class probabilities: a particular example can belong to class A with 50% probability and class B with 50% probability. Note that strictly speaking it doesn't mean that it belongs to both classes, but one can interpret the probabilities this way.
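A small plain-Python sketch of the softmax cross-entropy with one-hot and with soft labels may help here. The helper names are assumptions; the math is the standard -sum(labels * log_softmax(logits)).

```python
import math

def log_softmax(logits):
    """Log of the softmax, computed stably via log-sum-exp."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]

def softmax_xent_with_logits(labels, logits):
    """Cross-entropy between a label distribution and softmax(logits)."""
    return -sum(p * lp for p, lp in zip(labels, log_softmax(logits)))

logits = [2.0, 1.0, 0.1]

hard = softmax_xent_with_logits([1.0, 0.0, 0.0], logits)  # one-hot: class A
soft = softmax_xent_with_logits([0.5, 0.5, 0.0], logits)  # 50% class A, 50% class B
print(hard, soft)
```

With soft labels, the loss is the label-weighted average of the per-class negative log-probabilities.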

Just like in the sigmoid family, tf.losses.softmax_cross_entropy lets you set in-batch weights, i.e. make some examples more important than others. As far as I know, as of tensorflow 1.3, there's no built-in way to set class weights.

[UPD] In tensorflow 1.5, a v2 version was introduced and the original softmax_cross_entropy_with_logits loss was deprecated. The only difference between them is that in the newer version, backpropagation happens into both logits and labels (here's a discussion why this may be useful).
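One possible workaround, offered as an assumption rather than a built-in feature, is to turn class weights into per-example weights and pass those as the in-batch weights. The plain-Python sketch below shows only the conversion; class_weights and the labels are made-up values.

```python
# Hypothetical class weights: class 1 is rare, so its examples count 5x.
class_weights = [1.0, 5.0, 1.0]

# One-hot labels for a batch of four examples.
onehot_labels = [
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [0, 1, 0],
]

# Each example's weight is the weight of its true class.
example_weights = [
    sum(w * y for w, y in zip(class_weights, row))
    for row in onehot_labels
]
print(example_weights)  # one weight per example
```

The resulting list has shape [batch_size], which is the shape an in-batch weights argument would typically expect.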

Sparse functions family

tf.nn.sparse_softmax_cross_entropy_with_logits
tf.losses.sparse_softmax_cross_entropy
tf.contrib.losses.sparse_softmax_cross_entropy (DEPRECATED)

Like the ordinary softmax above, these loss functions should be used for multinomial mutually exclusive classification, i.e. pick one out of N classes. The difference is in the label encoding: the classes are specified as integers (class indices), not one-hot vectors. Obviously, this doesn't allow soft classes, but it can save some memory when there are thousands or millions of classes. Note, however, that the logits argument must still contain a logit for each class, so it consumes at least [batch_size, classes] memory.

As above, the tf.losses version has a weights argument that sets the in-batch weights.
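The difference in label encoding can be sketched in plain Python: with an integer label, the loss is just the negative log-probability at that index, which matches the one-hot version. The function names below are illustrative assumptions.

```python
import math

def log_softmax(logits):
    """Log of the softmax, computed stably via log-sum-exp."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]

def sparse_softmax_xent(label_index, logits):
    """Label is a class index: pick out a single log-probability."""
    return -log_softmax(logits)[label_index]

def dense_softmax_xent(onehot_label, logits):
    """Label is a one-hot (or soft) vector: sum over all classes."""
    return -sum(p * lp for p, lp in zip(onehot_label, log_softmax(logits)))

logits = [0.2, 3.1, -1.0, 0.7]

sparse_loss = sparse_softmax_xent(1, logits)
dense_loss = dense_softmax_xent([0.0, 1.0, 0.0, 0.0], logits)
print(sparse_loss, dense_loss)  # the two encodings give the same loss
```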

Sampled softmax functions family

tf.nn.sampled_softmax_loss
tf.contrib.nn.rank_sampled_softmax_loss
tf.nn.nce_loss

These functions provide another alternative for dealing with a huge number of classes. Instead of computing and comparing an exact probability distribution, they compute a loss estimate from a random sample.

The arguments weights and biases specify a separate fully-connected layer that is used to compute the logits for a chosen sample.

Like above, labels are not one-hot encoded, but have the shape [batch_size, num_true].

Sampled functions are only suitable for training. At test time, it's recommended to use a standard softmax loss (either sparse or one-hot) to get an actual distribution.

Another alternative loss is tf.nn.nce_loss, which performs noise-contrastive estimation (if you're interested, see this very detailed discussion). I've included this function in the softmax family, because NCE guarantees approximation to softmax in the limit.
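The idea of estimating the loss from a random sample can be sketched in plain Python. This is a deliberately simplified illustration, not what tf.nn.sampled_softmax_loss does internally: it assumes uniform sampling of negative classes, takes precomputed logits instead of a weights/biases layer, and omits the correction for sampling probabilities. All names are hypothetical.

```python
import math
import random

def log_sum_exp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def full_softmax_loss(true_class, logits):
    """Exact softmax cross-entropy over all classes."""
    return log_sum_exp(logits) - logits[true_class]

def sampled_softmax_loss_sketch(true_class, logits, num_sampled, rng):
    """Softmax cross-entropy over the true class plus a few random negatives."""
    negatives = [c for c in range(len(logits)) if c != true_class]
    sampled = rng.sample(negatives, num_sampled)   # uniform, without replacement
    subset = [logits[true_class]] + [logits[c] for c in sampled]
    return log_sum_exp(subset) - subset[0]

rng = random.Random(0)
logits = [rng.uniform(-2.0, 2.0) for _ in range(1000)]  # pretend 1000 classes
true_class = 42

exact = full_softmax_loss(true_class, logits)
estimate = sampled_softmax_loss_sketch(true_class, logits, num_sampled=20, rng=rng)
print(exact, estimate)
```

Because only a subset of classes enters the normalizer, this uncorrected estimate never exceeds the exact loss, which is one reason the real functions apply a correction and why a full softmax is preferred at test time.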

answered Sep 21 '22 06:09

Maxim
