
What are the differences between all these cross-entropy losses in Keras and TensorFlow?

What are the differences between all these cross-entropy losses?

Keras provides

  • Binary cross-entropy
  • Categorical cross-entropy
  • Sparse categorical cross-entropy

While TensorFlow has

  • Softmax cross-entropy with logits
  • Sparse softmax cross-entropy with logits
  • Sigmoid cross-entropy with logits

What are the differences and relationships between them? What are the typical applications for them? What's the mathematical background? Are there other cross-entropy types that one should know? Are there any cross-entropy types without logits?

ScientiaEtVeritas asked Jun 21 '17 11:06




1 Answer

There is just one (Shannon) cross-entropy, defined as:

H(P||Q) = - SUM_i P(X=i) log Q(X=i) 

In machine learning usage, P is the actual (ground truth) distribution, and Q is the predicted distribution. All the functions you listed are just helper functions that accept different ways of representing P and Q.
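The definition above can be written out directly. The following is a minimal sketch in plain Python, not the Keras or TensorFlow implementation; the function name `cross_entropy` and the example distributions are made up for illustration.

```python
import math

def cross_entropy(p, q):
    """H(P||Q) = -SUM_i P(X=i) * log Q(X=i).

    Terms with P(X=i) = 0 contribute nothing and are skipped,
    which also avoids evaluating log Q(X=i) for them.
    """
    return -sum(p_i * math.log(q_i) for p_i, q_i in zip(p, q) if p_i > 0)

# Illustrative values only: three classes, predicted distribution Q.
prediction = [0.2, 0.7, 0.1]

# Hard target: class 1 is the single true class.
hard_target = [0.0, 1.0, 0.0]
# Soft target: 60% class 1, 40% class 2.
soft_target = [0.0, 0.6, 0.4]

hard_loss = cross_entropy(hard_target, prediction)  # reduces to -log(0.7)
soft_loss = cross_entropy(soft_target, prediction)

print(hard_loss, soft_loss)
```

With a hard target the sum collapses to a single term, the negative log of the probability assigned to the true class.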

There are basically 3 main things to consider:

  • there are either 2 possible outcomes (binary classification) or more. If there are just two outcomes, then Q(X=1) = 1 - Q(X=0), so a single float in (0,1) identifies the whole distribution. This is why a neural network for binary classification has a single output (and so does logistic regression). If there are K>2 possible outcomes, one has to define K outputs (one for each Q(X=...)).

  • one either produces proper probabilities (meaning that Q(X=i)>=0 and SUM_i Q(X=i) = 1), or one just produces a "score" and has some fixed method of transforming the score into a probability. For example, a single real number can be "transformed to a probability" by taking its sigmoid, and a set of real numbers can be transformed by taking their softmax.

  • either there is a j such that P(X=j)=1 (there is one "true class" and targets are "hard", like "this image represents a cat"), or there are "soft targets" (like "we are 60% sure this is a cat, and 40% sure it is actually a dog").
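The three aspects above can be illustrated with a few small helpers. This is a sketch in plain Python under the assumption that scores are ordinary floats; the names `sigmoid`, `softmax` and `one_hot` are local helpers written for this example, not library calls.

```python
import math

def sigmoid(score):
    """Map a single real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-score))

def softmax(scores):
    """Map K real-valued scores to K probabilities that sum to 1."""
    shift = max(scores)  # subtracting the max does not change the result
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def one_hot(index, num_classes):
    """Expand a hard target given as a class index into a full distribution P."""
    return [1.0 if i == index else 0.0 for i in range(num_classes)]

# Binary case: one score describes the whole distribution.
q1 = sigmoid(0.0)          # Q(X=1)
q0 = 1.0 - q1              # Q(X=0)

# Multi-class case: K scores, K probabilities.
probs = softmax([2.0, 1.0, 0.1])

# Hard target "class 2 of 3" written as a distribution.
p = one_hot(2, 3)

print(q0, q1, probs, p)
```

The "sparse" helpers take the class index directly, while the non-sparse ones expect the full distribution, which is what allows them to handle soft targets as well.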

Depending on these three aspects, a different helper function should be used:

                                  outcomes     what is in Q    targets in P
-------------------------------------------------------------------------------
binary CE                                2      probability         any
categorical CE                          >2      probability         soft
sparse categorical CE                   >2      probability         hard
sigmoid CE with logits                   2      score               any
softmax CE with logits                  >2      score               soft
sparse softmax CE with logits           >2      score               hard
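To see that the rows of this table are the same formula with different input representations, here is a minimal sketch in plain Python. The function names `categorical_ce`, `sparse_categorical_ce` and `binary_ce` are hypothetical stand-ins chosen to mirror the table, not the actual Keras functions, and the numbers are illustrative.

```python
import math

def categorical_ce(p, q):
    """Full definition: P and Q are both given as complete distributions."""
    return -sum(p_i * math.log(q_i) for p_i, q_i in zip(p, q) if p_i > 0)

def sparse_categorical_ce(true_index, q):
    """Hard target given as a class index instead of a one-hot vector."""
    return -math.log(q[true_index])

def binary_ce(p1, q1):
    """Two outcomes: P(X=1) and Q(X=1) each determine the whole distribution."""
    return categorical_ce([1.0 - p1, p1], [1.0 - q1, q1])

q = [0.2, 0.7, 0.1]

dense_loss = categorical_ce([0.0, 1.0, 0.0], q)  # one-hot target
sparse_loss = sparse_categorical_ce(1, q)        # same target as an index
binary_loss = binary_ce(1.0, 0.9)                # true class 1, Q(X=1) = 0.9

print(dense_loss, sparse_loss, binary_loss)
```

The dense and sparse versions give the same loss for a hard target. The "with logits" rows differ only in that the sigmoid or softmax is applied to the scores inside the helper.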

In the end one could just use "categorical cross entropy" everywhere, since this is how cross-entropy is mathematically defined. However, because hard targets and binary classification are so common, modern ML libraries provide these additional helper functions to make things simpler. In particular, "stacking" a sigmoid and a cross-entropy as two separate operations can be numerically unstable, but if one knows that the two are applied together, there is a numerically stable combined version (which is what TF implements).
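The stability issue can be reproduced in a few lines. The sketch below is plain Python with illustrative function names; the stable form, max(x, 0) - x*z + log(1 + exp(-|x|)), is the rearrangement given in the TensorFlow documentation for sigmoid_cross_entropy_with_logits, with x the logit and z the target.

```python
import math

def naive_sigmoid_ce(logit, target):
    """Sigmoid first, then cross-entropy, as two separate steps."""
    q = 1.0 / (1.0 + math.exp(-logit))
    return -(target * math.log(q) + (1.0 - target) * math.log(1.0 - q))

def stable_sigmoid_ce(logit, target):
    """Combined form: max(x, 0) - x*z + log(1 + exp(-|x|))."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))

# For moderate scores both versions agree.
moderate_naive = naive_sigmoid_ce(2.0, 1.0)
moderate_stable = stable_sigmoid_ce(2.0, 1.0)

# For a confident but wrong prediction the naive version breaks:
# the sigmoid rounds to exactly 1.0, so log(1 - q) becomes log(0).
try:
    naive_sigmoid_ce(40.0, 0.0)
    naive_failed = False
except ValueError:
    naive_failed = True

large_stable = stable_sigmoid_ce(40.0, 0.0)

print(moderate_naive, moderate_stable, naive_failed, large_stable)
```

In a framework working with floating-point tensors the naive version would typically yield inf or NaN rather than raising an exception, but the cause is the same.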

It is important to note that if you apply the wrong helper function, the code will usually still execute, but the results will be wrong. For example, if you apply a softmax_* helper to binary classification with a single output, the softmax over that one value is always 1, so your network will be treated as always producing "True" at the output.
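A quick check of the single-output case, as a sketch in plain Python with a hand-written `softmax` helper (not the library function):

```python
import math

def softmax(scores):
    """Map K real-valued scores to K probabilities that sum to 1."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# With a single output, softmax normalises the value against itself,
# so the resulting "probability" is 1 regardless of the score.
single_output_probs = [softmax([score])[0] for score in (-5.0, 0.0, 5.0)]

print(single_output_probs)
```

Because the output no longer depends on the score, the loss carries no information about the network's prediction and nothing useful can be learned from it.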

As a final note, this answer considers classification. Things are slightly different in the multi-label case (where a single point can have multiple labels): the values of P then no longer sum to 1, and one should use sigmoid_cross_entropy_with_logits despite having multiple output units.
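In the multi-label case each output unit is treated as its own binary problem. The following sketch in plain Python illustrates the idea; the function names are hypothetical, and summing the per-label losses is an assumption made for this example (averaging is an equally common choice).

```python
import math

def sigmoid_ce_with_logits(logit, target):
    """Binary cross-entropy on a raw score, in the numerically stable form."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))

def multi_label_loss(logits, targets):
    """One independent binary problem per label; losses summed (an assumption)."""
    return sum(sigmoid_ce_with_logits(x, z) for x, z in zip(logits, targets))

# Illustrative values: three labels, two of which apply to this sample,
# so the targets do not form a probability distribution.
logits = [3.0, -2.0, 0.5]
targets = [1.0, 0.0, 1.0]

per_label = [sigmoid_ce_with_logits(x, z) for x, z in zip(logits, targets)]
total = multi_label_loss(logits, targets)

print(per_label, total)
```

A softmax would force the three outputs to compete for a total probability of 1, which does not match targets where several labels can be true at once.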

lejlot answered Sep 20 '22 22:09