What is cross-entropy? [closed]

I know that there are a lot of explanations of what cross-entropy is, but I'm still confused.

Is it only a method to describe the loss function? Can we use the gradient descent algorithm to find the minimum of that loss function?

asked Feb 01 '17 by theateist

People also ask

What is the meaning of cross-entropy?

Cross-entropy is a measure of the difference between two probability distributions for a given random variable or set of events. You might recall that information quantifies the number of bits required to encode and transmit an event.

What is the meaning of cross-entropy loss?

Cross-entropy loss is a metric used to measure how well a classification model in machine learning performs. The loss is a non-negative number, with 0 corresponding to a perfect model; there is no upper bound, so it can exceed 1. The goal is generally to get your model as close to 0 as possible.

What is good cross-entropy?

One might wonder: what is a good value for cross-entropy loss, and how do I know if my training loss is good or bad? Some intuitive guidelines from a MachineLearningMastery post, for a mean loss computed with the natural log: Cross-Entropy = 0.00: perfect probabilities. Cross-Entropy < 0.02: great probabilities.
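
One way to read such guideline values (a small sketch, not taken from the quoted post): exponentiating the negative mean loss gives the geometric mean of the probabilities the model assigned to the true classes, so a mean loss of 0.02 nats corresponds to assigning roughly 98% probability to the correct class on average:

import numpy as np

mean_loss = 0.02            # hypothetical mean cross-entropy in nats
print(np.exp(-mean_loss))   # ~0.9802: geometric-mean probability on the true class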

What is entropy and cross-entropy?

Entropy is the average number of bits needed to encode an event drawn from the true distribution, while cross-entropy is the average number of bits needed when the encoding is based on a different (for example, predicted) distribution. The cross-entropy is always greater than or equal to the entropy, with equality only when the two distributions match.
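
As a quick numeric check of that relationship (a minimal sketch; the two distributions here are made up for illustration):

import numpy as np

p = np.array([0.5, 0.5])   # true distribution
q = np.array([0.8, 0.2])   # a different (e.g. predicted) distribution

entropy = -np.sum(p * np.log2(p))        # H(p): bits with the optimal code, 1.0
cross_entropy = -np.sum(p * np.log2(q))  # H(p, q): bits when coding with q, ~1.32
print(entropy, cross_entropy)            # cross-entropy >= entropy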

What is cross entropy in machine learning?

Cross-entropy measures the performance of a classification model based on the probability and error, where the more likely (or the bigger the probability) of something is, the lower the cross-entropy. Let’s look deeper into this. Cross entropy is a loss function that can be used to quantify the difference between two probability distributions.

How do you calculate cross entropy?

Cross-entropy can be calculated using the probabilities of the events from P and Q as follows:

H(P, Q) = - sum over x of P(x) * log2(Q(x))

where P(x) is the probability of the event x in P, Q(x) is the probability of event x in Q, and log2 is the base-2 logarithm, meaning that the result is in bits.
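
For example (a minimal sketch with made-up probabilities), the base-2 version maps directly onto NumPy:

import numpy as np

P = np.array([0.10, 0.40, 0.50])   # hypothetical true probabilities
Q = np.array([0.80, 0.15, 0.05])   # hypothetical predicted probabilities

H = -np.sum(P * np.log2(Q))        # cross-entropy of Q relative to P, in bits
print(H)                           # ~3.29 bits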

Why is the cross entropy so high in this case?

The cross-entropy is high in this case because there are several instances of misclassification in the predicted output. If the predicted probability for the default cases is improved to, say, 50%, the cross-entropy becomes lower than it was when the prediction for the default class was 35%.
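
That claim is easy to verify (a sketch that assumes the per-example loss for a true "default" case is simply the negative log of the probability predicted for it):

import numpy as np

print(-np.log(0.35))   # ~1.050 nats when 35% is predicted for the default class
print(-np.log(0.50))   # ~0.693 nats when 50% is predicted, i.e. a lower cross-entropy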

What is the difference between binary cross entropy and categorical cross entropy?

Binary Cross-Entropy: Cross-entropy as a loss function for a binary classification task. Categorical Cross-Entropy: Cross-entropy as a loss function for a multi-class classification task.
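
A minimal sketch of the two variants in NumPy (the labels and predicted probabilities below are made up for illustration):

import numpy as np

# Binary cross-entropy: true label y in {0, 1}, single predicted probability p_hat
y, p_hat = 1, 0.9
bce = -(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))
print(bce)   # ~0.105

# Categorical cross-entropy: one-hot true label, one predicted probability per class
y_onehot = np.array([0, 1, 0])
p_hat_vec = np.array([0.2, 0.7, 0.1])
cce = -np.sum(y_onehot * np.log(p_hat_vec))
print(cce)   # ~0.357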


2 Answers

Cross-entropy is commonly used to quantify the difference between two probability distributions. In the context of machine learning, it is a measure of error for categorical multi-class classification problems. Usually the "true" distribution (the one that your machine learning algorithm is trying to match) is expressed in terms of a one-hot distribution.

For example, suppose for a specific training instance, the true label is B (out of the possible labels A, B, and C). The one-hot distribution for this training instance is therefore:

Pr(Class A)  Pr(Class B)  Pr(Class C)
        0.0          1.0          0.0

You can interpret the above true distribution to mean that the training instance has 0% probability of being class A, 100% probability of being class B, and 0% probability of being class C.

Now, suppose your machine learning algorithm predicts the following probability distribution:

Pr(Class A)  Pr(Class B)  Pr(Class C)
      0.228        0.619        0.153

How close is the predicted distribution to the true distribution? That is what the cross-entropy loss determines. Use this formula:

H(p, q) = - sum over x of p(x) * log(q(x))

where p(x) is the true probability distribution (one-hot) and q(x) is the predicted probability distribution. The sum is over the three classes A, B, and C. In this case the loss is 0.479:

H = - (0.0*ln(0.228) + 1.0*ln(0.619) + 0.0*ln(0.153)) = 0.479

Logarithm base

Note that it does not matter what logarithm base you use as long as you consistently use the same one. As it happens, the Python Numpy log() function computes the natural log (log base e).

Python code

Here is the above example expressed in Python using Numpy:

import numpy as np

p = np.array([0, 1, 0])             # True probability (one-hot)
q = np.array([0.228, 0.619, 0.153]) # Predicted probability

cross_entropy_loss = -np.sum(p * np.log(q))
print(cross_entropy_loss)
# 0.47965000629754095

So that is how "wrong" or "far away" your prediction is from the true distribution. A machine learning optimizer will attempt to minimize the loss (i.e. it will try to reduce the loss from 0.479 to 0.0).

Loss units

We see in the above example that the loss is 0.4797. Because we are using the natural log (log base e), the units are nats, so we say that the loss is 0.4797 nats. If the log were instead base 2, the units would be bits. See this page for further explanation.
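
To make the unit relationship concrete, the same p and q from above can be evaluated with both logarithm bases; the two results differ exactly by a factor of ln(2):

import numpy as np

p = np.array([0, 1, 0])
q = np.array([0.228, 0.619, 0.153])

loss_nats = -np.sum(p * np.log(q))    # natural log -> nats, ~0.4797
loss_bits = -np.sum(p * np.log2(q))   # base-2 log  -> bits, ~0.6921
print(loss_nats, loss_bits, loss_nats / np.log(2))   # last value equals loss_bits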

More examples

To gain more intuition on what these loss values reflect, let's look at some extreme examples.

Again, let's suppose the true (one-hot) distribution is:

Pr(Class A)  Pr(Class B)  Pr(Class C)
        0.0          1.0          0.0

Now suppose your machine learning algorithm did a really great job and predicted class B with very high probability:

Pr(Class A)  Pr(Class B)  Pr(Class C)
      0.001        0.998        0.001

When we compute the cross entropy loss, we can see that the loss is tiny, only 0.002:

p = np.array([0, 1, 0])
q = np.array([0.001, 0.998, 0.001])
print(-np.sum(p * np.log(q)))
# 0.0020020026706730793

At the other extreme, suppose your ML algorithm did a terrible job and predicted class C with high probability instead. The resulting loss of 6.91 will reflect the larger error.

Pr(Class A)  Pr(Class B)  Pr(Class C)
      0.001        0.001        0.998

p = np.array([0, 1, 0])
q = np.array([0.001, 0.001, 0.998])
print(-np.sum(p * np.log(q)))
# 6.907755278982137

Now, what happens in the middle of these two extremes? Suppose your ML algorithm can't make up its mind and predicts the three classes with nearly equal probability.

Pr(Class A)  Pr(Class B)  Pr(Class C)
      0.333        0.333        0.334

The resulting loss is 1.10.

p = np.array([0, 1, 0])
q = np.array([0.333, 0.333, 0.334])
print(-np.sum(p * np.log(q)))
# 1.0996127890016931

Fitting into gradient descent

Cross entropy is one out of many possible loss functions (another popular one is SVM hinge loss). These loss functions are typically written as J(theta) and can be used within gradient descent, which is an iterative algorithm to move the parameters (or coefficients) towards the optimum values. In the equation below, you would replace J(theta) with H(p, q). But note that you need to compute the derivative of H(p, q) with respect to the parameters first.

theta := theta - alpha * dJ(theta)/d(theta)    (gradient descent update rule)
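
As a concrete illustration, here is a minimal sketch of that loop for a linear softmax classifier trained with cross-entropy; the toy data, learning rate, and iteration count are made up for illustration:

import numpy as np

# Hypothetical toy data: 4 examples, 2 features, 3 classes (one-hot labels)
X = np.array([[1.0, 2.0], [2.0, 1.0], [0.5, 0.5], [3.0, 0.2]])
Y = np.array([[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 0, 1]])

theta = np.zeros((2, 3))   # parameters: one weight vector per class
alpha = 0.1                # learning rate (arbitrary choice)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)   # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

for step in range(200):
    Q = softmax(X @ theta)                          # predicted distributions q
    loss = -np.mean(np.sum(Y * np.log(Q), axis=1))  # mean cross-entropy H(p, q)
    grad = X.T @ (Q - Y) / len(X)                   # derivative of the loss w.r.t. theta
    theta = theta - alpha * grad                    # the update rule shown above
print(loss)   # the mean cross-entropy after training, lower than at the start

For softmax with one-hot labels, the derivative of the mean cross-entropy conveniently simplifies to X.T @ (Q - Y) / N, which is what the grad line uses.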

So to answer your original questions directly:

Is it only a method to describe the loss function?

Correct, cross-entropy describes the loss between two probability distributions. It is one of many possible loss functions.

Then we can use, for example, the gradient descent algorithm to find the minimum.

Yes, the cross-entropy loss function can be used as part of gradient descent.

Further reading: one of my other answers related to TensorFlow.

answered Oct 10 '22 by stackoverflowuser2010


In short, cross-entropy (CE) is a measure of how far your predicted values are from the true labels.

The "cross" refers to calculating the entropy between two distributions: the predictions and the true labels (like 0, 1).

The term "entropy" itself refers to randomness, so a large value means your predictions are far from the real labels.

The weights are therefore updated to reduce CE, which ultimately reduces the difference between the predictions and the true labels and thus improves accuracy.

answered Oct 10 '22 by Harsh Malra