Why binary_crossentropy and categorical_crossentropy give different performances for the same problem?

Question

The reason for this apparent performance discrepancy between categorical & binary cross entropy is what user xtof54 has already reported in his answer below, i.e.:

the accuracy computed with the Keras method evaluate is just plain wrong when using binary_crossentropy with more than 2 labels

I would like to elaborate more on this, demonstrate the actual underlying issue, explain it, and offer a remedy.

This behavior is not a bug; the underlying reason is a rather subtle & undocumented issue at how Keras actually guesses which accuracy to use, depending on the loss function you have selected, when you include simply metrics=['accuracy'] in your model compilation. In other words, while your first compilation option

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

is valid, your second one:

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

will not produce what you expect, but the reason is not the use of binary cross entropy (which, at least in principle, is an absolutely valid loss function).

Why is that? If you check the metrics source code, Keras does not define a single accuracy metric, but several different ones, among them binary_accuracy and categorical_accuracy. What happens under the hood is that, since you have selected binary cross entropy as your loss function and have not specified a particular accuracy metric, Keras (wrongly...) infers that you are interested in the binary_accuracy, and this is what it returns - while in fact you are interested in the categorical_accuracy.

Let's verify that this is the case, using the MNIST CNN example in Keras, with the following modification:

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])  # WRONG way

model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=2,  # only 2 epochs, for demonstration purposes
          verbose=1,
          validation_data=(x_test, y_test))

# Keras reported accuracy:
score = model.evaluate(x_test, y_test, verbose=0) 
score[1]
# 0.9975801164627075

# Actual accuracy calculated manually:
import numpy as np
y_pred = model.predict(x_test)
acc = sum([np.argmax(y_test[i])==np.argmax(y_pred[i]) for i in range(10000)])/10000
acc
# 0.98780000000000001

score[1]==acc
# False

To remedy this, i.e. to use indeed binary cross entropy as your loss function (as I said, nothing wrong with this, at least in principle) while still getting the categorical accuracy required by the problem at hand, you should ask explicitly for categorical_accuracy in the model compilation as follows:

from keras.metrics import categorical_accuracy
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=[categorical_accuracy])

In the MNIST example, after training, scoring, and predicting the test set as I show above, the two metrics now are the same, as they should be:

# Keras reported accuracy:
score = model.evaluate(x_test, y_test, verbose=0) 
score[1]
# 0.98580000000000001

# Actual accuracy calculated manually:
y_pred = model.predict(x_test)
acc = sum([np.argmax(y_test[i])==np.argmax(y_pred[i]) for i in range(10000)])/10000
acc
# 0.98580000000000001

score[1]==acc
# True

System setup:

Python version 3.5.3
Tensorflow version 1.2.1
Keras version 2.0.4

UPDATE: After my post, I discovered that this issue had already been identified in this answer.

It all depends on the type of classification problem you are dealing with. There are three main categories

binary classification (two target classes),
multi-class classification (more than two exclusive targets),
multi-label classification (more than two non exclusive targets), in which multiple target classes can be on at the same time.

In the first case, binary cross-entropy should be used and targets should be encoded as one-hot vectors.

In the second case, categorical cross-entropy should be used and targets should be encoded as one-hot vectors.

In the last case, binary cross-entropy should be used and targets should be encoded as one-hot vectors. Each output neuron (or unit) is considered as a separate random binary variable, and the loss for the entire vector of outputs is the product of the loss of single binary variables. Therefore it is the product of binary cross-entropy for each single output unit.

The binary cross-entropy is defined as

enter image description here

and categorical cross-entropy is defined as

enter image description here

where c is the index running over the number of classes C.

I came across an "inverted" issue — I was getting good results with categorical_crossentropy (with 2 classes) and poor with binary_crossentropy. It seems that problem was with wrong activation function. The correct settings were:

for binary_crossentropy: sigmoid activation, scalar target
for categorical_crossentropy: softmax activation, one-hot encoded target

It's really interesting case. Actually in your setup the following statement is true:

binary_crossentropy = len(class_id_index) * categorical_crossentropy

This means that up to a constant multiplication factor your losses are equivalent. The weird behaviour that you are observing during a training phase might be an example of a following phenomenon:

At the beginning the most frequent class is dominating the loss - so network is learning to predict mostly this class for every example.
After it learnt the most frequent pattern it starts discriminating among less frequent classes. But when you are using adam - the learning rate has a much smaller value than it had at the beginning of training (it's because of the nature of this optimizer). It makes training slower and prevents your network from e.g. leaving a poor local minimum less possible.

That's why this constant factor might help in case of binary_crossentropy. After many epochs - the learning rate value is greater than in categorical_crossentropy case. I usually restart training (and learning phase) a few times when I notice such behaviour or/and adjusting a class weights using the following pattern:

class_weight = 1 / class_frequency

This makes loss from a less frequent classes balancing the influence of a dominant class loss at the beginning of a training and in a further part of an optimization process.

EDIT:

Actually - I checked that even though in case of maths:

binary_crossentropy = len(class_id_index) * categorical_crossentropy

should hold - in case of keras it's not true, because keras is automatically normalizing all outputs to sum up to 1. This is the actual reason behind this weird behaviour as in case of multiclassification such normalization harms a training.

Why binary_crossentropy and categorical_crossentropy give different performances for the same problem?

Tags:

machine-learning

neural-network

deep-learning

keras

conv-neural-network

Recent Activity

Donate For Us

Why binary_crossentropy and categorical_crossentropy give different performances for the same problem?

Tags:

machine-learning

neural-network

deep-learning

keras

conv-neural-network

Related questions

Recent Activity

Donate For Us