Why is binary_crossentropy more accurate than categorical_crossentropy for multiclass classification in Keras?

I'm learning how to create convolutional neural networks using Keras. I'm trying to get a high accuracy on the MNIST dataset.

Apparently categorical_crossentropy is for more than two classes and binary_crossentropy is for two classes. Since there are 10 digits, I should be using categorical_crossentropy. However, after training and testing dozens of models, binary_crossentropy consistently and significantly outperforms categorical_crossentropy.

On Kaggle, I got 99+% accuracy using binary_crossentropy and 10 epochs. Meanwhile, I can't get above 97% using categorical_crossentropy, even using 30 epochs (which isn't much, but I don't have a GPU, so training takes forever).

Here's what my model looks like now:

from keras.models import Sequential
from keras.layers import Convolution2D, MaxPooling2D, Dropout, Flatten, Dense

model = Sequential()
model.add(Convolution2D(100, 5, 5, border_mode='valid', input_shape=(28, 28, 1), init='glorot_uniform', activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Convolution2D(100, 3, 3, init='glorot_uniform', activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.3))
model.add(Flatten())
model.add(Dense(100, init='glorot_uniform', activation='relu'))
model.add(Dropout(0.3))
model.add(Dense(100, init='glorot_uniform', activation='relu'))
model.add(Dropout(0.3))
model.add(Dense(10, init='glorot_uniform', activation='softmax'))
model.compile(loss='binary_crossentropy', optimizer='adamax', metrics=['accuracy'])
asked Dec 26 '16 by Leo Jiang

People also ask

Why is binary_crossentropy used?

binary_crossentropy: used as a loss function for binary classification models; it computes the cross-entropy loss between the true labels and the predicted probabilities. categorical_crossentropy: used as a loss function for multi-class classification models, where there are two or more output labels.
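For reference, here is a minimal NumPy sketch of the binary cross-entropy formula, BCE = -mean(y*log(p) + (1-y)*log(1-p)), using made-up labels and predictions:

import numpy as np

y = np.array([1., 0., 1.])     # made-up true binary labels
p = np.array([0.9, 0.2, 0.7])  # made-up predicted probabilities
bce = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
print(bce)  # ~0.228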

Can I use binary cross-entropy for multi-class classification?

Binary cross-entropy is for binary classification and categorical cross-entropy is for multi-class classification, but both work for binary classification; for categorical cross-entropy you need to convert the labels to categorical (one-hot encoding).
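As a sketch, one-hot encoding integer labels with Keras' to_categorical (in older Keras versions the import path is keras.utils.np_utils):

from keras.utils import to_categorical
import numpy as np

y = np.array([3, 0, 9])           # integer class labels
y_onehot = to_categorical(y, 10)  # shape (3, 10), a single 1 per row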

When should I use categorical_crossentropy?

Use sparse categorical crossentropy when your classes are mutually exclusive (e.g. when each sample belongs to exactly one class) and categorical crossentropy when one sample can have multiple classes or the labels are soft probabilities (like [0.5, 0.3, 0.2]).

What is the difference between sparse_categorical_crossentropy and categorical_crossentropy?

Simply: categorical_crossentropy (cce) expects the targets as a one-hot array, with one entry per category, while sparse_categorical_crossentropy (scce) expects a single integer index for the matching category.
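A minimal sketch of the difference at compile/fit time (x_train, y_onehot, and y_int are hypothetical arrays holding the inputs, one-hot labels, and integer labels respectively):

# categorical_crossentropy expects one-hot targets, e.g. [0, 0, 1, 0, ...]:
model.compile(loss='categorical_crossentropy', optimizer='adamax', metrics=['accuracy'])
model.fit(x_train, y_onehot)

# sparse_categorical_crossentropy expects integer targets, e.g. 2:
model.compile(loss='sparse_categorical_crossentropy', optimizer='adamax', metrics=['accuracy'])
model.fit(x_train, y_int)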


2 Answers

Short answer: it is not.

To see that, simply try to calculate the accuracy "by hand", and you will see that it is different from the one reported by Keras with the model.evaluate method:

# Keras reported accuracy:
score = model.evaluate(x_test, y_test, verbose=0)
score[1]
# 0.99794011611938471

# Actual accuracy calculated manually:
import numpy as np
y_pred = model.predict(x_test)
acc = sum([np.argmax(y_test[i]) == np.argmax(y_pred[i]) for i in range(10000)]) / 10000
acc
# 0.98999999999999999

The reason it seems to be so is a rather subtle issue in how Keras guesses which accuracy to use, depending on the loss function you have selected, when you simply include metrics=['accuracy'] in your model compilation.

If you check the source code, Keras does not define a single accuracy metric, but several different ones, among them binary_accuracy and categorical_accuracy. What happens under the hood is that, since you have selected binary cross entropy as your loss function and have not specified a particular accuracy metric, Keras (wrongly...) infers that you are interested in the binary_accuracy, and this is what it returns.
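To make the difference concrete, here is a small NumPy sketch with made-up values, mimicking binary_accuracy (which Keras computes as K.mean(K.equal(y_true, K.round(y_pred)))) and categorical_accuracy for a single one-hot sample whose true class is 3 but whose predicted argmax is 5:

import numpy as np

y_true = np.zeros((1, 10))
y_true[0, 3] = 1.0               # one-hot target: class 3
y_pred = np.full((1, 10), 0.02)  # made-up softmax output
y_pred[0, 3] = 0.20
y_pred[0, 5] = 0.62              # argmax is class 5: a misclassification

# categorical_accuracy: does the argmax match? No -> 0.0
cat_acc = np.mean(np.argmax(y_true, axis=-1) == np.argmax(y_pred, axis=-1))

# binary_accuracy: compare each of the 10 entries to its target after
# rounding at 0.5; 8 of the 10 zeros still match -> 0.8
bin_acc = np.mean(y_true == np.round(y_pred))

print(cat_acc, bin_acc)  # 0.0 0.8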

To avoid that, i.e. to indeed use binary cross-entropy as your loss function (nothing wrong with this, in principle) while still getting the categorical accuracy required by the problem at hand (i.e. MNIST classification), you should explicitly ask for categorical_accuracy in the model compilation, as follows:

from keras.metrics import categorical_accuracy

model.compile(loss='binary_crossentropy', optimizer='adamax', metrics=[categorical_accuracy])

And after training, scoring, and predicting the test set as I show above, the two metrics are now the same, as they should be:

sum([np.argmax(y_test[i]) == np.argmax(y_pred[i]) for i in range(10000)]) / 10000 == score[1]
# True

(HT to this great answer to a similar problem, which helped me understand the issue...)

UPDATE: After my post, I discovered that this issue had already been identified in this answer.

answered by desertnaut


First of all, binary_crossentropy is not only for cases where there are two classes.

The "binary" name comes from the fact that it is adapted to binary targets: each entry of the softmax output is treated independently and pushed toward 0 or 1, so the loss checks each entry of the output vector separately.

This doesn't explain your result, though, since categorical_crossentropy exploits the fact that this is a single-label classification problem.

Are you sure that, when you read your data, there is one and only one class per sample? It's the only explanation I can give; a quick check is sketched below.
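A quick sanity check, assuming the labels live in a one-hot array named y_train (a hypothetical name):

import numpy as np

# A one-hot label matrix should contain only 0s and 1s,
# and each row should contain exactly one 1:
assert set(np.unique(y_train)) <= {0, 1}
assert np.all(y_train.sum(axis=1) == 1)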

answered by Labo