I am building a simple CNN for binary image classification, and the AUC obtained from model.evaluate() is much higher than AUC obtained from model.predict() + roc_auc_score().
The whole notebook is here.
Compiling the model, and the output of model.fit():
from tensorflow.keras.optimizers import RMSprop

model.compile(loss='binary_crossentropy',
              optimizer=RMSprop(lr=0.001),
              metrics=['AUC'])

history = model.fit(
    train_generator,
    steps_per_epoch=8,
    epochs=5,
    verbose=1)
Epoch 1/5 8/8 [==============================] - 21s 3s/step - loss: 6.7315 - auc: 0.5143
Epoch 2/5 8/8 [==============================] - 15s 2s/step - loss: 0.6626 - auc: 0.6983
Epoch 3/5 8/8 [==============================] - 18s 2s/step - loss: 0.4296 - auc: 0.8777
Epoch 4/5 8/8 [==============================] - 14s 2s/step - loss: 0.2330 - auc: 0.9606
Epoch 5/5 8/8 [==============================] - 18s 2s/step - loss: 0.1985 - auc: 0.9767
Then model.evaluate() gives something similar:
model.evaluate(train_generator)
9/9 [==============================] - 10s 1s/step - loss: 0.3056 - auc: 0.9956
But the AUC calculated directly from the model.predict() output is about half of that:
from sklearn import metrics
x = model.predict(train_generator)
metrics.roc_auc_score(train_generator.labels, x)
0.5006148007590132
I have read several posts on similar issues (like this, this, this, and also an extensive discussion on GitHub), but they describe reasons that are irrelevant to my case.
Any suggestions are much appreciated. Thanks!
EDIT: Solution. I found the solution here; I just needed to call
train_generator.reset()
before model.predict, and also set shuffle=False in the flow_from_directory() call. The reason for the difference is that the generator outputs batches starting from a different position, so the labels and predictions do not match, because they refer to different samples. So the problem is not with the evaluate or predict methods, but with the generator; a minimal sketch is shown below.
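For reference, a minimal sketch of that approach (names follow the question; roc_auc_score comes from sklearn.metrics as above):

from sklearn import metrics

# The generator must be created with shuffle=False (set in flow_from_directory),
# and reset so prediction starts from the first sample.
train_generator.reset()
x = model.predict(train_generator)
print(metrics.roc_auc_score(train_generator.labels, x))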
EDIT 2: Using train_generator.reset() is not convenient if the generator is created with flow_from_directory(), because it requires setting shuffle=False in flow_from_directory(), and that produces batches containing a single class during training, which affects learning. So I ended up redefining train_generator before running predict, as sketched below.
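A rough sketch of that workaround, assuming the generator comes from an ImageDataGenerator called train_datagen (the directory path, target size, and batch size below are placeholders, not the notebook's actual values):

# Separate, non-shuffled generator used only for prediction, so the
# training generator can keep shuffle=True.
pred_generator = train_datagen.flow_from_directory(
    'train_dir',                 # placeholder path
    target_size=(150, 150),      # placeholder size
    batch_size=20,               # placeholder batch size
    class_mode='binary',
    shuffle=False)               # keep sample order fixed

x = model.predict(pred_generator)
print(metrics.roc_auc_score(pred_generator.labels, x))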
The Keras evaluate() function will give you the loss value for every batch. The Keras predict() function will give you the actual predictions for all samples in a batch, for all batches.
fit() is for training the model with the given inputs (and corresponding training labels). evaluate() is for evaluating the already trained model using the validation (or test) data and the corresponding labels.
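In the question's terms, the difference looks roughly like this (a sketch, not the notebook's exact code):

# evaluate() returns the aggregated loss and the compiled metrics (here, AUC).
loss, auc = model.evaluate(train_generator)

# predict() returns one raw output per sample; shape (N, 1) here, since the
# last layer is a single Dense neuron.
preds = model.predict(train_generator)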
You need to create the accuracy yourself in model_fn using tf.metrics.accuracy and pass it to the eval_metric_ops that will be returned by the function. Then the output of estimator.evaluate() will include that metric.
Model.predict passes the input vector through the model and returns the output tensor for each datapoint. Since the last layer in your model is a single Dense neuron, the output for any datapoint is a single value. And since you didn't specify an activation for the last layer, it will default to linear activation.
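For illustration only (hypothetical names, not the notebook's code), you can either give the last layer an explicit sigmoid, or squash the raw predict() outputs afterwards so they fall in [0, 1]:

import tensorflow as tf
from tensorflow.keras.layers import Dense

# Option 1 (hypothetical layer): make the last layer output probabilities directly.
output_layer = Dense(1, activation='sigmoid')

# Option 2: apply a sigmoid to the raw (linear) predictions after the fact.
probs = tf.sigmoid(model.predict(train_generator)).numpy()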
The tensorflow.keras AUC metric computes an approximate AUC (area under the curve) via a Riemann sum, which is not the same implementation as scikit-learn's. If you want to find the AUC with tensorflow.keras, try:
import tensorflow as tf
m = tf.keras.metrics.AUC()
m.update_state(train_generator.labels, x) # assuming both have shape (N,)
r = m.result().numpy()
print(r)