Loss of CNN in Keras becomes nan at some point of training

I am training the last layer of VGG16 in Keras. My model looks like this:

import keras
import numpy as np
from keras.applications.vgg16 import VGG16
from keras.layers import Flatten, Dense
from keras.models import Model
from sklearn.utils import class_weight

map_characters1 = {0: 'No Pneumonia', 1: 'Yes Pneumonia'}
class_weight1 = class_weight.compute_class_weight('balanced', np.unique(y_train), y_train)
weight_path1 = './imagenet_models/vgg16_weights_tf_dim_ordering_tf_kernels_notop.h5'
pretrained_model_1 = VGG16(weights='imagenet', include_top=False, input_shape=(200, 200, 3))

optimizer1 = keras.optimizers.Adam(lr=0.0001)

def pretrainedNetwork(xtrain, ytrain, xtest, ytest, pretrainedmodel, pretrainedweights,
                      classweight, numclasses, numepochs, optimizer, labels):
    base_model = pretrained_model_1  # topless VGG16
    # Add top layer
    x = base_model.output
    x = Flatten()(x)
    predictions = Dense(numclasses, activation='relu')(x)
    model = Model(inputs=base_model.input, outputs=predictions)
    # Freeze the convolutional base and train only the top layer
    for layer in base_model.layers:
        layer.trainable = False
    model.compile(loss='categorical_crossentropy',
                  optimizer=optimizer,
                  metrics=['accuracy'])
    callbacks_list = [keras.callbacks.EarlyStopping(monitor='val_acc', patience=3, verbose=1)]
    model.summary()
    # Fit model (MetricsCheckpoint is a custom callback defined elsewhere)
    history = model.fit(xtrain, ytrain, epochs=numepochs, class_weight=classweight,
                        validation_data=(xtest, ytest), verbose=1,
                        callbacks=[MetricsCheckpoint('logs')])
    # Evaluate model
    score = model.evaluate(xtest, ytest, verbose=0)
    print('\nKeras CNN - accuracy:', score[1], '\n')
    return model

Training looks fine at the beginning: the loss decreases and the accuracy increases. But then the loss becomes nan and the accuracy drops to 0.5, the same as random guessing.

The model:

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_1 (InputLayer)         (None, 200, 200, 3)       0         
_________________________________________________________________
block1_conv1 (Conv2D)        (None, 200, 200, 64)      1792      
_________________________________________________________________
block1_conv2 (Conv2D)        (None, 200, 200, 64)      36928     
_________________________________________________________________
block1_pool (MaxPooling2D)   (None, 100, 100, 64)      0         
_________________________________________________________________
block2_conv1 (Conv2D)        (None, 100, 100, 128)     73856     
_________________________________________________________________
block2_conv2 (Conv2D)        (None, 100, 100, 128)     147584    
_________________________________________________________________
block2_pool (MaxPooling2D)   (None, 50, 50, 128)       0         
_________________________________________________________________
block3_conv1 (Conv2D)        (None, 50, 50, 256)       295168    
_________________________________________________________________
block3_conv2 (Conv2D)        (None, 50, 50, 256)       590080    
_________________________________________________________________
block3_conv3 (Conv2D)        (None, 50, 50, 256)       590080    
_________________________________________________________________
block3_pool (MaxPooling2D)   (None, 25, 25, 256)       0         
_________________________________________________________________
block4_conv1 (Conv2D)        (None, 25, 25, 512)       1180160   
_________________________________________________________________
block4_conv2 (Conv2D)        (None, 25, 25, 512)       2359808   
_________________________________________________________________
block4_conv3 (Conv2D)        (None, 25, 25, 512)       2359808   
_________________________________________________________________
block4_pool (MaxPooling2D)   (None, 12, 12, 512)       0         
_________________________________________________________________
block5_conv1 (Conv2D)        (None, 12, 12, 512)       2359808   
_________________________________________________________________
block5_conv2 (Conv2D)        (None, 12, 12, 512)       2359808   
_________________________________________________________________
block5_conv3 (Conv2D)        (None, 12, 12, 512)       2359808   
_________________________________________________________________
block5_pool (MaxPooling2D)   (None, 6, 6, 512)         0         
_________________________________________________________________
flatten_2 (Flatten)          (None, 18432)             0         
_________________________________________________________________
dense_2 (Dense)              (None, 2)                 36866     
=================================================================
Total params: 14,751,554
Trainable params: 36,866
Non-trainable params: 14,714,688

Training output:

Train on 2682 samples, validate on 468 samples

Epoch 1/6
2682/2682 [==============================] - 621s 232ms/step - loss: 1.5150 - acc: 0.7662 - val_loss: 0.4117 - val_acc: 0.8526

Epoch 2/6
2682/2682 [==============================] - 615s 229ms/step - loss: 0.2535 - acc: 0.9459 - val_loss: 1.7812 - val_acc: 0.7009

Epoch 3/6
2682/2682 [==============================] - 621s 232ms/step - loss: nan - acc: 0.7468 - val_loss: nan - val_acc: 0.5000

Epoch 4/6
2682/2682 [==============================] - 644s 240ms/step - loss: nan - acc: 0.5000 - val_loss: nan - val_acc: 0.5000

Epoch 5/6
2682/2682 [==============================] - 616s 230ms/step - loss: nan - acc: 0.5000 - val_loss: nan - val_acc: 0.5000

Where could the problem be? What is happening to the loss?

asked by Ekaterina Tcareva

2 Answers

You have an exploding gradient. To simplify, consider convex optimization by gradient descent. The goal of the neural network is to optimize the weights so that the derivative of the loss becomes zero, at the bottom (green) of the following figure:

[Figure: Gradient Descent]

[Figure: Gradient 2]

An exploding gradient occurs when the gradient becomes almost parallel to the Sum of Squared Errors axis, generating nans.

There are several fixes for this, such as batch normalization, careful weight initialization, the use of ReLU activation functions, and a smaller learning rate. For vanishing gradients in LSTMs, even the choice of optimizer matters.
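For example, a smaller learning rate and gradient clipping can both be set directly on the optimizer. This is only a sketch, not code from the question; Keras optimizers accept a clipnorm argument that caps the gradient norm per update:

from keras import optimizers

# Sketch: lower the learning rate (1e-4 -> 1e-5) and clip the gradient norm
# so a single bad batch cannot push the weights to inf/nan.
optimizer1 = optimizers.Adam(lr=1e-5, clipnorm=1.0)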

If your learning rate is not small enough, training may zig-zag across the loss surface and miss the local minimum:

[Figure: big learning rate]
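One way to avoid hand-tuning the learning rate is to shrink it automatically when the validation loss stops improving. As a sketch that reuses the model, data, and variable names from the question above, Keras provides ReduceLROnPlateau and TerminateOnNaN callbacks:

from keras.callbacks import ReduceLROnPlateau, TerminateOnNaN

# Sketch: halve the learning rate after 2 epochs without val_loss improvement,
# and stop training immediately if the loss ever becomes nan.
callbacks_list = [
    ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=2, verbose=1),
    TerminateOnNaN(),
]
model.fit(xtrain, ytrain, epochs=numepochs, class_weight=classweight,
          validation_data=(xtest, ytest), callbacks=callbacks_list)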

answered by razimbres

The problem was that I used activation='relu' in the prediction layer. With relu the outputs are not normalized probabilities, so categorical_crossentropy can end up taking the log of zero. I changed it to 'softmax' and now it works!
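For reference, the fix is a one-line change to the prediction layer in the function above:

# Before (produces nan): unbounded, possibly zero outputs fed into the log in the loss
predictions = Dense(numclasses, activation='relu')(x)

# After: softmax outputs sum to 1, so categorical_crossentropy stays finite
predictions = Dense(numclasses, activation='softmax')(x)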

answered by Ekaterina Tcareva