Validation loss is lower than training loss when training an LSTM

I am training an LSTM using tflearn (on top of TensorFlow). I have split the data into training (90%) and validation (10%) for this purpose. As I understand it, a model usually fits the training data better than the validation data, but I am getting the opposite result: loss is lower and accuracy is higher on the validation set.

As I have read in other answers, this can happen because dropout is not applied during validation. However, when I remove dropout from my LSTM architecture, the validation loss is still lower than the training loss (though the difference is smaller).

Also, the loss shown at the end of each epoch is not an average of the losses over the batches (as it is in Keras); it is the loss for the last batch. I also thought this could explain my results, but it turned out it does not.
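
To make that distinction concrete, here is a minimal sketch with made-up batch losses (the values are hypothetical, not taken from the run below, which has 4 training steps per epoch):

batch_losses = [1.41, 1.20, 0.95, 1.08]  # hypothetical per-batch losses within one epoch

last_batch_loss = batch_losses[-1]                        # what tflearn prints: 1.08
mean_epoch_loss = sum(batch_losses) / len(batch_losses)   # Keras-style average: ~1.16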

Training samples: 783
Validation samples: 87
--
Training Step: 4  | total loss: 1.08214 | time: 1.327s
| Adam | epoch: 001 | loss: 1.08214 - acc: 0.7549 | val_loss: 0.53043 - val_acc: 0.9885 -- iter: 783/783
--
Training Step: 8  | total loss: 0.41462 | time: 1.117s
| Adam | epoch: 002 | loss: 0.41462 - acc: 0.9759 | val_loss: 0.17027 - val_acc: 1.0000 -- iter: 783/783
--
Training Step: 12  | total loss: 0.15111 | time: 1.124s
| Adam | epoch: 003 | loss: 0.15111 - acc: 0.9984 | val_loss: 0.07488 - val_acc: 1.0000 -- iter: 783/783
--
Training Step: 16  | total loss: 0.10145 | time: 1.114s
| Adam | epoch: 004 | loss: 0.10145 - acc: 0.9950 | val_loss: 0.04173 - val_acc: 1.0000 -- iter: 783/783
--
Training Step: 20  | total loss: 0.26568 | time: 1.124s
| Adam | epoch: 005 | loss: 0.26568 - acc: 0.9615 | val_loss: 0.03077 - val_acc: 1.0000 -- iter: 783/783
--
Training Step: 24  | total loss: 0.11023 | time: 1.129s
| Adam | epoch: 006 | loss: 0.11023 - acc: 0.9863 | val_loss: 0.02607 - val_acc: 1.0000 -- iter: 783/783
--
Training Step: 28  | total loss: 0.07059 | time: 1.141s
| Adam | epoch: 007 | loss: 0.07059 - acc: 0.9934 | val_loss: 0.01882 - val_acc: 1.0000 -- iter: 783/783
--
Training Step: 32  | total loss: 0.03571 | time: 1.122s
| Adam | epoch: 008 | loss: 0.03571 - acc: 0.9977 | val_loss: 0.01524 - val_acc: 1.0000 -- iter: 783/783
--
Training Step: 36  | total loss: 0.05084 | time: 1.120s
| Adam | epoch: 009 | loss: 0.05084 - acc: 0.9948 | val_loss: 0.01384 - val_acc: 1.0000 -- iter: 783/783
--
Training Step: 40  | total loss: 0.22283 | time: 1.132s
| Adam | epoch: 010 | loss: 0.22283 - acc: 0.9714 | val_loss: 0.01227 - val_acc: 1.0000 -- iter: 783/783

The network used (note that dropout has been removed from the LSTM layer):

import tflearn

def get_network_wide(frames, input_size, num_classes):
    """Create a one-layer LSTM network."""
    net = tflearn.input_data(shape=[None, frames, input_size])
    # Dropout removed for this experiment; previously: tflearn.lstm(net, 256, dropout=0.2)
    net = tflearn.lstm(net, 256)
    net = tflearn.fully_connected(net, num_classes, activation='softmax')
    net = tflearn.regression(net, optimizer='adam',
                             loss='categorical_crossentropy',
                             metric='default', name='output1')
    return net
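
For completeness, a hedged usage sketch of how such a network is typically trained with tflearn (the shapes and the X_train/y_train/X_val/y_val variables are illustrative placeholders, not taken from the question):

net = get_network_wide(frames=10, input_size=128, num_classes=5)  # hypothetical dimensions
model = tflearn.DNN(net, tensorboard_verbose=0)
model.fit(X_train, y_train, validation_set=(X_val, y_val),
          show_metric=True, n_epoch=10)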

[Plot: validation loss vs. training loss over epochs]

asked Jun 13 '19 by s0x


1 Answer

This is not necessarily a problematic phenomenon.

It can happen for several reasons, as described below.

  1. It can happen when your training data is harder to learn patterns from, while the validation set happens to contain 'easy' examples to classify. The same applies to LSTM/sequence-classification data.
  2. In the early phase of training, the validation loss can be smaller than the training loss, and the validation accuracy higher than the training accuracy.
  3. During validation, dropout is not enabled, which leads to better results on the validation set.
  4. The training loss is calculated epoch-wise: at the end of an epoch, it is the mean of the batch losses accumulated throughout the epoch. The network keeps learning within the epoch, so by the time it is evaluated on the validation set it already uses everything it learned during that epoch, and the validation result can therefore be better. E.g. training batch losses [0.60, 0.59, ..., 0.30 (loss at the end of the epoch)] vs. validation losses [0.30, 0.29, 0.35], because the model has already trained a lot compared to the start of the epoch. A minimal numeric sketch follows this list.
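
To illustrate point 4 with made-up numbers (a sketch, not output from the question's run):

train_batch_losses = [0.60, 0.50, 0.40, 0.30]  # model improves within the epoch
epoch_train_loss = sum(train_batch_losses) / len(train_batch_losses)  # 0.45
val_loss_at_epoch_end = 0.30  # measured after all of the epoch's weight updates
# 0.45 > 0.30, even though the model never trained on the validation data:
# the epoch average includes the early, poor batches, while validation is
# evaluated with the end-of-epoch weights.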

However, both your training set and your validation set are very small. A 90%/10% train/validation split only makes sense when there is a lot of data (tens or even hundreds of thousands of samples). Here, your entire dataset (train + validation) has fewer than 1000 samples. You need much more data, as LSTMs are well known for requiring a lot of training data.

You could also try k-fold cross-validation, or even stratified k-fold cross-validation. That way you ensure you have not accidentally created a very 'easy' validation set that you always test on; instead, with k folds, k-1 are used for training and 1 for validation, which helps you avoid situation (1). A sketch follows.
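
A minimal sketch of stratified k-fold splitting with scikit-learn (the data shapes, variable names, and number of classes are placeholders, not taken from the question):

import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.random.rand(870, 10, 128)        # placeholder: 870 samples, 10 frames, 128 features
y = np.random.randint(0, 5, size=870)   # placeholder integer labels for 5 classes

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(skf.split(X, y)):
    X_train, X_val = X[train_idx], X[val_idx]
    y_train, y_val = y[train_idx], y[val_idx]
    # Train a fresh model on (X_train, y_train) and evaluate on (X_val, y_val);
    # each sample serves as validation data exactly once across the 5 folds.
    print(f"fold {fold}: {len(train_idx)} train / {len(val_idx)} val")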

Ultimately, the answer lies in the data. Prepare it carefully, because the results depend significantly on its quality (preprocessing, cleaning, and building relevant training/validation/test sets).

answered Sep 19 '22 by Timbus Calin