Validation loss is lower than training loss training LSTM

Tags:

I am training a LSTM using tf.learn in tensorflow. I have split the data into training (90%) and validation (10%) for this purpose. As I understand, a model usually fits better training data than validation data but I am getting the opposite results. Loss is lower and accuracy is higher for validation set.

As I have read in other answers, this can be because of dropout not being applied during validation. However, when I remove dropout from my LSTM architecture I validation loss is still lower than training loss (difference is smaller though).

Also, the loss shown at the end of each epoch is not an average of the losses over each batch (like when using Keras). It is the loss for he last batch. I also thought this could be a reason for my results but turned out it was not.

Training samples: 783
Validation samples: 87
--
Training Step: 4  | total loss: 1.08214 | time: 1.327s
| Adam | epoch: 001 | loss: 1.08214 - acc: 0.7549 | val_loss: 0.53043 - val_acc: 0.9885 -- iter: 783/783
--
Training Step: 8  | total loss: 0.41462 | time: 1.117s
| Adam | epoch: 002 | loss: 0.41462 - acc: 0.9759 | val_loss: 0.17027 - val_acc: 1.0000 -- iter: 783/783
--
Training Step: 12  | total loss: 0.15111 | time: 1.124s
| Adam | epoch: 003 | loss: 0.15111 - acc: 0.9984 | val_loss: 0.07488 - val_acc: 1.0000 -- iter: 783/783
--
Training Step: 16  | total loss: 0.10145 | time: 1.114s
| Adam | epoch: 004 | loss: 0.10145 - acc: 0.9950 | val_loss: 0.04173 - val_acc: 1.0000 -- iter: 783/783
--
Training Step: 20  | total loss: 0.26568 | time: 1.124s
| Adam | epoch: 005 | loss: 0.26568 - acc: 0.9615 | val_loss: 0.03077 - val_acc: 1.0000 -- iter: 783/783
--
Training Step: 24  | total loss: 0.11023 | time: 1.129s
| Adam | epoch: 006 | loss: 0.11023 - acc: 0.9863 | val_loss: 0.02607 - val_acc: 1.0000 -- iter: 783/783
--
Training Step: 28  | total loss: 0.07059 | time: 1.141s
| Adam | epoch: 007 | loss: 0.07059 - acc: 0.9934 | val_loss: 0.01882 - val_acc: 1.0000 -- iter: 783/783
--
Training Step: 32  | total loss: 0.03571 | time: 1.122s
| Adam | epoch: 008 | loss: 0.03571 - acc: 0.9977 | val_loss: 0.01524 - val_acc: 1.0000 -- iter: 783/783
--
Training Step: 36  | total loss: 0.05084 | time: 1.120s
| Adam | epoch: 009 | loss: 0.05084 - acc: 0.9948 | val_loss: 0.01384 - val_acc: 1.0000 -- iter: 783/783
--
Training Step: 40  | total loss: 0.22283 | time: 1.132s
| Adam | epoch: 010 | loss: 0.22283 - acc: 0.9714 | val_loss: 0.01227 - val_acc: 1.0000 -- iter: 783/783

The network used (note that dropout has been commented out):

def get_network_wide(frames, input_size, num_classes):
    """Create a one-layer LSTM"""
    net = tflearn.input_data(shape=[None, frames, input_size])
    #net = tflearn.lstm(net, 256, dropout=0.2)
    net = tflearn.fully_connected(net, num_classes, activation='softmax')
    net = tflearn.regression(net, optimizer='adam',
                             loss='categorical_crossentropy',metric='default', name='output1')
    return net

Plot of validation loss vs training loss

944

asked Jun 13 '19 07:06

s0x

1 Answers

This is not necessarily a problematic phenomenon in essence.

It can take place due to many reasons, as stated below.

It may happen usually when your training data is harder to train on/learn patters on it, while the validation set boasts 'easy' images/data to classify on. The same is available for LSTM/sequence classification data.
It may happen in the early phase of the training that the validation loss is smaller than the training loss + validation accuracy is bigger than the training accuracy.
During validation, dropout is not enabled, leading to greater results on the validation set.
The loss on training is calculated epoch-wise. At the end of an epoch, it's the mean of batch losses (accumulated) throughout the epoch. The network learned patterns/relationships between&within the data, and when testing it against the validation, it uses the information already learned and it follows that the results can be better on validation E.g. TLosses [0.60,0.59,...0.3 (loss on TS at the end of the epoch)] -> VLosses [0.3,0.29,0.35] (because the model has already trained a lot as compared to the start of the epoch.

However, your training set is very small, as well as your validation set. Such a split (90% on train and 10% on validation/development) should be made only when there is very much data (tens of thousands or even hundreds of thousands in this case). On the other hand, your entire training set(train + val) has less than 1000 samples. You need much more data, as LSTMs are well-known for requiring a lot of training data.

Then, you could try using KFoldCrossValidation or even StratifiedKFoldCrossValidation. In this way, you would ensure that you have not manually created a very 'easy' validation set, on which you always test; instead, you can have k-folds, out of which k-1 are used for training and 1 for validation; in this way you may avoid situation (1).

The answer lies in the data. Prepare it carefully, as the results depend significantly on the quality of data (preprocessing data, cleaning data, creating relevant training/validation/test sets).

129

answered Sep 19 '22 09:09

Timbus Calin

Related questions
                            
                                V-validate on dynamic custom component
                            
                                Using Open ID Connect with Server Side Blazor
                            
                                How to perform multiple row-wise operations with dependency with previous rows using [r] data.table (if possible)
                            
                                Using flutter mobile packages in flutter web
                            
                                .dockerignore fails to include files in subdirectories with !**/*.extension pattern
                            
                                Transform a flat list to domain objects with child objects using java streams
                            
                                gremlin python - add multiple but an unknown number of properties to a vertex
                            
                                git push using ssh-agent in bash but not gui
                            
                                Get closest point in collider
                            
                                Why does calling `make-instance` in `let` work differently?
                            
                                How to redirect without flickering in React using React Router?
                            
                                How to use filter conditions on SHOW PARTITIONS clause on hive?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Validation loss is lower than training loss training LSTM

Tags:

s0x

People also ask

1 Answers

Timbus Calin

Recent Activity

Donate For Us