I am training an LSTM using tf.learn (TFLearn) in TensorFlow. I have split the data into training (90%) and validation (10%) sets. As I understand it, a model usually fits the training data better than the validation data, but I am getting the opposite result: the loss is lower and the accuracy is higher on the validation set.
As I have read in other answers, this can happen because dropout is not applied during validation. However, when I remove dropout from my LSTM architecture, the validation loss is still lower than the training loss (although the difference is smaller).
Also, the loss shown at the end of each epoch is not an average of the losses over the batches (as it is in Keras); it is the loss for the last batch. I thought this could also explain my results, but it turned out it was not the cause.
Training samples: 783
Validation samples: 87
--
Training Step: 4 | total loss: 1.08214 | time: 1.327s
| Adam | epoch: 001 | loss: 1.08214 - acc: 0.7549 | val_loss: 0.53043 - val_acc: 0.9885 -- iter: 783/783
--
Training Step: 8 | total loss: 0.41462 | time: 1.117s
| Adam | epoch: 002 | loss: 0.41462 - acc: 0.9759 | val_loss: 0.17027 - val_acc: 1.0000 -- iter: 783/783
--
Training Step: 12 | total loss: 0.15111 | time: 1.124s
| Adam | epoch: 003 | loss: 0.15111 - acc: 0.9984 | val_loss: 0.07488 - val_acc: 1.0000 -- iter: 783/783
--
Training Step: 16 | total loss: 0.10145 | time: 1.114s
| Adam | epoch: 004 | loss: 0.10145 - acc: 0.9950 | val_loss: 0.04173 - val_acc: 1.0000 -- iter: 783/783
--
Training Step: 20 | total loss: 0.26568 | time: 1.124s
| Adam | epoch: 005 | loss: 0.26568 - acc: 0.9615 | val_loss: 0.03077 - val_acc: 1.0000 -- iter: 783/783
--
Training Step: 24 | total loss: 0.11023 | time: 1.129s
| Adam | epoch: 006 | loss: 0.11023 - acc: 0.9863 | val_loss: 0.02607 - val_acc: 1.0000 -- iter: 783/783
--
Training Step: 28 | total loss: 0.07059 | time: 1.141s
| Adam | epoch: 007 | loss: 0.07059 - acc: 0.9934 | val_loss: 0.01882 - val_acc: 1.0000 -- iter: 783/783
--
Training Step: 32 | total loss: 0.03571 | time: 1.122s
| Adam | epoch: 008 | loss: 0.03571 - acc: 0.9977 | val_loss: 0.01524 - val_acc: 1.0000 -- iter: 783/783
--
Training Step: 36 | total loss: 0.05084 | time: 1.120s
| Adam | epoch: 009 | loss: 0.05084 - acc: 0.9948 | val_loss: 0.01384 - val_acc: 1.0000 -- iter: 783/783
--
Training Step: 40 | total loss: 0.22283 | time: 1.132s
| Adam | epoch: 010 | loss: 0.22283 - acc: 0.9714 | val_loss: 0.01227 - val_acc: 1.0000 -- iter: 783/783
The network used (note that dropout has been commented out):
import tflearn

def get_network_wide(frames, input_size, num_classes):
    """Create a one-layer LSTM."""
    net = tflearn.input_data(shape=[None, frames, input_size])
    # net = tflearn.lstm(net, 256, dropout=0.2)
    net = tflearn.lstm(net, 256)  # LSTM layer without dropout
    net = tflearn.fully_connected(net, num_classes, activation='softmax')
    net = tflearn.regression(net, optimizer='adam',
                             loss='categorical_crossentropy',
                             metric='default', name='output1')
    return net
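The training call itself is not shown in the question; below is a minimal sketch of how such a TFLearn network is typically trained against an explicit validation set. The array names (X_train, y_train, X_val, y_val) and the hyperparameters are placeholders, and batch_size=256 is only a guess based on the 4 training steps per epoch visible in the log.

import tflearn

# Placeholder data, assumed to match the network's input shape:
#   X_train: (783, frames, input_size), y_train: one-hot (783, num_classes)
#   X_val:   (87,  frames, input_size), y_val:   one-hot (87,  num_classes)
net = get_network_wide(frames, input_size, num_classes)
model = tflearn.DNN(net, tensorboard_verbose=0)
model.fit(X_train, y_train,
          validation_set=(X_val, y_val),   # the 90/10 split described above
          n_epoch=10, batch_size=256,
          show_metric=True, snapshot_epoch=True)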
Plot of validation loss vs training loss
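The plot itself is not reproduced here; the following minimal sketch (assuming matplotlib is available) rebuilds it from the per-epoch values in the training log above.

import matplotlib.pyplot as plt

# Per-epoch values copied from the training log above
# (the training loss is the last-batch loss that TFLearn reports).
epochs = range(1, 11)
train_loss = [1.08214, 0.41462, 0.15111, 0.10145, 0.26568,
              0.11023, 0.07059, 0.03571, 0.05084, 0.22283]
val_loss = [0.53043, 0.17027, 0.07488, 0.04173, 0.03077,
            0.02607, 0.01882, 0.01524, 0.01384, 0.01227]

plt.plot(epochs, train_loss, label='training loss')
plt.plot(epochs, val_loss, label='validation loss')
plt.xlabel('epoch')
plt.ylabel('loss')
plt.legend()
plt.show()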
At times, the validation loss is greater than the training loss. When both losses also remain high, this may indicate that the model is underfitting. Underfitting occurs when the model is unable to accurately model even the training data, and hence produces large errors.
If your model's accuracy on your testing data is lower than your training or validation accuracy, it usually indicates that there are meaningful differences between the kind of data you trained the model on and the testing data you're providing for evaluation.
The smaller the loss, the better a job the classifier is doing at modeling the relationship between the input data and the output targets. That said, there is a point where we can overfit the model: by modeling the training data too closely, it loses the ability to generalize.
This is not necessarily a problematic phenomenon in itself. It can happen for several reasons, as outlined below.
One reason is the way the loss is reported. The per-batch training losses within an epoch might look like [0.60, 0.59, ..., 0.30] (0.30 being the loss on the training set at the end of the epoch), while the validation losses measured at the end of each epoch come out as [0.30, 0.29, 0.35]: by the time validation is run, the model has already trained a lot compared to the start of the epoch.
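To make this concrete, here is a small illustrative sketch (not from the original answer, with made-up numbers) contrasting the last-batch loss that TFLearn reports with the epoch average that Keras would report.

import numpy as np

# Hypothetical per-batch training losses over one epoch: the model improves
# as the epoch progresses.
batch_losses = np.array([0.60, 0.45, 0.38, 0.30])

last_batch_loss = batch_losses[-1]        # what TFLearn prints at the end of the epoch
epoch_average_loss = batch_losses.mean()  # what Keras would print instead

print(f"last batch: {last_batch_loss:.2f}, epoch average: {epoch_average_loss:.2f}")
# The validation loss is computed only after the epoch finishes, i.e. against the
# most-trained state of the model, so it can easily be lower than either number.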
Another reason is that your training set is very small, and so is your validation set. A 90%/10% train/validation split only makes sense when there is plenty of data (tens or even hundreds of thousands of samples). Here, the entire dataset (train + validation) has fewer than 1000 samples. You need much more data, as LSTMs are well known for requiring a lot of training data.
Then, you could try k-fold cross-validation, or even stratified k-fold cross-validation. This ensures that you have not accidentally created a very 'easy' validation set that you always test on; instead, you get k folds, of which k-1 are used for training and 1 for validation, so no single lucky split can dominate the results.
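A minimal sketch of such a stratified k-fold split using scikit-learn follows; it is not part of the original answer, and X and y_onehot are placeholder names for the full feature array and one-hot labels.

import numpy as np
from sklearn.model_selection import StratifiedKFold

# Placeholder data: X of shape (samples, frames, input_size),
# y_onehot of shape (samples, num_classes).
y_labels = np.argmax(y_onehot, axis=1)  # StratifiedKFold needs integer class labels

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(skf.split(X, y_labels)):
    X_train, X_val = X[train_idx], X[val_idx]
    y_train, y_val = y_onehot[train_idx], y_onehot[val_idx]
    # Build a fresh model for each fold and train as usual, e.g.
    # model.fit(X_train, y_train, validation_set=(X_val, y_val), ...)
    print(f"fold {fold}: {len(train_idx)} train / {len(val_idx)} val samples")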
Finally, the answer lies in the data. Prepare it carefully, as the results depend heavily on its quality (preprocessing, cleaning, and building representative training/validation/test sets).