Validation Loss Much Higher Than Training Loss

Tags:

I'm very new to deep learning models, and trying to train a multiple time series model using LSTM with Keras Sequential. There are 25 observations per year for 50 years = 1250 samples, so not sure if this is even possible to use LSTM for such small data. However, I have thousands of feature variables, not including time lags. I'm trying to predict a sequence of the next 25 time steps of data. The data is normalized between 0 and 1. My problem is that, despite trying many obvious adjustments, I cannot get the LSTM validation loss anywhere close to the training loss (overfitting dramatically, I think).

I have tried adjusting number of nodes per hidden layer (25-375), number of hidden layers (1-3), dropout (0.2-0.8), batch_size (25-375), and train/ test split (90%:10% - 50%-50%). Nothing really makes much of a difference on the validation loss/ training loss disparity.

# SPLIT INTO TRAIN AND TEST SETS
# 25 observations per year; Allocate 5 years (2014-2018) for Testing
n_test = 5 * 25
test = values[:n_test, :]
train = values[n_test:, :]

# split into input and outputs
train_X, train_y = train[:, :-25], train[:, -25:]
test_X, test_y = test[:, :-25], test[:, -25:]

# reshape input to be 3D [samples, timesteps, features]
train_X = train_X.reshape((train_X.shape[0], 5, newdf.shape[1]))
test_X = test_X.reshape((test_X.shape[0], 5, newdf.shape[1]))
print(train_X.shape, train_y.shape, test_X.shape, test_y.shape)


# design network
model = Sequential()
model.add(Masking(mask_value=-99, input_shape=(train_X.shape[1], train_X.shape[2])))
model.add(LSTM(375, return_sequences=True))
model.add(Dropout(0.8))
model.add(LSTM(125, return_sequences=True))
model.add(Dropout(0.8))
model.add(LSTM(25))
model.add(Dense(25))
model.compile(loss='mse', optimizer='adam')
# fit network
history = model.fit(train_X, train_y, epochs=20, batch_size=25, validation_data=(test_X, test_y), verbose=2, shuffle=False)

Epoch 19/20

14s - loss: 0.0512 - val_loss: 188.9568

Epoch 20/20

14s - loss: 0.0510 - val_loss: 188.9537

link to plot of Val Loss / Train Loss

I assume I must be doing something obvious wrong, but can't realize it since I'm a newbie. I am hoping to either get some useful validation loss achieved (compared to training), or know that my data observations are simply not large enough for useful LSTM modeling. Any help or suggestions is much appreciated, thanks!

753

asked Mar 31 '19 19:03

Michael Bolton

1 Answers

Overfitting

In general, if you're seeing much higher validation loss than training loss, then it's a sign that your model is overfitting - it learns "superstitions" i.e. patterns that accidentally happened to be true in your training data but don't have a basis in reality, and thus aren't true in your validation data.

It's generally a sign that you have a "too powerful" model, too many parameters that are capable of memorizing the limited amount of training data. In your particular model you're trying to learn almost a million parameters (try printing model.summary()) from a thousand datapoints - that's not reasonable, learning can extract/compress information from data, not create it out of thin air.

What's the expected result?

The first question you should ask (and answer!) before building a model is about the expected accuracy. You should have a reasonable lower bound (what's a trivial baseline? For time series prediction, e.g. linear regression might be one) and an upper bound (what could an expert human predict given the same input data and nothing else?).

Much depends on the nature of the problem. You really have to ask, is this information sufficient to get a good answer? For many real life time problems with time series prediction, the answer is no - the future state of such a system depends on many variables that can't be determined by simply looking at historical measurements - to reasonably predict the next value, you need to bring in lots of external data other than the historical prices. There's a classic quote by Tukey: "The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data."

answered Nov 15 '22 10:11

Peteris

Related questions
                            
                                Does CrossValidator in PySpark distribute the execution?
                            
                                Machine learning - normalizing features with no theoretical maximum value
                            
                                ValueError: X.shape[1] = 15 should be equal to 700, the number of features at training time
                            
                                NLP - Embeddings selection of `start` and `end` of sentence tokens
                            
                                Why does this neural network learn nothing?
                            
                                Training GAN on small dataset of images
                            
                                Keras - model.predict return classes and not probabilities
                            
                                Log Loss function in pyspark
                            
                                Keyword/keyphrase extraction from text [closed]
                            
                                Get the positive and negative words from a Textblob based on its polarity in Python (Sentimental analysis)
                            
                                Selecting a Specific Number of Features via Sklearn's RFECV (Recursive Feature Elimination with Cross-validation)
                            
                                Collaborative Filtering adding new users and items
                            
                                Tensorflow adam optimizer in Keras
                            
                                Targeting a specific metric to optimize in tensorflow
                            
                                Sigmoid function returns 1 for large positive inputs
                            
                                predict_proba() method of Keras model does not exist
                            
                                How to get text objects to work with sklearn classifier pipeline?
                            
                                How to use K.get_session in Tensorflow 2.0 or how to migrate it?
                            
                                How does data normalization work in keras during prediction?
                            
                                What is a weak learner?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Validation Loss Much Higher Than Training Loss

Tags:

machine-learning

deep-learning

keras

lstm