I just started playing with LSTMs in Keras and I find the possibility of learning the behaviour of time series very fascinating. I've read several tutorials and articles online, most of them showing impressive capabilities in predicting time series, so I gave it a go. The first thing I noticed is that all the articles I've found use the validation data in a very unfair way. My idea of predicting a time series is to build a model on the training data and then use the last N elements of the training data to estimate the future behaviour of the series. To do that, the model has to use its own predictions as inputs to step forward into the future.
What I've seen people do instead is estimate the accuracy on the test set at every future time step while feeding the ground truth as input for each estimate. That is very unfair, because it does not produce a real prediction!
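To make the distinction concrete, here is a minimal sketch of the two evaluation schemes (hypothetical: it assumes an already fitted Keras model called model and the test windows x_test of shape (n, look_back, 1) built by the code below):
import numpy as np

# Teacher-forced, one-step-ahead evaluation: every input window comes from the
# ground truth, so the model never has to live with its own mistakes.
one_step = model.predict(x_test)

# Recursive ("rolling") prediction: the model's own output is appended to the
# window and fed back in, which is what a real multi-step forecast requires.
window = x_test[0]                                   # last known ground-truth window, shape (look_back, 1)
rolled = []
for _ in range(len(x_test)):
    next_val = model.predict(window[None, ...])[0]   # shape (1,)
    rolled.append(next_val)
    window = np.concatenate([window[1:], next_val[None, :]], axis=0)
rolled = np.array(rolled)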
I tried to code my own LSTM prediction in Keras (please find the code below), and I started with a relatively simple case, a combination of a parabola and a sinusoid. Unfortunately, the results are quite unsatisfying. Here are a few examples, obtained by changing the parameters of the network:
Do you have any suggestions to get better results? How can LSTMs predict complex behaviours if they can't predict such a "simple" signal?
Thank you, Alessandro
import os
import numpy as np
from matplotlib import pyplot as plt
import keras

# Number of time steps to consider in the input window
look_back = 50
N_datapoints = 2000
train_split = 0.8

# Generate a time signal composed of a parabola and a sinusoid
t = np.linspace(0, 200, N_datapoints)
y = t**2 + np.sin(t*2)*1000
y -= y.mean()
y /= y.std()
plt.plot(y)

# Reshape the signal into fixed windows for training
def create_blocks(y, look_back=1):
    x_data, y_data = [], []
    for i in range(0, len(y)-look_back-1):
        x_data.append(y[i:i+look_back])
        y_data.append(y[i+look_back])
    return np.array(x_data), np.array(y_data)

x_data, y_data = create_blocks(y, look_back)

# Split the data into training and testing sets
N_train = int(x_data.shape[0]*train_split)
x_train = x_data[:N_train, :, None]
y_train = y_data[:N_train]
x_test = x_data[N_train:-1, :, None]
y_test = y_data[N_train:-1]

# Time vectors for train and test (just for plotting)
t_train = t[0:N_train-1, None]
t_test = t[N_train:-1, None]

# Network
from keras import Model, Input
from keras.layers import LSTM, Dense, Activation, BatchNormalization, Dropout

inputs = Input(shape=(look_back, 1))
net = LSTM(32, return_sequences=False)(inputs)
net = Dense(32)(net)
net = Dropout(0.25)(net)
outputs = Dense(1)(net)
model = Model(inputs=inputs, outputs=outputs)
model.compile(optimizer=keras.optimizers.RMSprop(), loss='mean_squared_error')
model.summary()

# Callback that rolls the model forward over the test range and plots the result
from keras.callbacks import Callback

class PlotResults(Callback):
    def on_train_begin(self, logs=None):
        self.fig = plt.figure()

    def save_data(self, x_test, y, look_back, t_test):
        self.x_test = x_test
        self.y = y
        self.t_test = t_test
        self.look_back = look_back

    def on_epoch_end(self, epoch, logs=None):
        if epoch % 20 == 0:
            plt.clf()
            # Start from the first ground-truth test window, then feed the
            # model its own predictions to step forward in time
            y_pred = self.x_test[0, ...]
            for i in range(len(self.x_test)+1):
                new_prediction = model.predict(y_pred[None, -self.look_back:, :])
                y_pred = np.concatenate((y_pred, new_prediction), axis=0)
            plt.plot(t, y, label='GT')
            plt.plot(self.t_test, y_pred, '.-', label='Predicted')
            plt.legend()
            plt.pause(0.01)
            plt.savefig('lstm_%d.png' % epoch)

plot_results = PlotResults()
plot_results.save_data(x_test, y, look_back, t_test)

model.fit(x_train, y_train, validation_data=(x_test, y_test),
          epochs=100000, batch_size=32, callbacks=[plot_results])
As already shown in Primusa's answer, it is helpful to let the recurrent layer output its hidden state at every time step with return_sequences=True, together with a Bidirectional layer, which has been shown to capture temporal patterns better. Additionally, I would argue that you need some intuition about the kind of function you are trying to approximate: decomposing it into a number of simpler functions and constructing a sub-network for each usually speeds up the learning process, especially when using appropriate activation combinations. Applying weight regularization is also relevant, as it may stop the extreme divergence caused by error accumulation. Note also that unless you use stateful=True, you will need to provide the network with a time window long enough to expose long-range patterns (e.g. the parabola is easy to mistake for a line if the window is small).
Concretely, the alterations below achieve a (still rapidly decreasing) MSE of (1.0223e-04 / 0.0015) after 20 epochs and (2.8111e-05 / 3.0393e-04) after 100 epochs, with a lookback of 100 (note that I have also changed your optimizer to Adam, which I simply prefer):
from keras import Model, Input
from keras.layers import (LSTM, Dense, Activation, BatchNormalization,
                          Dropout, Bidirectional, Add)

inputs = Input(shape=(look_back, 1))

bd_seq = Bidirectional(LSTM(128, return_sequences=True,
                            kernel_regularizer='l2'),
                       merge_mode='sum')(inputs)
bd_sin = Bidirectional(LSTM(32, return_sequences=True,
                            kernel_regularizer='l2'),
                       merge_mode='sum')(bd_seq)

bd_1 = Bidirectional(LSTM(1, activation='linear'),
                     merge_mode='sum')(bd_seq)
bd_2 = Bidirectional(LSTM(1, activation='tanh'),
                     merge_mode='sum')(bd_sin)
output = Add()([bd_1, bd_2])

model = Model(inputs=inputs, outputs=output)
model.compile(optimizer='adam', loss='mean_squared_error')
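As a side note, since stateful=True was mentioned above as an alternative to long input windows but is not used in this model, here is a minimal, hypothetical sketch of what a stateful setup could look like (the batch size and epoch count are illustrative assumptions; look_back, x_train and y_train are reused from the question):
from keras import Model, Input
from keras.layers import LSTM, Dense

# A stateful LSTM keeps its hidden state across batches, so it can track
# long-range structure (such as the parabolic trend) even with short windows.
# The batch size must be fixed and the samples must not be shuffled.
batch_size = 32
stateful_inputs = Input(batch_shape=(batch_size, look_back, 1))
net = LSTM(32, stateful=True)(stateful_inputs)
stateful_outputs = Dense(1)(net)
stateful_model = Model(inputs=stateful_inputs, outputs=stateful_outputs)
stateful_model.compile(optimizer='adam', loss='mean_squared_error')

# Truncate the training set to a multiple of batch_size and reset the internal
# state manually at every epoch boundary.
n = (len(x_train) // batch_size) * batch_size
for epoch in range(10):
    stateful_model.fit(x_train[:n], y_train[:n], batch_size=batch_size,
                       epochs=1, shuffle=False)
    stateful_model.reset_states()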
While neural networks are very complex and powerful, they aren't a magic box; often you need to fine-tune your network to get better results.
I tweaked your model and I got these results:
While they are by no means extremely accurate, I would say they are much better than the results you posted in your question. This network has a clear sense of the frequency of the wave, but needs a bit more work on capturing the general trend of the line. You can see how its ability to predict worsened as it approached the maximum of the curve.
The model I used was:
from keras.models import Sequential
from keras.layers import LSTM, Dense, Bidirectional

model = Sequential()
model.add(Bidirectional(LSTM(8, return_sequences=True), input_shape=(50, 1)))
model.add(LSTM(8, return_sequences=True))
model.add(LSTM(4, return_sequences=False))
model.add(Dense(1))
model.compile(optimizer='adam', loss='mse')
I shortened your look_back period from 100 to 50 to reduce the training time, and trained the model for fifty epochs with a batch size of 5:
model.fit(x_train, y_train, epochs=50, batch_size=5)
This took about 15 minutes on my laptop (training on CPU, not GPU).
The main trick I used to boost its accuracy was the Bidirectional LSTM, which involves two LSTMs: one is fed the sequences forwards and the other is fed them backwards. The idea is to use future data to understand the context of the curve.
Do note that future data is only used during training. During actual prediction, only previous data is used to predict the next point, and I used your "rolling predictions" idea as well, where the predictions are later fed back in as inputs.
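For reference, here is a minimal, hypothetical sketch of what the Bidirectional wrapper does internally (an illustration only, not code from the answer; the layer size matches the model above):
from keras import Input
from keras.layers import LSTM, Bidirectional

inputs = Input(shape=(50, 1))

# One LSTM reads the window forwards, a second reads it backwards
# (go_backwards=True). Bidirectional builds such a pair itself, re-reverses the
# backward output so the time steps line up, and merges the two outputs
# (concatenation by default, or a sum with merge_mode='sum').
fwd = LSTM(8, return_sequences=True)(inputs)
bwd = LSTM(8, return_sequences=True, go_backwards=True)(inputs)
bidir = Bidirectional(LSTM(8, return_sequences=True))(inputs)  # ~ concat of fwd and time-reversed bwd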