Preamble
I am currently working on a machine learning problem where we are tasked with using past data on product sales to predict sales volumes going forward (so that shops can better plan their stocks). We essentially have time series data, where for each product we know how many units were sold on which days. We also have information like what the weather was like, whether there was a public holiday, whether any of the products were on sale, etc.
We've been able to model this with some success using an MLP with dense layers, and just using a sliding window approach to include sales volumes from the surrounding days. However, we believe we'll be able to get much better results with a time-series approach such as an LSTM.
Data
The data we have is essentially as follows:
(EDIT: for clarity, the "Time" column in the picture above is not correct. We have inputs once per day, not once per month, but otherwise the structure is the same!)
So the X data is of shape:
(numProducts, numTimesteps, numFeatures) = (50 products, 1096 days, 90 features)
And the Y data is of shape:
(numProducts, numTimesteps, numTargets) = (50 products, 1096 days, 3 binary targets)
So we have data for three years (2014, 2015, 2016) and want to train on this in order to make predictions for 2017. (That's of course not 100% true, since we actually have data up to Oct 2017, but let's just ignore that for now)
Problem
I would like to build an LSTM in Keras that allows me to make these predictions. There are a few places where I am getting stuck, though, so I have six concrete questions (I know one is supposed to limit a Stack Overflow post to one question, but these are all intertwined).
Firstly, how would I slice up my data for the batches? Since I have three full years, does it make sense to simply push through three batches, each of size one year? Or does it make more sense to make smaller batches (say 30 days) and also to use sliding windows? I.e. instead of 36 batches of 30 days each, I use 36 * 6 batches of 30 days each, each time sliding by 5 days? Or is this not really the way LSTMs should be used? (Note that there is quite a bit of seasonality in the data, so I need to catch that kind of long-term trend as well.)
Secondly, does it make sense to use return_sequences=True here? In other words, I keep my Y data as is (50, 1096, 3), so that (as far as I've understood it) there is a prediction at every time step for which a loss can be calculated against the target data? Or would I be better off with return_sequences=False, so that only the final value of each batch is used to evaluate the loss (i.e. if using yearly batches, then in 2016 for product 1, we evaluate against the Dec 2016 value of (1,1,1)).
Thirdly, how should I deal with the 50 different products? They are different, but still strongly correlated, and we've seen with other approaches (for example an MLP with simple time-windows) that the results are better when all products are considered in the same model. Some ideas that are currently on the table are:
- change the target variable to be not just 3 variables, but 3 * 50 = 150; i.e. for each product there are three targets, all of which are trained simultaneously.
- split up the results after the LSTM layer into 50 dense networks, which take as input the outputs from the LSTM, plus some features that are specific to each product - i.e. we get a multi-task network with 50 loss functions, which we then optimise together. Would that be crazy?
- consider a product as a single observation, and include product-specific features already at the LSTM layer. Use just this one layer followed by an output layer of size 3 (for the three targets). Push through each product in a separate batch.
Fourthly, how do I deal with validation data? Normally I would just keep out a randomly selected sample to validate against, but here we need to keep the time ordering in place. So I guess the best is to just keep a few months aside?
Fifthly, and this is the part that is probably the most unclear to me - how can I use the actual results to perform predictions? Let's say I used return_sequences=False and I trained on all three years in three batches (each time up to Nov) with the goal of training the model to predict the next value (Dec 2014, Dec 2015, Dec 2016). If I want to use these results in 2017, how does this actually work? If I understood it correctly, the only thing I can do in this instance is to then feed the model all the data points for Jan to Nov 2017 and it will give me back a prediction for Dec 2017. Is that correct? However, if I were to use return_sequences=True, then trained on all data up to Dec 2016, would I then be able to get a prediction for Jan 2017 just by giving the model the features observed at Jan 2017? Or do I need to also give it the 12 months before Jan 2017? What about Feb 2017 - do I in addition need to give the value for Jan 2017, plus a further 11 months before that? (If it sounds like I'm confused, it's because I am!)
Lastly, depending on what structure I should use, how do I do this in Keras? What I have in mind at the moment is something along the following lines (though this would be for only one product, so it doesn't solve having all products in the same model):
Keras code
    from keras.models import Sequential
    from keras.layers import LSTM, Dense

    trainX = trainingDataReshaped    # Data for Product 1, Jan 2014 to Dec 2016
    trainY = trainingTargetReshaped
    validX = validDataReshaped       # Data for Product 1, for ??? Maybe for a few months?
    validY = validTargetReshaped

    numSequences = trainX.shape[0]
    numTimeSteps = trainX.shape[1]
    numFeatures = trainX.shape[2]
    numTargets = trainY.shape[2]

    model = Sequential()
    model.add(LSTM(100, input_shape=(None, numFeatures), return_sequences=True))
    model.add(Dense(numTargets, activation="softmax"))
    model.compile(loss=stackEntry.params["loss"], optimizer="adam", metrics=['accuracy'])

    history = model.fit(trainX, trainY, batch_size=30, epochs=20, verbose=1,
                        validation_data=(validX, validY))

    predictX = predictionDataReshaped    # Data for Product 1, Jan 2017 to Dec 2017
    prediction = model.predict(predictX)
So:
Firstly, how would I slice up my data for the batches? Since I have three full years, does it make sense to simply push through three batches, each of size one year? Or does it make more sense to make smaller batches (say 30 days) and also to use sliding windows? I.e. instead of 36 batches of 30 days each, I use 36 * 6 batches of 30 days each, each time sliding by 5 days? Or is this not really the way LSTMs should be used? (Note that there is quite a bit of seasonality in the data, so I need to catch that kind of long-term trend as well.)
Honestly - modeling such data is really hard. First of all, I wouldn't advise you to use LSTMs, as they are designed to capture a somewhat different kind of data (e.g. NLP or speech, where modeling long-term dependencies is really important - not seasonality), and they need a lot of data to train. I would rather advise you to use either GRU or SimpleRNN, which are easier to train and should work better for your task.
When it comes to batching - I would definitely advise you to use a fixed-window technique, as it produces many more data points than feeding in a whole year or a whole month at once. Set the number of days as a meta-parameter, which you optimize by trying different values during training and choosing the most suitable one.
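For illustration, here is a minimal sketch of such fixed-window slicing, assuming X and Y are numpy arrays shaped (numProducts, numTimesteps, ...) as described in the question; the window length and stride are the hypothetical meta-parameters you would tune:

    import numpy as np

    def make_windows(X, Y, window=30, stride=5):
        # Slice overlapping fixed-length windows along the time axis and stack
        # them on the sample axis, so each window becomes a separate training sample.
        xs, ys = [], []
        for start in range(0, X.shape[1] - window + 1, stride):
            xs.append(X[:, start:start + window, :])
            ys.append(Y[:, start:start + window, :])
        return np.concatenate(xs, axis=0), np.concatenate(ys, axis=0)

    # e.g. windowedX, windowedY = make_windows(X, Y, window=30, stride=5)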
When it comes to seasonality - of course, that is the case here, but:
What I advise you to do instead is:
Secondly, does it make sense to use return_sequences=True here? In other words, I keep my Y data as is (50, 1096, 3) so that (as far as I've understood it) there is a prediction at every time step for which a loss can be calculated against the target data? Or would I be better off with return_sequences=False, so that only the final value of each batch is used to evaluate the loss (i.e. if using yearly batches, then in 2016 for product 1, we evaluate against the Dec 2016 value of (1,1,1)).
Using return_sequences=True might be useful, but only in the following cases:
- when the LSTM (or another recurrent layer) will be followed by yet another recurrent layer,
- when you want the network to produce a prediction (and therefore a loss) at every time step, so that it effectively learns from many time windows at once.
The way described in the second point might be an interesting approach, but keep in mind that it might be a bit hard to implement, as you will need to rewrite your model in order to use it for prediction. What might also be harder is that you'll need to test your model against many types of time instabilities - and such an approach might make that totally infeasible.
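As a rough sketch of both cases (layer sizes are arbitrary, numFeatures and numTargets are assumed to be defined as in the question's code, and the sigmoid/binary_crossentropy pairing is only an assumption based on the three targets being described as binary):

    from keras.models import Sequential
    from keras.layers import LSTM, TimeDistributed, Dense

    model = Sequential()
    # stacked recurrent layers: every layer except (possibly) the last needs return_sequences=True
    model.add(LSTM(64, input_shape=(None, numFeatures), return_sequences=True))
    # keeping return_sequences=True on the last LSTM as well yields one output per time step...
    model.add(LSTM(64, return_sequences=True))
    # ...which a TimeDistributed dense layer turns into a per-day prediction
    model.add(TimeDistributed(Dense(numTargets, activation="sigmoid")))
    model.compile(loss="binary_crossentropy", optimizer="adam")
    # Y must then keep its per-timestep shape, e.g. (samples, timesteps, 3)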
Thirdly how should I deal with the 50 different products? They are different, but still strongly correlated and we've seen with other approaches (for example an MLP with simple time-windows) that the results are better when all products are considered in the same model. Some ideas that are currently on the table are:
- change the target variable to be not just 3 variables, but 3 * 50 = 150; i.e. for each product there are three targets, all of which are trained simultaneously.
- split up the results after the LSTM layer into 50 dense networks, which take as input the outputs from the LSTM, plus some features that are specific to each product - i.e. we get a multi-task network with 50 loss functions, which we then optimise together. Would that be crazy?
- consider a product as a single observation, and include product-specific features already at the LSTM layer. Use just this one layer followed by an output layer of size 3 (for the three targets). Push through each product in a separate batch.
I would definitely go for the first choice, but before providing a detailed explanation I will discuss the disadvantages of the 2nd and 3rd ones:
Before getting to my choice, let's discuss yet another issue: redundancies in your dataset. I guess that your features fall into (at least) two groups: product-specific features (say there are m of them per product) and general features that are the same for all products (say there are n of them).
Right now you have a table of shape (timesteps, m + n, products). I would transform it into a table of shape (timesteps, products * m + n), since the general features are identical for all products and only need to appear once. This will save you a lot of memory and also make it feasible to feed to a recurrent network (keep in mind that recurrent layers in keras have only one feature dimension, whereas you had two: a product dimension and a feature dimension).
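A small sketch of that transformation, assuming (hypothetically) that each product's feature vector consists of m product-specific columns followed by n general columns that are identical for every product:

    import numpy as np

    m, n = 60, 30                              # made-up split of the 90 features
    # X assumed shaped (numProducts, numTimesteps, m + n)
    product_specific = X[:, :, :m]             # (products, timesteps, m)
    general = X[0, :, m:]                      # (timesteps, n) - identical for all products

    # put the product-specific features side by side, then append the shared block once
    flat = np.transpose(product_specific, (1, 0, 2)).reshape(X.shape[1], -1)
    combined = np.concatenate([flat, general], axis=1)   # (timesteps, products * m + n)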
So why is the first approach the best in my opinion? Because it takes advantage of many interesting dependencies in the data. Of course, this might harm the training process, but there is an easy trick to overcome this: dimensionality reduction. You could e.g. train a PCA on your 150-dimensional target vector and reduce its size to something much smaller - that way the dependencies are modeled by the PCA and your output has a much more manageable size.
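A hedged sketch of that trick with scikit-learn (the target matrix and the number of components here are placeholders to be adapted):

    from sklearn.decomposition import PCA
    import numpy as np

    Y_full = np.random.rand(1096, 150)         # placeholder for 3 targets * 50 products per day
    pca = PCA(n_components=20)                 # number of components is an arbitrary choice
    Y_reduced = pca.fit_transform(Y_full)      # train the network to predict these 20 values
    Y_back = pca.inverse_transform(Y_reduced)  # map predictions back to the 150 original targets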
Fourthly, how do I deal with validation data? Normally I would just keep out a randomly selected sample to validate against, but here we need to keep the time ordering in place. So I guess the best is to just keep a few months aside?
This is a really important question. From my experience, you need to test your solution against many types of instability to be sure that it works well. So here are a few rules to keep in mind:
The last point might be a little vague, so to give you some examples:
Of course, you could also try other hold-outs.
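For instance, a minimal sketch of the "keep a few months aside" idea, assuming X and Y are shaped (numProducts, numTimesteps, ...) and using roughly the last three months as the hold-out so the time ordering is preserved:

    val_days = 90                                # arbitrary hold-out length
    trainX, validX = X[:, :-val_days, :], X[:, -val_days:, :]
    trainY, validY = Y[:, :-val_days, :], Y[:, -val_days:, :]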
Fifthly, and this is the part that is probably the most unclear to me - how can I use the actual results to perform predictions? Let's say I used return_sequences=False and I trained on all three years in three batches (each time up to Nov) with the goal of training the model to predict the next value (Dec 2014, Dec 2015, Dec 2016). If I want to use these results in 2017, how does this actually work? If I understood it correctly, the only thing I can do in this instance is to then feed the model all the data points for Jan to Nov 2017 and it will give me back a prediction for Dec 2017. Is that correct? However, if I were to use return_sequences=True, then trained on all data up to Dec 2016, would I then be able to get a prediction for Jan 2017 just by giving the model the features observed at Jan 2017? Or do I need to also give it the 12 months before Jan 2017? What about Feb 2017, do I in addition need to give the value for 2017, plus a further 11 months before that? (If it sounds like I'm confused, it's because I am!)
This depends on how you've built your model:
- if you used return_sequences=True, you need to rewrite the model to have return_sequences=False, or simply take the output and consider only the last step of the result (see the sketch below);
- if you used a varying length, you can feed any number of timesteps preceding your prediction period (but I advise you to feed at least the 7 preceding days).
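For the first case, a small sketch of "taking only the last step from the result", assuming a model trained with return_sequences=True and an input predictX shaped (samples, timesteps, features):

    per_step = model.predict(predictX)    # shape (samples, timesteps, numTargets)
    next_step = per_step[:, -1, :]        # keep only the prediction for the last time step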
Lastly, depending on what structure I should use, how do I do this in Keras? What I have in mind at the moment is something along the following lines: (though this would be for only one product, so doesn't solve having all products in the same model)
Here, more info on what kind of model you've chosen would be needed.
There are several approaches for this problem. The one that you propose seems to be a sliding window.
But in fact you don't need to slice the time dimension; you can input all 3 years at once. You may slice the products dimension in case your batch gets too big for memory or becomes too slow.
You can work with a single array with shape (products, time, features)
Yes, it makes sense to use return_sequences=True.
If I understood your question correctly, you have y predictions for every day, right?
That is really an open question. All approaches have their advantages.
But if you're considering putting all the product features together, and these features are of a different nature for each product, you should probably expand the feature space as if there were one big vector covering all features of all products.
If each product has independent features that apply only to itself, the idea of creating individual models for each product doesn't seem insane to me.
You might also think of making the product ID a one-hot vector input, and use a single model.
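A quick sketch of that one-hot product-ID idea (all names here are illustrative): append a 50-dimensional indicator to every timestep of every product so a single model can tell the products apart.

    import numpy as np

    numProducts, numTimesteps = X.shape[0], X.shape[1]
    one_hot = np.eye(numProducts)                                   # (50, 50) identity matrix
    one_hot = np.repeat(one_hot[:, None, :], numTimesteps, axis=1)  # (50, 1096, 50)
    X_with_id = np.concatenate([X, one_hot], axis=-1)               # (50, 1096, 90 + 50)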
Depending on which approach you choose, you may:
There are also many possible approaches here.
There are approaches where you use sliding windows. You train your model for fixed time lengths.
And there are approaches where you train the LSTM layers with the entire length. In this case you'd first predict the entire known part, and then start predicting the unknown part.
My question: is the X data known for the period where you have to predict Y? Or is X also unknown in this period, so that you also have to predict X?
I recommend taking a look at this question and its answer: How to deal with multi-step time series forecasting in multivariate LSTM in keras
See also this notebook that manages to demonstrate the idea: https://github.com/danmoller/TestRepo/blob/master/TestBookLSTM.ipynb
In this notebook, though, I used an approach that puts X and Y as inputs. And we predict future X and Y.
You can try creating a model (if that's the case) only to predict X. Then a second model to predict Y from X.
In another case (if you already have all X data, no need to predict X), you can create a model that only predicts Y from X. (You'd still follow part of the method in the notebook, where you first predict the already known Y just to make your model get adjusted to where in the sequence it is, then you predict the unknown Y) -- This can be done in one single full-length X input (which contains the training X at the beginning and the test X at the end).
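If all of the future X data is already known, a hedged sketch of that "one single full-length X input" idea could look like this (full_X and known_days are hypothetical names for the concatenated features and the length of the already-known part):

    # full_X: (products, known_days + future_days, features) -- training features followed by
    # the features of the period to forecast; the model was trained with return_sequences=True
    full_pred = model.predict(full_X)           # one prediction per day for the whole sequence
    future_pred = full_pred[:, known_days:, :]  # keep only the part that was actually unknown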
Knowing which approach and which kind of model to choose is probably the exact answer to win the competition... so, there isn't a best answer for this question, every competitor is trying to find out this answer.