I am trying to understand how to correctly feed data into my Keras model to classify multivariate time series data into three classes using an LSTM neural network.
I have already looked at different resources, mainly these three excellent blog posts by Jason Brownlee (post1, post2, post3), other SO questions, and various papers. However, none of the information there exactly fits my problem, and I could not figure out whether my data preprocessing and the way I feed the data into the model are correct, so I thought I might get some help if I specify my exact conditions here.
What I am trying to do is classify multivariate time series data, which in its original form is structured as follows:
I have 200 samples
One sample is one csv file.
A sample can have 1 to 50 features (i.e. the csv file has 1 to 50 columns).
Each feature has its value "tracked" over a fixed amount of time steps, let's say 100 (i.e. each csv file has exactly 100 rows).
Each csv file is labeled with one of three classes ("good", "too small", "too big").
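Assuming each csv file really is laid out as rows = time steps and columns = features, a sample loader could look like the following sketch (load_sample and the toy in-memory "file" are hypothetical, not part of the original setup):

```python
import io
import numpy as np

# Hypothetical loader: one csv file per sample,
# rows = time steps, columns = features.
def load_sample(csv_file):
    table = np.loadtxt(csv_file, delimiter=',', ndmin=2)  # (nb_time_steps, nb_features)
    return table.T                                        # (nb_features, nb_time_steps)

# Toy stand-in for a csv file: 2 features tracked over 3 time steps.
toy = io.StringIO("0.1,0.5\n0.2,0.6\n0.3,0.7\n")
sample = load_sample(toy)
print(sample.shape)  # (2, 3)
```

The transpose at the end produces the (features, time steps) layout described in the nested array below.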
So what my current status looks like is the following:
I have a numpy array "samples" with the following structure:
# array holding all samples
[
# sample 1
[
# feature 1 of sample 1
[ 0.1, 0.2, 0.3, 0.2, 0.3, 0.1, 0.2, 0.4, 0.5, 0.1, ... ], # "time series" of feature 1
# feature 2 of sample 1
[ 0.5, 0.6, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1, -0.1, -0.2, ... ], # "time series" of feature 2
... # up to 50 features
],
# sample 2
[
# feature 1 of sample 2
[ 0.1, 0.2, 0.3, 0.2, 0.3, 0.1, 0.2, 0.4, 0.5, 0.1, ... ], # "time series" of feature 1
# feature 2 of sample 2
[ 0.5, 0.6, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1, -0.1, -0.2, ... ], # "time series" of feature 2
... # up to 50 features
],
... # up to sample no. 200
]
I also have a numpy array "labels" with the same length as the "samples" array (i.e. 200). The labels are encoded in the following way:
[0, 2, 2, 1, 0, 1, 2, 0, 0, 0, 1, 2, ... ] # up to label no. 200
This "labels" array is then one-hot encoded with Keras' to_categorical function:
to_categorical(labels, len(np.unique(labels)))
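For reference, the same one-hot encoding can be reproduced with plain numpy (a sketch with toy labels, not the actual to_categorical implementation):

```python
import numpy as np

labels = np.array([0, 2, 2, 1, 0, 1])   # toy class labels
num_classes = len(np.unique(labels))    # 3

# Indexing the identity matrix with the labels picks one
# one-hot row per label, matching to_categorical's output.
one_hot = np.eye(num_classes)[labels]
print(one_hot.shape)  # (6, 3)
```

So a label of 2 becomes the row [0, 0, 1], and the result has one column per class.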
My model definition currently looks like this:
max_nb_features = 50
nb_time_steps = 100
model = Sequential()
model.add(LSTM(5, input_shape=(max_nb_features, nb_time_steps)))
model.add(Dense(3, activation='softmax'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
I then split the data into training / testing data:
samples_train, samples_test, labels_train, labels_test = train_test_split(samples, labels, test_size=0.33)
This leaves us with 134 samples for training and 66 samples for testing.
The problem I'm currently running into is that the following code does not work:
model.fit(samples_train, labels_train, epochs=1, batch_size=1)
The error is the following:
Traceback (most recent call last):
File "lstm_test.py", line 152, in <module>
model.fit(samples_train, labels_train, epochs=1, batch_size=1)
File "C:\Program Files\Python36\lib\site-packages\keras\models.py", line 1002, in fit
validation_steps=validation_steps)
File "C:\Program Files\Python36\lib\site-packages\keras\engine\training.py", line 1630, in fit
batch_size=batch_size)
File "C:\Program Files\Python36\lib\site-packages\keras\engine\training.py", line 1476, in _standardize_user_data
exception_prefix='input')
File "C:\Program Files\Python36\lib\site-packages\keras\engine\training.py", line 113, in _standardize_input_data
'with shape ' + str(data_shape))
ValueError: Error when checking input: expected lstm_1_input to have 3 dimensions, but got array with shape (134, 1)
To me, it seems to fail because of the variable number of features my samples can have. If I use "fake" (generated) data where all parameters are the same except that every sample has exactly the same number of features (50), the code works.
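One way around the variable feature count is to zero-pad every sample up to max_nb_features before stacking, so the whole dataset becomes one dense (samples, time steps, features) array instead of a ragged object array. This is a sketch under that assumption (pad_sample is a hypothetical helper, and whether zero padding is appropriate depends on the data):

```python
import numpy as np

NB_TIME_STEPS = 100
MAX_NB_FEATURES = 50

def pad_sample(sample, max_nb_features=MAX_NB_FEATURES):
    """Zero-pad a (nb_features, nb_time_steps) sample and return it
    as (nb_time_steps, max_nb_features), the layout Keras expects."""
    sample = np.asarray(sample, dtype=np.float32)
    missing = max_nb_features - sample.shape[0]
    # append `missing` all-zero feature rows, leave the time axis alone
    padded = np.pad(sample, ((0, missing), (0, 0)), mode='constant')
    return padded.T

# Toy samples with 3 and 50 features respectively.
raw = [np.random.rand(3, NB_TIME_STEPS), np.random.rand(50, NB_TIME_STEPS)]
samples = np.stack([pad_sample(s) for s in raw])
print(samples.shape)  # (2, 100, 50)
```

With every sample padded to the same width, np.stack produces the 3-D array the LSTM layer is asking for.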
Now what I'm trying to understand is: is my model definition (in particular batch_size and input_shape) correct / sensible?
I believe the array you feed to Keras should have the shape
(number_of_samples, nb_time_steps, max_nb_features),
while the input_shape argument of the first layer omits the sample axis, i.e. input_shape=(nb_time_steps, max_nb_features). And most often nb_time_steps = 1 in forecasting setups.
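The shape contract can be sanity-checked without Keras: the array passed to fit has three axes, and input_shape is simply that shape with the sample axis dropped (a plain numpy check on toy data, not a full model):

```python
import numpy as np

nb_time_steps, max_nb_features = 100, 50

# Toy dataset in the prescribed layout: (samples, time steps, features).
samples = np.random.rand(10, nb_time_steps, max_nb_features).astype(np.float32)

# The input_shape argument of the first layer omits the sample axis,
# i.e. LSTM(5, input_shape=(nb_time_steps, max_nb_features)).
input_shape = samples.shape[1:]
print(input_shape)  # (100, 50)
```

Note also that with a three-way softmax output and one-hot labels, categorical_crossentropy is the matching loss rather than binary_crossentropy.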
P.S.: I tried solving a very similar problem for an internship position (but my results turned out to be wrong). You may take a look here: https://github.com/AbbasHub/Deep_Learning_LSTM/blob/master/2018-09-22_Multivariate_LSTM.ipynb (see if you can spot my mistake!)