I am trying to build a predictive model of stock prices. From what I've read, an LSTM is a good layer to use, but I can't fully understand what my input_shape needs to be for my model.
Here is the tail of my DataFrame:
I then split the data into train/test sets:
labels = df['close'].values
x_train_df = df.drop(columns=['close'])
x_train, x_test, y_train, y_test = train_test_split(x_train_df.values, labels, test_size=0.2, shuffle=False)
min_max_scaler = MinMaxScaler()
x_train = min_max_scaler.fit_transform(x_train)
x_test = min_max_scaler.transform(x_test)
print('y_train', y_train.shape)
print('y_test', y_test.shape)
print('x_train', x_train.shape)
print('x_test', x_test.shape)
print(x_train)
This yields:
Here's where I am getting confused. Running the simple example, I get the following error:
ValueError: Input 0 of layer lstm_15 is incompatible with the layer: expected ndim=3, found ndim=4. Full shape received: [None, 1, 4026, 5]
I've tried various combinations of input_shape and have come to the conclusion that I have no idea how to determine the input shape.
model = Sequential()
model.add(LSTM(32, input_shape=(1, x_train.shape[0], x_train.shape[1])))
model.compile(optimizer='adam', loss='mse', metrics=['accuracy'])
model.fit(x_train, y_train, epochs=10)
Given my DataFrame, what should my input_shape be? I understand that the input shape is (batch size, timesteps, data dim); I'm just not clear how to map those terms to my actual data, since what I thought the values were turned out to be wrong.
I was thinking:
The input of the LSTM is always a 3D array: (batch_size, time_steps, features). The output of the LSTM can be a 2D or 3D array depending on the return_sequences argument.
The LSTM layer is a recurrent layer, hence it expects a 3-dimensional input: (batch_size, timesteps, input_dim).
This shows one timestep's input and output, and the equations for a time-unrolled representation. The LSTM has an input x(t), which can be the output of a CNN or the input sequence directly. h(t-1) and c(t-1) are the inputs from the previous timestep's LSTM, and o(t) is the output of the LSTM for this timestep.
There are 3 inputs to the LSTM cell: h(t-1), the previous timestep's hidden state; c(t-1), the previous timestep's cell state; and x(t), the current timestep's input.
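Written out, the standard LSTM cell update equations behind that description are (with \(\sigma\) the sigmoid and \(\circ\) element-wise multiplication):

```latex
i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)            % input gate
f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)            % forget gate
o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)            % output gate
c_t = f_t \circ c_{t-1} + i_t \circ \tanh(W_c x_t + U_c h_{t-1} + b_c)  % new cell state
h_t = o_t \circ \tanh(c_t)                            % new hidden state
```

Each gate mixes the current input x_t with the previous hidden state h_{t-1}, which is why the layer needs whole sequences (the timesteps dimension) rather than independent rows.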
First of all, I don't think that you need an LSTM at all. Based on the df.tail(), it appears to me that there is no temporal dependence between the rows of the pandas DataFrame (the samples of your dataset). I will come back to that later; first, your question:
Batch size: the number of elements in a batch. In total, the dataset contains 4026 elements; the batch size is the number of elements processed in a single batch. If we assume it is 2, you will have 2013 such batches.
Time steps: the number of samples that have a temporal dependence between them. Assuming that each 3 instances in your dataset constitute one data sequence, the time steps will be 3. It follows that each sample in the dataset now consists of 3 measurements, so the total number of elements becomes 1342 (it was 4026 at the start).
Data dimension: the number of features for each element in the batch, at each time step; in your case 5, assuming that buy is the label and date is the temporal-dependence column.
As a result, the shape of a single batch of data should be (2, 3, 5), while the shape of the whole dataset would be (1342, 3, 5). Please note that these shapes are valid only if you use an LSTM and there is temporal dependence between the time steps. Here is an example code snippet to verify some of this:
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

# Random training data: 1342 sequences, 3 timesteps, 5 features each
x_train = np.random.rand(1342, 3, 5)
# Random binary training labels, one per sequence
y_train = np.random.randint(0, 2, 1342)

model = Sequential()
model.add(LSTM(32, input_shape=(3, 6)))  # input_shape = (timesteps, data_dim); batch size is omitted
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(x_train, y_train, epochs=2, batch_size=32)
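To get real 2D data into that 3D shape, one approach (a sketch, assuming non-overlapping windows of 3 consecutive rows and a row count divisible by 3; the data-dimension of 5 follows the breakdown above) is a plain reshape:

```python
import numpy as np

# Stand-in for the scaled 2D feature matrix: 4026 rows, 5 features
x_flat = np.random.rand(4026, 5)

# Group every 3 consecutive rows into one sequence: (4026, 5) -> (1342, 3, 5)
x_seq = x_flat.reshape(-1, 3, 5)
print(x_seq.shape)  # (1342, 3, 5)
```

If you instead want overlapping sliding windows (one sequence per row, looking back 3 steps), build the sequences in a loop or with numpy's sliding-window utilities rather than a bare reshape.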
Now, back to my previous point. Looking at df.tail(), it seems to me that there is no temporal dependence whatsoever between the samples in the dataset. That said, I would first convert the date column to something meaningful (a one-hot encoding of the month of the year, a one-hot encoding of the 4 seasons depending on the month, etc.). Then, instead of constructing an RNN, I would proceed with a feed-forward neural network with a binary classification output layer.
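As a sketch of that date conversion (the toy values are assumptions; only the date column name comes from the question), one-hot encoding the month with pandas could look like:

```python
import pandas as pd

# Hypothetical date column standing in for the real one
df = pd.DataFrame({'date': pd.to_datetime(['2020-01-15', '2020-06-01', '2020-12-31'])})

# Cast the month to a categorical with all 12 categories so every month
# gets a column even when it is absent from the data
months = df['date'].dt.month.astype(pd.CategoricalDtype(categories=range(1, 13)))
month_onehot = pd.get_dummies(months, prefix='month')
print(month_onehot.shape)  # (3, 12)
```

The resulting 12 columns can then be concatenated onto the feature matrix in place of the raw date.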
As for the model, once you take care of all the data-related work, something as simple as this should work for you:
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# Random training data: 4026 samples, 5 features each
x_train = np.random.rand(4026, 5)
# Random binary training labels
y_train = np.random.randint(0, 2, 4026)

model = Sequential()
model.add(Dense(32, activation='relu', input_dim=5))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(x_train, y_train, epochs=2, batch_size=32)
Hope that it helps!