I am wondering how LSTMs work in Keras. In this tutorial, for example, as in many others, you can find something like this:
model.add(LSTM(4, input_shape=(1, look_back)))
What does the "4" mean. Is it the number of neuron in the layer. By neuron, I mean something that for each instance gives a single output ?
Actually, I found this brilliant discussion but wasn't really convinced by the explanation given in the reference it mentions.
On the scheme, one can see num_units illustrated, and I think I am not wrong in saying that each of these units is a very atomic LSTM unit (i.e. the 4 gates). However, how are these units connected? If I am right (but I am not sure), x_(t-1) is of size nb_features, so each feature would be the input of one unit, and num_units must then be equal to nb_features, right?
Now, let's talk about Keras. I have read this post and the accepted answer, and I am having trouble with it. Indeed, the answer says:
Basically, the shape is like (batch_size, timespan, input_dim), where input_dim can be different from unit
In which case? This is what I find hard to reconcile with the previous reference...
Moreover, it says,
LSTM in Keras only define exactly one LSTM block, whose cells is of unit-length.
Okay, but how do I define a full LSTM layer? Is it the input_shape that implicitly creates as many blocks as the number of time_steps (which, as I understand it, is the first element of the input_shape parameter in my piece of code)?
Thanks for enlightening me.
EDIT: would it also be possible to detail clearly how to reshape data of, say, size (n_samples, n_features) for a stateful LSTM model? How should one deal with time_steps and batch_size?
First, units in an LSTM is NOT the number of time_steps.
Each LSTM cell (present at a given time_step) takes in an input x and forms a hidden state vector a; the length of this hidden state vector is what is called units in the Keras LSTM.
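As a quick sanity check (a minimal sketch with made-up sizes, not code from the question), you can confirm that units only controls the length of that hidden vector, independently of the number of input features:

import numpy as np
from tensorflow import keras

# Made-up sizes: 8 samples, 5 time steps, 3 features per step
x = np.random.random((8, 5, 3)).astype("float32")

layer = keras.layers.LSTM(4)   # units=4 -> hidden state vector of length 4
out = layer(x)
print(out.shape)               # (8, 4): one 4-dimensional hidden vector per sample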
You should keep in mind that there is only one RNN cell created by the code
keras.layers.LSTM(units, activation='tanh', …… )
and the RNN operations are repeated Tx times by the class itself.
I've linked this to help you understand it better with a very simple piece of code.
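In the same spirit (this is my own minimal sketch, not the linked code), you can mimic what the layer does internally by reusing a single tf.keras.layers.LSTMCell across Tx time steps; all sizes here are made up for illustration:

import tensorflow as tf

units, Tx, n_features = 4, 5, 3
cell = tf.keras.layers.LSTMCell(units)      # the single cell that gets reused

x = tf.random.normal((1, Tx, n_features))   # (batch, time_steps, features)
h = tf.zeros((1, units))                    # hidden state a<t>
c = tf.zeros((1, units))                    # memory state c<t>

for t in range(Tx):                         # the LSTM layer runs this loop for you
    out, (h, c) = cell(x[:, t, :], states=[h, c])

print(out.shape)                            # (1, 4): same as keras.layers.LSTM(4)(x)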
You can (sort of) think of it exactly as you think of fully connected layers. Units are neurons.
The dimension of the output is the number of neurons, as with most of the well known layer types.
The difference is that in LSTMs these neurons will not be completely independent of each other; they will intercommunicate due to the mathematical operations happening under the hood.
Before going further, it might be interesting to take a look at this very complete explanation about LSTMs, their inputs/outputs, and the usage of stateful=True/False: Understanding Keras LSTMs. Notice that your input shape should be input_shape=(look_back, 1). The input shape goes as (time_steps, features).
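For instance, here is a sketch of the model from the question with the corrected input shape; the look_back value and the final Dense(1) layer are my own assumptions, added only to make the example complete:

from tensorflow import keras

look_back = 10  # assumed window size; use whatever your tutorial defines

model = keras.Sequential()
# input_shape=(time_steps, features): look_back steps, 1 feature per step
model.add(keras.layers.LSTM(4, input_shape=(look_back, 1)))
model.add(keras.layers.Dense(1))  # assumed regression head
model.summary()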
Suppose the input data has shape (batch_size, arbitrary_steps, 3), i.e. 3 features per step.
Each LSTM layer will keep reusing the same units/neurons over and over until all the arbitrary time steps in the input are processed.
The output will have shape (batch, arbitrary_steps, units) if return_sequences=True, or (batch, units) if return_sequences=False. The memory states also have size units, as do the processed inputs coming from the previous step.
To be really precise, there will be two groups of units, one working on the raw inputs, the other working on the already processed inputs coming from the last step. Due to the internal structure, each group will have a number of parameters 4 times bigger than the number of units (this factor of 4 is fixed and unrelated to the number of units you choose).
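One way to see the two groups of units and the fixed factor of 4 is to inspect the layer's weights; this is a small sketch with assumed sizes (units=4, 3 input features):

import numpy as np
from tensorflow import keras

units, n_features = 4, 3
layer = keras.layers.LSTM(units)
layer(np.zeros((1, 5, n_features), dtype="float32"))  # build the layer on dummy data

kernel, recurrent_kernel, bias = layer.get_weights()
print(kernel.shape)            # (3, 16): weights on the raw inputs, last dim = 4 * units
print(recurrent_kernel.shape)  # (4, 16): weights on the previous step's output, last dim = 4 * units
print(bias.shape)              # (16,)
print(layer.count_params())    # 128 == 4 * (units*n_features + units*units + units)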
Flow: the layer outputs either only the last step (return_sequences=False) or all steps (return_sequences=True).
The number of units is the size (length) of the internal vector states, h and c, of the LSTM. That is, no matter the shape of the input, it is upscaled (by a dense transformation) by the various kernels for the i, f, and o gates. The details of how the resulting latent features are transformed into h and c are described in the linked post. In your example, input data of shape (batch_size, timesteps, input_dim) will be transformed to (batch_size, timesteps, 4) if return_sequences is true; otherwise only the last h will be emitted, making it (batch_size, 4). I would recommend using a much higher latent dimension, perhaps 128 or 256, for most problems.
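To make those shapes concrete, here is a small sketch with assumed sizes batch_size=2, timesteps=7, input_dim=5:

import numpy as np
from tensorflow import keras

x = np.random.random((2, 7, 5)).astype("float32")  # (batch_size, timesteps, input_dim)

print(keras.layers.LSTM(4, return_sequences=True)(x).shape)   # (2, 7, 4): h for every step
print(keras.layers.LSTM(4, return_sequences=False)(x).shape)  # (2, 4): only the last h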