I'm having trouble understanding the documentation for PyTorch's LSTM module (and also RNN and GRU, which are similar). Regarding the outputs, it says:
Outputs: output, (h_n, c_n)
- output (seq_len, batch, hidden_size * num_directions): tensor containing the output features (h_t) from the last layer of the RNN, for each t. If a torch.nn.utils.rnn.PackedSequence has been given as the input, the output will also be a packed sequence.
- h_n (num_layers * num_directions, batch, hidden_size): tensor containing the hidden state for t=seq_len
- c_n (num_layers * num_directions, batch, hidden_size): tensor containing the cell state for t=seq_len
It seems that the variables output and h_n both give the values of the hidden state. Does h_n just redundantly provide the last time step that's already included in output, or is there something more to it than that?
The output of the PyTorch LSTM layer is a tuple with two elements. The first element of the tuple is the LSTM's output for all timesteps (h_t for t = 1, 2, …, T), with shape (timesteps, batch, output_features). The second element of the tuple is itself a tuple with two elements: h_n and c_n.
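A small sketch of that structure (the sizes here are made up, and bidirectional=True is used only to show where the num_directions factor from the quoted docs comes in):

```python
import torch
import torch.nn as nn

# Hypothetical sizes; bidirectional=True means num_directions is 2.
lstm = nn.LSTM(input_size=10, hidden_size=20, bidirectional=True)
x = torch.randn(7, 4, 10)            # (timesteps, batch, input_features)

output, (h_n, c_n) = lstm(x)         # two-element tuple; second element is itself a tuple
print(output.shape)                  # (7, 4, 40): hidden_size * num_directions
print(h_n.shape, c_n.shape)          # (2, 4, 20) each: num_layers * num_directions
```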
For example, if each LSTM cell has 512 units and two layers of cells are stacked on top of each other, then the hidden_size of the LSTM layer would be 512 and num_layers would be 2. num_layers is the number of layers stacked on top of each other.
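As a sketch of that configuration (the input size and sequence/batch sizes below are arbitrary choices for illustration):

```python
import torch
import torch.nn as nn

# 512 units per cell, two stacked layers; input_size=100 is just an example value.
lstm = nn.LSTM(input_size=100, hidden_size=512, num_layers=2)
x = torch.randn(30, 16, 100)         # (seq_len, batch, input_size)

output, (h_n, c_n) = lstm(x)
print(output.shape)                  # (30, 16, 512): all timesteps, last layer only
print(h_n.shape)                     # (2, 16, 512): last timestep, one slice per stacked layer
```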
The output of an LSTM cell or layer of cells is called the hidden state. This is confusing, because each LSTM cell retains an internal state that is not output, called the cell state, or c.
In Keras, by contrast, the default return value of an LSTM layer is a 2D array of real numbers: the first dimension is the number of samples in the batch given to the LSTM layer, and the second is the dimensionality of the output space, set by the units parameter of the Keras LSTM implementation.
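A minimal sketch of that Keras default, assuming the standard tf.keras API (all sizes are made up):

```python
import numpy as np
import tensorflow as tf

x = np.random.rand(32, 10, 8).astype("float32")  # (batch, timesteps, features)

layer = tf.keras.layers.LSTM(16)                 # units=16; return_sequences=False by default
out = layer(x)
print(out.shape)                                 # (32, 16): (batch_size, units)
```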
I made a diagram. The names follow the PyTorch docs, although I renamed num_layers to w.
output comprises all the hidden states in the last layer ("last" depth-wise, not time-wise). (h_n, c_n) comprises the hidden states after the last timestep, t = n, so you could potentially feed them into another LSTM. The batch dimension is not included in the diagram.
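To make the relationship concrete, here is a small sketch (with arbitrary sizes) showing that the last timestep of output matches the top-layer slice of h_n, and that (h_n, c_n) can be fed back in as an initial state:

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=8, hidden_size=16, num_layers=2)
x = torch.randn(5, 3, 8)                       # (seq_len, batch, input_size)

output, (h_n, c_n) = lstm(x)
print(output.shape)                            # (5, 3, 16): every timestep, last layer only
print(h_n.shape)                               # (2, 3, 16): last timestep, every layer

# The last timestep of `output` equals the top layer's entry in h_n.
print(torch.allclose(output[-1], h_n[-1]))     # True

# (h_n, c_n) can seed another forward pass, e.g. for the next chunk of a long sequence.
next_chunk = torch.randn(5, 3, 8)
output2, (h_n2, c_n2) = lstm(next_chunk, (h_n, c_n))
```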