I am new to deep learning and currently working on using LSTMs for language modeling. I was looking at the pytorch documentation and was confused by it.
If I create a
nn.LSTM(input_size, hidden_size, num_layers)
where hidden_size = 4 and num_layers = 2, I think I will have an architecture something like:
op0 op1 ....
LSTM -> LSTM -> h3
LSTM -> LSTM -> h2
LSTM -> LSTM -> h1
LSTM -> LSTM -> h0
x0 x1 .....
If I do something like
nn.LSTM(input_size, hidden_size, 1)
nn.LSTM(input_size, hidden_size, 1)
I think the network architecture will look exactly like above. Am I wrong? And if yes, what is the difference between these two?
num_layers in RNN is just stacking RNNs on top of each other. So you get a hidden from each layer and an output only from the topmost layer.
The output of the Pytorch LSTM layer is a tuple with two elements.
Generally, 2 layers have shown to be enough to detect more complex features. More layers can be better but also harder to train. As a general rule of thumb — 1 hidden layer work with simple problems, like this, and two are enough to find reasonably complex features.
Here the hidden_size of the LSTM layer would be 512 as there are 512 units in each LSTM cell and the num_layers would be 2. The num_layers is the number of layers stacked on top of each other.
The multi-layer LSTM is better known as stacked LSTM where multiple layers of LSTM are stacked on top of each other.
Your understanding is correct. The following two definitions of stacked LSTM are same.
nn.LSTM(input_size, hidden_size, 2)
and
nn.Sequential(OrderedDict([
('LSTM1', nn.LSTM(input_size, hidden_size, 1),
('LSTM2', nn.LSTM(hidden_size, hidden_size, 1)
]))
Here, the input is feed into the lowest layer of LSTM and then the output of the lowest layer is forwarded to the next layer and so on so forth. Please note, the output size of the lowest LSTM layer and the rest of the LSTM layer's input size is hidden_size
.
However, you may have seen people defined stacked LSTM in the following way:
rnns = nn.ModuleList()
for i in range(nlayers):
input_size = input_size if i == 0 else hidden_size
rnns.append(nn.LSTM(input_size, hidden_size, 1))
The reason people sometimes use the above approach is that if you create a stacked LSTM using the first two approaches, you can't get the hidden states of each individual layer. Check out what LSTM returns in PyTorch.
So, if you want to have the intermedia layer's hidden states, you have to declare each individual LSTM layer as a single LSTM and run through a loop to mimic the multi-layer LSTM operations. For example:
outputs = []
for i in range(nlayers):
if i != 0:
sent_variable = F.dropout(sent_variable, p=0.2, training=True)
output, hidden = rnns[i](sent_variable)
outputs.append(output)
sent_variable = output
In the end, outputs
will contain all the hidden states of each individual LSTM layer.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With