In TensorFlow/Keras we can simply set return_sequences=False for the last LSTM layer before the classification/fully-connected/activation (softmax/sigmoid) layer to get rid of the temporal dimension.
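For example, a minimal Keras sketch of what I mean (layer sizes chosen to mirror my PyTorch model below):

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=65000, output_dim=64),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(8, return_sequences=True)),   # (batch, 512, 16)
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(8, return_sequences=False)),  # (batch, 16)
    tf.keras.layers.Dense(5, activation="softmax"),                                  # (batch, 5)
])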
In PyTorch, I can't find anything similar. For the classification task I don't need a sequence-to-sequence model but a many-to-one architecture: the whole sequence goes in, a single label comes out.
Here's my simple bi-LSTM model.
import torch
from torch import nn

class BiLSTMClassifier(nn.Module):
    def __init__(self):
        super(BiLSTMClassifier, self).__init__()
        self.embedding = torch.nn.Embedding(num_embeddings=65000, embedding_dim=64)
        self.bilstm = torch.nn.LSTM(input_size=64, hidden_size=8, num_layers=2,
                                    batch_first=True, dropout=0.2, bidirectional=True)
        # 5 classes; the input size flattens all 512 timesteps of the bidirectional output
        self.linear = nn.Linear(8 * 2 * 512, 5)

    def forward(self, x):
        x = self.embedding(x)
        print(x.shape)
        x, _ = self.bilstm(x)
        print(x.shape)
        x = self.linear(x.reshape(x.shape[0], -1))
        print(x.shape)
        return x

# create our model
bilstmclassifier = BiLSTMClassifier()
If I observe the shapes after each layer,
xx = torch.tensor(X_encoded[0]).reshape(1,512)
print(xx.shape)
# torch.Size([1, 512])
bilstmclassifier(xx)
#torch.Size([1, 512, 64])
#torch.Size([1, 512, 16])
#torch.Size([1, 5])
What can I do so that the last LSTM returns a tensor with shape (1, 16) instead of (1, 512, 16)?
The output of the PyTorch LSTM layer is a tuple with two elements. The first element of the tuple is the LSTM's output for all timesteps (hᵗ for t = 1, 2, …, T) with shape (timesteps, batch, output_features), or (batch, timesteps, output_features) when batch_first=True as in the model above. The second element of the tuple is another tuple with two elements: the final hidden state h_n and the final cell state c_n.
In the model above, the hidden_size of the LSTM layer is 8 (each LSTM cell outputs 8 features, or 16 per timestep after concatenating the two directions) and num_layers is 2, i.e. two LSTM layers stacked on top of each other; the 512 is the sequence length, not the hidden size.
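For the model from the question, a quick sketch of what that tuple contains (reusing the bilstmclassifier and the xx tensor defined above):

# shapes of everything nn.LSTM returns for the model above
emb = bilstmclassifier.embedding(xx)               # (1, 512, 64)
output, (h_n, c_n) = bilstmclassifier.bilstm(emb)
print(output.shape)  # torch.Size([1, 512, 16]) -> (batch, timesteps, num_directions * hidden_size)
print(h_n.shape)     # torch.Size([4, 1, 8])    -> (num_layers * num_directions, batch, hidden_size)
print(c_n.shape)     # torch.Size([4, 1, 8])    -> same layout as h_n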
Within PyTorch, a Linear (or Dense) layer is defined as y = xAᵀ + b, where A and b are the weight matrix and bias vector of the Linear layer (see the nn.Linear documentation).
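A small sketch of that definition (arbitrary sizes, just to show where A and b live):

import torch
from torch import nn

lin = nn.Linear(in_features=16, out_features=5)
x = torch.randn(1, 16)
print(lin.weight.shape)                                     # torch.Size([5, 16]) -> A
print(lin.bias.shape)                                       # torch.Size([5])     -> b
print(torch.allclose(lin(x), x @ lin.weight.T + lin.bias))  # True: y = x A^T + b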
The simplest way to do this is by indexing into the tensor:
x = x[:, -1, :]
where x is the RNN output. Of course, if batch_first is False, one would have to use x[-1, :, :] (or just x[-1]) to index into the time axis instead.
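Applied to the model from the question, a minimal sketch of the change might look like this (note that the Linear layer then only sees the 16 features of the last timestep, so its input size shrinks from 8*2*512 to 8*2; that resizing is my assumption about how you want to wire it up):

import torch
from torch import nn

class BiLSTMClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.embedding = nn.Embedding(num_embeddings=65000, embedding_dim=64)
        self.bilstm = nn.LSTM(input_size=64, hidden_size=8, num_layers=2,
                              batch_first=True, dropout=0.2, bidirectional=True)
        self.linear = nn.Linear(8 * 2, 5)  # only the last timestep's 16 features

    def forward(self, x):
        x = self.embedding(x)   # (batch, 512, 64)
        x, _ = self.bilstm(x)   # (batch, 512, 16)
        x = x[:, -1, :]         # (batch, 16) -- keep only the last timestep
        return self.linear(x)   # (batch, 5)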
Turns out this is the same thing TensorFlow/Keras does. The relevant code can be found in K.rnn here:
last_output = tuple(o[-1] for o in outputs)
Note that the code at this point uses the time_major data format, so the index is into the first (time) axis. Also, outputs is a tuple because it can be multiple layers, state/cell pairs etc., but it is generally the sequence of outputs for all time steps.
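In other words (a trivial, purely illustrative sketch): for a time-major tensor of shape (timesteps, batch, features), indexing with [-1] keeps only the last timestep:

import torch

t = torch.randn(512, 1, 16)   # time-major: (timesteps, batch, features)
print(t[-1].shape)            # torch.Size([1, 16]) -- the output at the last timestep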
This is then used in the RNN class as follows:
if self.return_sequences:
    output = K.maybe_convert_to_ragged(is_ragged_input, outputs, row_lengths)
else:
    output = last_output
So in total, we can see that return_sequences=False just uses outputs[-1].
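With the x = x[:, -1, :] change sketched above, the question's model should give the desired shapes (assuming the same 512-token xx input):

bilstmclassifier = BiLSTMClassifier()  # the modified version from above
# shapes inside forward: (1, 512, 64) -> (1, 512, 16) -> (1, 16) -> (1, 5)
print(bilstmclassifier(xx).shape)      # torch.Size([1, 5])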