 

How to correctly give inputs to Embedding, LSTM and Linear layers in PyTorch?

Tags:

lstm

pytorch

I need some clarity on how to correctly prepare inputs for batch-training using different components of the torch.nn module. Specifically, I'm looking to create an encoder-decoder network for a seq2seq model.

Suppose I have a module with these three layers, in order:

  1. nn.Embedding
  2. nn.LSTM
  3. nn.Linear

nn.Embedding

Input: batch_size * seq_length
Output: batch_size * seq_length * embedding_dimension

I don't have any problems here, I just want to be explicit about the expected shape of the input and output.
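For reference, a minimal shape check (the sizes below are made up purely for illustration):

import torch
import torch.nn as nn

# Hypothetical sizes, just to make the shapes concrete
vocab_size, embedding_dimension = 100, 8
batch_size, seq_length = 4, 10

embedding = nn.Embedding(vocab_size, embedding_dimension)
tokens = torch.randint(0, vocab_size, (batch_size, seq_length))  # batch_size * seq_length
embeds = embedding(tokens)
print(embeds.shape)  # torch.Size([4, 10, 8]) -> batch_size * seq_length * embedding_dimension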

nn.LSTM

Input: seq_length * batch_size * input_size (embedding_dimension in this case)
Output: seq_length * batch_size * hidden_size
last_hidden_state: batch_size * hidden_size
last_cell_state: batch_size * hidden_size

To use the output of the Embedding layer as input for the LSTM layer, I need to transpose axes 1 and 2.

Many examples I've found online do something like x = embeds.view(len(sentence), self.batch_size, -1), but that confuses me. How does this view ensure that elements of the same batch remain in the same batch? What happens when len(sentence) and self.batch_size are the same size?

nn.Linear

Input: batch_size x input_size (hidden_size of LSTM in this case or ??)
Output: batch_size x output_size

If I only need the last_hidden_state of LSTM, then I can give it as input to nn.Linear.

But if I want to make use of Output (which contains all the intermediate hidden states as well), then I need to change nn.Linear's input size to seq_length * hidden_size, and to use Output as input to the Linear module I need to transpose axes 1 and 2 of Output, after which I can reshape it with Output_transposed.view(batch_size, -1).

Is my understanding here correct? And is tensor.transpose(0, 1) the right way to carry out these transpose operations?

Asked by Silpara on Mar 24 '18


1 Answer

Your understanding of most of the concepts is accurate, but there are a few missing points here and there.

Interfacing embedding to LSTM (Or any other recurrent unit)

You have embedding output in the shape of (batch_size, seq_len, embedding_size). Now, there are various ways through which you can pass this to the LSTM.
* You can pass this directly to the LSTM if you create the LSTM with the argument batch_first=True, so that it expects batch-first input.
* Or, you can pass input in the shape of (seq_len, batch_size, embedding_size). So, to convert your embedding output to this shape, you’ll need to transpose the first and second dimensions using torch.transpose(tensor_name, 0, 1), like you mentioned.
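A hedged sketch of both options (embedding_size, hidden_size and embeds are assumed to be defined as above):

import torch.nn as nn

# Option 1: batch-first LSTM, feed the embedding output as-is
lstm = nn.LSTM(input_size=embedding_size, hidden_size=hidden_size, batch_first=True)
lstm_outs, (h_t, h_c) = lstm(embeds)                  # embeds: (batch_size, seq_len, embedding_size)

# Option 2: default seq-first LSTM, transpose the embedding output first
lstm = nn.LSTM(input_size=embedding_size, hidden_size=hidden_size)
lstm_outs, (h_t, h_c) = lstm(embeds.transpose(0, 1))  # (seq_len, batch_size, embedding_size)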

Q. I see many examples online which do something like x = embeds.view(len(sentence), self.batch_size , -1) which confuses me.
A. This is wrong. It mixes up the batches, leaving you with a hopeless learning task. Wherever you see this, you can tell the author to change the statement and use transpose instead.
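To see why, compare the two on a tiny tensor (sizes made up):

import torch

x = torch.arange(6).reshape(2, 3)  # pretend batch_size=2, seq_len=3: [[0, 1, 2], [3, 4, 5]]
print(x.view(3, 2))                # [[0, 1], [2, 3], [4, 5]] -> elements from different batch items get interleaved
print(x.transpose(0, 1))           # [[0, 3], [1, 4], [2, 5]] -> each column still belongs to one batch item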

There is an argument in favor of not using batch_first, which states that the underlying API provided by Nvidia CUDA runs considerably faster when the batch dimension comes second (i.e. with the default seq-first layout).

Using context size

If you feed the embedding output directly to the LSTM, the LSTM's effective context size is fixed at 1. This means that if your inputs are words, the LSTM only ever sees one word at a time. But that is not always what we want, so you may need to expand the context size. This can be done as follows -

# Assuming that embeds is the embedding output and context_size is a defined variable
embeds = embeds.unfold(1, context_size, 1)  # keeping the step size at 1
embeds = embeds.contiguous().view(embeds.size(0), embeds.size(1), -1)  # contiguous() is needed before view after unfold

Unfold documentation
Now, you can proceed as mentioned above to feed this to the LSTM; just remember that seq_len has now changed to seq_len - context_size + 1, and embedding_size (which is the input size of the LSTM) has changed to context_size * embedding_size.
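A quick shape check of the unfold trick (the sizes below are arbitrary):

import torch

embeds = torch.randn(4, 10, 8)              # (batch_size=4, seq_len=10, embedding_size=8)
context_size = 3
embeds = embeds.unfold(1, context_size, 1)  # (4, 8, 8, 3): seq_len shrinks to seq_len - context_size + 1
embeds = embeds.contiguous().view(embeds.size(0), embeds.size(1), -1)
print(embeds.shape)                         # torch.Size([4, 8, 24]) -> (batch, seq_len - context_size + 1, context_size * embedding_size)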

Using variable sequence lengths

The input size of different instances in a batch will not always be the same. For example, some of your sentences might be 10 words long, some might be 15, and some might be 1000. So, you definitely want variable-length sequence input to your recurrent unit. To do this, there are some additional steps that need to be performed before you can feed your input to the network. You can follow these steps -
1. Sort your batch from largest sequence to the smallest.
2. Create a seq_lengths array that defines the length of each sequence in the batch. (This can be a simple python list)
3. Pad all the sequences to be of equal length to the largest sequence.
4. Create LongTensor Variable of this batch.
5. Now, after passing the above variable through embedding and creating the proper context size input, you’ll need to pack your sequence as follows -

# Assuming embeds to be the proper input to the LSTM
lstm_input = nn.utils.rnn.pack_padded_sequence(
    embeds, [x - context_size + 1 for x in seq_lengths], batch_first=False)
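For completeness, a minimal sketch of steps 1-4 above, which produce the padded batch that is fed through the embedding (seqs and pad_token are hypothetical names; seqs is a list of token-id lists):

import torch

seqs = sorted(seqs, key=len, reverse=True)                     # 1. largest sequence first
seq_lengths = [len(s) for s in seqs]                           # 2. length of each sequence
max_len = seq_lengths[0]
padded = [s + [pad_token] * (max_len - len(s)) for s in seqs]  # 3. pad everything to the largest length
batch = torch.LongTensor(padded)                               # 4. (batch_size, max_len)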

Understanding output of LSTM

Now, once you have prepared your lstm_input according to your needs, you can call the lstm as

lstm_outs, (h_t, h_c) = lstm(lstm_input, (h_t, h_c)) 

Here, (h_t, h_c) needs to be provided as the initial hidden state, and the LSTM will return the final hidden state. You can see why packing the variable-length sequences is required; otherwise the LSTM would run over the unnecessary padded words as well.
Now, lstm_outs will be a packed sequence, which is the output of the lstm at every step, and (h_t, h_c) are the final hidden state and the final cell state respectively. h_t and h_c will be of shape (num_layers * num_directions, batch_size, lstm_size); for a single-layer, unidirectional LSTM that is (1, batch_size, lstm_size). You can use these directly as further input, but if you want to use the intermediate outputs as well, you'll need to unpack lstm_outs first, as below.
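If you want to pass an explicit initial state (it defaults to zeros when you omit it), here is a sketch for a single-layer, unidirectional LSTM, assuming lstm, lstm_input, batch_size and lstm_size from above:

import torch

num_layers, num_directions = 1, 1
h_0 = torch.zeros(num_layers * num_directions, batch_size, lstm_size)  # initial hidden state
c_0 = torch.zeros(num_layers * num_directions, batch_size, lstm_size)  # initial cell state
lstm_outs, (h_t, h_c) = lstm(lstm_input, (h_0, c_0))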

lstm_outs, _ = nn.utils.rnn.pad_packed_sequence(lstm_outs) 

Now, your lstm_outs will be of shape (max_seq_len - context_size + 1, batch_size, lstm_size). Now, you can extract the intermediate outputs of lstm according to your need.

Remember that the unpacked output will contain 0s after each sequence's length; this is just padding to match the length of the largest sequence (which is always the first one, since we sorted the input from largest to smallest).

Also note that h_t will always be equal to the last non-padded output element of each sequence.
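A sketch of gathering that last non-padded output for each sequence (seq_lengths and context_size are assumed from above):

import torch

# lstm_outs: (max_seq_len - context_size + 1, batch_size, lstm_size)
adjusted = torch.LongTensor([l - context_size + 1 for l in seq_lengths])
idx = (adjusted - 1).view(1, -1, 1).expand(1, lstm_outs.size(1), lstm_outs.size(2))
last_outs = lstm_outs.gather(0, idx).squeeze(0)  # (batch_size, lstm_size)
# For a single-layer, unidirectional LSTM this matches h_t.squeeze(0)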

Interfacing lstm to linear

Now, if you want to use just the final output of the lstm, you can directly feed h_t to your linear layer and it will work. But if you want to use the intermediate outputs as well, you'll need to figure out how to feed them into the linear layer (through some attention network or some pooling). You do not want to feed the complete sequence to the linear layer, as different sequences will have different lengths and you can't fix the input size of the linear layer. And yes, you'll need to transpose the output of the lstm before using it further (again, you cannot use view here).
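A minimal sketch of feeding just the final hidden state to a linear layer (output_size is a hypothetical name):

import torch.nn as nn

linear = nn.Linear(lstm_size, output_size)
# h_t: (num_layers * num_directions, batch_size, lstm_size); take the last layer's state
logits = linear(h_t[-1])  # (batch_size, output_size)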

Ending Note: I have purposefully left out some points, such as using bidirectional recurrent cells, using a step size in unfold, and interfacing attention, as they can get quite cumbersome and are outside the scope of this answer.

Answered by layog on Sep 20 '22