 

How to correctly give inputs to Embedding, LSTM and Linear layers in PyTorch?

Tags:

lstm

pytorch

I need some clarity on how to correctly prepare inputs for batch-training using different components of the torch.nn module. Specifically, I'm looking to create an encoder-decoder network for a seq2seq model.

Suppose I have a module with these three layers, in order:

  1. nn.Embedding
  2. nn.LSTM
  3. nn.Linear

nn.Embedding

Input: batch_size * seq_length
Output: batch_size * seq_length * embedding_dimension

I don't have any problems here, I just want to be explicit about the expected shape of the input and output.
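For reference, a minimal shape check (the sizes below are made up purely for illustration):

import torch
import torch.nn as nn

# Hypothetical sizes, just to make the shapes concrete
vocab_size, embedding_dimension = 100, 8
batch_size, seq_length = 4, 10

embedding = nn.Embedding(vocab_size, embedding_dimension)
tokens = torch.randint(0, vocab_size, (batch_size, seq_length))  # batch_size * seq_length
embeds = embedding(tokens)
print(embeds.shape)  # torch.Size([4, 10, 8]) -> batch_size * seq_length * embedding_dimension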

nn.LSTM

Input: seq_length * batch_size * input_size (embedding_dimension in this case)
Output: seq_length * batch_size * hidden_size
last_hidden_state: batch_size * hidden_size
last_cell_state: batch_size * hidden_size

To use the output of the Embedding layer as input for the LSTM layer, I need to transpose axes 1 and 2.

Many examples I've found online do something like x = embeds.view(len(sentence), self.batch_size, -1), but that confuses me. How does this view ensure that elements of the same batch remain in the same batch? What happens when len(sentence) and self.batch_size are the same size?

nn.Linear

Input: batch_size x input_size (hidden_size of LSTM in this case or ??)
Output: batch_size x output_size

If I only need the last_hidden_state of LSTM, then I can give it as input to nn.Linear.

But if I want to make use of Output (which contains all the intermediate hidden states as well), then I need to change nn.Linear's input size to seq_length * hidden_size, and to use Output as input to the Linear module I need to transpose axes 1 and 2 of Output, after which I can reshape it with Output_transposed.view(batch_size, -1).

Is my understanding here correct? And is tensor.transpose(0, 1) the right way to carry out these transpose operations?

Asked by Silpara on Mar 24 '18


1 Answer

Your understanding of most of the concepts is accurate, but there are a few missing points here and there.

Interfacing embedding to LSTM (Or any other recurrent unit)

You have embedding output in the shape of (batch_size, seq_len, embedding_size). Now, there are various ways through which you can pass this to the LSTM.
* You can pass this directly to the LSTM if you create the LSTM with the argument batch_first=True, so that it expects batch-first input.
* Or, you can pass input in the shape of (seq_len, batch_size, embedding_size). So, to convert your embedding output to this shape, you’ll need to transpose the first and second dimensions using torch.transpose(tensor_name, 0, 1), like you mentioned.
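A hedged sketch of both options (embedding_size, hidden_size and embeds are assumed to be defined as above):

import torch.nn as nn

# Option 1: batch-first LSTM, feed the embedding output as-is
lstm = nn.LSTM(input_size=embedding_size, hidden_size=hidden_size, batch_first=True)
lstm_outs, (h_t, h_c) = lstm(embeds)                  # embeds: (batch_size, seq_len, embedding_size)

# Option 2: default seq-first LSTM, transpose the embedding output first
lstm = nn.LSTM(input_size=embedding_size, hidden_size=hidden_size)
lstm_outs, (h_t, h_c) = lstm(embeds.transpose(0, 1))  # (seq_len, batch_size, embedding_size)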

Q. I see many examples online which do something like x = embeds.view(len(sentence), self.batch_size , -1) which confuses me.
A. This is wrong. It mixes up the batches, leaving you with a hopeless learning task. Wherever you see this, you can tell the author to change the statement and use transpose instead.
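To see why, compare the two on a tiny tensor (sizes made up):

import torch

x = torch.arange(6).reshape(2, 3)  # pretend batch_size=2, seq_len=3: [[0, 1, 2], [3, 4, 5]]
print(x.view(3, 2))                # [[0, 1], [2, 3], [4, 5]] -> elements from different batch items get interleaved
print(x.transpose(0, 1))           # [[0, 3], [1, 4], [2, 5]] -> each column still belongs to one batch item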

There is an argument in favor of not using batch_first, which states that the underlying API provided by Nvidia CUDA runs considerably faster when the batch dimension comes second (i.e. with the default seq-first layout).

Using context size

If you feed the embedding output directly to the LSTM, the LSTM's effective context size is fixed at 1. This means that if your inputs are words, the LSTM only ever sees one word at a time. But that is not always what we want, so you may need to expand the context size. This can be done as follows -

# Assuming that embeds is the embedding output and context_size is a defined variable
embeds = embeds.unfold(1, context_size, 1)  # keeping the step size at 1
embeds = embeds.contiguous().view(embeds.size(0), embeds.size(1), -1)  # contiguous() is needed before view after unfold

Unfold documentation
Now, you can proceed as mentioned above to feed this to the LSTM; just remember that seq_len has now changed to seq_len - context_size + 1, and embedding_size (which is the input size of the LSTM) has changed to context_size * embedding_size.
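A quick shape check of the unfold trick (the sizes below are arbitrary):

import torch

embeds = torch.randn(4, 10, 8)              # (batch_size=4, seq_len=10, embedding_size=8)
context_size = 3
embeds = embeds.unfold(1, context_size, 1)  # (4, 8, 8, 3): seq_len shrinks to seq_len - context_size + 1
embeds = embeds.contiguous().view(embeds.size(0), embeds.size(1), -1)
print(embeds.shape)                         # torch.Size([4, 8, 24]) -> (batch, seq_len - context_size + 1, context_size * embedding_size)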

Using variable sequence lengths

The input size of different instances in a batch will not always be the same. For example, some of your sentences might be 10 words long, some might be 15, and some might be 1000. So, you definitely want variable-length sequence input to your recurrent unit. To do this, there are some additional steps that need to be performed before you can feed your input to the network. You can follow these steps -
1. Sort your batch from largest sequence to the smallest.
2. Create a seq_lengths array that defines the length of each sequence in the batch. (This can be a simple python list)
3. Pad all the sequences to be of equal length to the largest sequence.
4. Create LongTensor Variable of this batch.
5. Now, after passing the above variable through embedding and creating the proper context size input, you’ll need to pack your sequence as follows -

# Assuming embeds to be the proper input to the LSTM
lstm_input = nn.utils.rnn.pack_padded_sequence(
    embeds, [x - context_size + 1 for x in seq_lengths], batch_first=False)
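For completeness, a minimal sketch of steps 1-4 above, which produce the padded batch that is fed through the embedding (seqs and pad_token are hypothetical names; seqs is a list of token-id lists):

import torch

seqs = sorted(seqs, key=len, reverse=True)                     # 1. largest sequence first
seq_lengths = [len(s) for s in seqs]                           # 2. length of each sequence
max_len = seq_lengths[0]
padded = [s + [pad_token] * (max_len - len(s)) for s in seqs]  # 3. pad everything to the largest length
batch = torch.LongTensor(padded)                               # 4. (batch_size, max_len)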

Understanding output of LSTM

Now, once you have prepared your lstm_input according to your needs, you can call the lstm as

lstm_outs, (h_t, h_c) = lstm(lstm_input, (h_t, h_c)) 

Here, (h_t, h_c) needs to be provided as the initial hidden state, and the LSTM will return the final hidden state. You can see why packing the variable-length sequences is required; otherwise the LSTM would run over the unnecessary padded words as well.
Now, lstm_outs will be a packed sequence, which is the output of the lstm at every step, and (h_t, h_c) are the final hidden state and the final cell state respectively. h_t and h_c will be of shape (num_layers * num_directions, batch_size, lstm_size); for a single-layer, unidirectional LSTM that is (1, batch_size, lstm_size). You can use these directly as further input, but if you want to use the intermediate outputs as well, you'll need to unpack lstm_outs first, as below.
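If you want to pass an explicit initial state (it defaults to zeros when you omit it), here is a sketch for a single-layer, unidirectional LSTM, assuming lstm, lstm_input, batch_size and lstm_size from above:

import torch

num_layers, num_directions = 1, 1
h_0 = torch.zeros(num_layers * num_directions, batch_size, lstm_size)  # initial hidden state
c_0 = torch.zeros(num_layers * num_directions, batch_size, lstm_size)  # initial cell state
lstm_outs, (h_t, h_c) = lstm(lstm_input, (h_0, c_0))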

lstm_outs, _ = nn.utils.rnn.pad_packed_sequence(lstm_outs) 

Now, your lstm_outs will be of shape (max_seq_len - context_size + 1, batch_size, lstm_size). Now, you can extract the intermediate outputs of lstm according to your need.

Remember that the unpacked output will contain 0s after each sequence's length; this is just padding to match the length of the largest sequence (which is always the first one, since we sorted the input from largest to smallest).

Also note that h_t will always be equal to the last non-padded output element of each sequence.
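A sketch of gathering that last non-padded output for each sequence (seq_lengths and context_size are assumed from above):

import torch

# lstm_outs: (max_seq_len - context_size + 1, batch_size, lstm_size)
adjusted = torch.LongTensor([l - context_size + 1 for l in seq_lengths])
idx = (adjusted - 1).view(1, -1, 1).expand(1, lstm_outs.size(1), lstm_outs.size(2))
last_outs = lstm_outs.gather(0, idx).squeeze(0)  # (batch_size, lstm_size)
# For a single-layer, unidirectional LSTM this matches h_t.squeeze(0)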

Interfacing lstm to linear

Now, if you want to use just the final output of the lstm, you can directly feed h_t to your linear layer and it will work. But if you want to use the intermediate outputs as well, you'll need to figure out how to feed them into the linear layer (through some attention network or some pooling). You do not want to feed the complete sequence to the linear layer, as different sequences will have different lengths and you can't fix the input size of the linear layer. And yes, you'll need to transpose the output of the lstm before using it further (again, you cannot use view here).
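A minimal sketch of feeding just the final hidden state to a linear layer (output_size is a hypothetical name):

import torch.nn as nn

linear = nn.Linear(lstm_size, output_size)
# h_t: (num_layers * num_directions, batch_size, lstm_size); take the last layer's state
logits = linear(h_t[-1])  # (batch_size, output_size)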

Ending Note: I have purposefully left out some points, such as using bidirectional recurrent cells, using a step size in unfold, and interfacing attention, as they can get quite cumbersome and are outside the scope of this answer.

Answered by layog on Sep 20 '22