I'm following the PyTorch seq2seq tutorial, and the torch.bmm method is used like below:
attn_applied = torch.bmm(attn_weights.unsqueeze(0),
                         encoder_outputs.unsqueeze(0))
I understand why we need to multiply the attention weights and the encoder outputs.
What I don't quite understand is why we need the bmm method here. The torch.bmm documentation says:
Performs a batch matrix-matrix product of matrices stored in batch1 and batch2.
batch1 and batch2 must be 3-D tensors each containing the same number of matrices.
If batch1 is a (b×n×m) tensor, batch2 is a (b×m×p) tensor, out will be a (b×n×p) tensor.
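For a quick sanity check of that shape contract, here's a small example (the sizes b=4, n=2, m=3, p=5 are arbitrary):
import torch

batch1 = torch.randn(4, 2, 3)   # (b x n x m)
batch2 = torch.randn(4, 3, 5)   # (b x m x p)
out = torch.bmm(batch1, batch2)
print(out.shape)                # torch.Size([4, 2, 5]), i.e. (b x n x p)

# Each of the b matrix products is computed independently:
assert torch.allclose(out[0], batch1[0] @ batch2[0])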
In the seq2seq model, the encoder encodes the input sequences given as mini-batches. Say, for example, the input is B x S x d, where B is the batch size, S is the maximum sequence length, and d is the word embedding dimension. Then the encoder's output is B x S x h, where h is the hidden state size of the encoder (which is an RNN).
Now while decoding (during training), the input sequences are given one time step at a time, so the input is B x 1 x d and the decoder produces a tensor of shape B x 1 x h. Now, to compute the context vector, we need to compare this decoder hidden state with the encoder's encoded states.
So, consider you have two tensors of shape T1 = B x S x h and T2 = B x 1 x h. Then you can do batch matrix multiplication as follows:
out = torch.bmm(T1, T2.transpose(1, 2))
Essentially you are multiplying a tensor of shape B x S x h with a tensor of shape B x h x 1, and the result is B x S x 1, which is the attention weight for each batch.
Here, the attention weights (B x S x 1) represent a similarity score between the decoder's current hidden state and all of the encoder's hidden states. Now you can multiply the attention weights with the encoder's hidden states (B x S x h) by transposing the latter first, which results in a tensor of shape B x h x 1. If you then squeeze at dim=2, you get a tensor of shape B x h, which is your context vector.
This context vector (B x h) is usually concatenated with the decoder's hidden state (B x 1 x h, squeezed at dim=1) to predict the next token.
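Putting those steps together, here is a minimal runnable sketch of the batched computation described above. The sizes B=2, S=5, h=7 are made up, and the softmax normalization of the scores is an extra step (standard for attention weights) that isn't spelled out in the text above.
import torch
import torch.nn.functional as F

B, S, h = 2, 5, 7                        # arbitrary batch size, source length, hidden size
encoder_states = torch.randn(B, S, h)    # T1: all encoder hidden states
decoder_state = torch.randn(B, 1, h)     # T2: current decoder hidden state

# Similarity scores: (B x S x h) bmm (B x h x 1) -> (B x S x 1)
scores = torch.bmm(encoder_states, decoder_state.transpose(1, 2))
attn_weights = F.softmax(scores, dim=1)  # normalize over the S encoder positions (added here for completeness)

# Context vector: (B x h x S) bmm (B x S x 1) -> (B x h x 1), then squeeze dim=2 -> (B x h)
context = torch.bmm(encoder_states.transpose(1, 2), attn_weights).squeeze(2)

# Concatenate with the decoder state squeezed at dim=1: (B x h) and (B x h) -> (B x 2h)
combined = torch.cat((decoder_state.squeeze(1), context), dim=1)
print(attn_weights.shape, context.shape, combined.shape)
# torch.Size([2, 5, 1]) torch.Size([2, 7]) torch.Size([2, 14])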
While @wasiahmad is right about the general implementation of seq2seq, in the mentioned tutorial there is no batching (B = 1), so the bmm is just over-engineering and can be safely replaced with matmul with the exact same model quality and performance. See for yourself: replace this:
attn_applied = torch.bmm(attn_weights.unsqueeze(0),
                         encoder_outputs.unsqueeze(0))
output = torch.cat((embedded[0], attn_applied[0]), 1)
with this:
attn_applied = torch.matmul(attn_weights,
                            encoder_outputs)
output = torch.cat((embedded[0], attn_applied), 1)
and run the notebook.
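If you want a quick standalone check (outside the notebook) that the two forms agree when B = 1, here is a minimal sketch with made-up sizes in the tutorial's layout, i.e. attn_weights of shape (1, max_length) and encoder_outputs of shape (max_length, hidden_size):
import torch

attn_weights = torch.randn(1, 10)        # (1, max_length), arbitrary sizes
encoder_outputs = torch.randn(10, 256)   # (max_length, hidden_size)

via_bmm = torch.bmm(attn_weights.unsqueeze(0), encoder_outputs.unsqueeze(0))[0]
via_matmul = torch.matmul(attn_weights, encoder_outputs)

print(via_bmm.shape, torch.allclose(via_bmm, via_matmul))  # torch.Size([1, 256]) True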
Also, note that while @wasiahmad describes the encoder input as B x S x d, in PyTorch 1.7.0 the GRU, which is the main engine of the encoder, expects an input of shape (seq_len, batch, input_size) by default. If you want to work with @wasiahmad's format, pass the batch_first=True flag.
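As a small sketch of the difference (sizes are arbitrary):
import torch
import torch.nn as nn

B, S, d, h = 4, 6, 8, 16   # arbitrary batch size, seq length, embedding dim, hidden size

gru_default = nn.GRU(input_size=d, hidden_size=h)   # expects (seq_len, batch, input_size)
out, _ = gru_default(torch.randn(S, B, d))
print(out.shape)   # torch.Size([6, 4, 16])

gru_batch_first = nn.GRU(input_size=d, hidden_size=h, batch_first=True)   # expects (batch, seq_len, input_size)
out, _ = gru_batch_first(torch.randn(B, S, d))
print(out.shape)   # torch.Size([4, 6, 16])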