
Why is the embedding vector multiplied by a constant in the Transformer model?

I am learning to apply the Transformer model proposed in Attention Is All You Need, following the official TensorFlow tutorial Transformer model for language understanding.

As the Positional encoding section says:

Since this model doesn't contain any recurrence or convolution, positional encoding is added to give the model some information about the relative position of the words in the sentence.

The positional encoding vector is added to the embedding vector.

My understanding was that the positional encoding vector is added directly to the embedding vector. But when I looked at the code, I found that the embedding vector is first multiplied by a constant.

The code in the Encoder section is as follows:

class Encoder(tf.keras.layers.Layer):
  def __init__(self, num_layers, d_model, num_heads, dff, input_vocab_size, 
               rate=0.1):
    super(Encoder, self).__init__()

    self.d_model = d_model
    self.num_layers = num_layers

    self.embedding = tf.keras.layers.Embedding(input_vocab_size, d_model)
    self.pos_encoding = positional_encoding(input_vocab_size, self.d_model)


    self.enc_layers = [EncoderLayer(d_model, num_heads, dff, rate) 
                       for _ in range(num_layers)]

    self.dropout = tf.keras.layers.Dropout(rate)

  def call(self, x, training, mask):

    seq_len = tf.shape(x)[1]

    # adding embedding and position encoding.
    x = self.embedding(x)  # (batch_size, input_seq_len, d_model)
    x *= tf.math.sqrt(tf.cast(self.d_model, tf.float32))
    x += self.pos_encoding[:, :seq_len, :]

    x = self.dropout(x, training=training)

    for i in range(self.num_layers):
      x = self.enc_layers[i](x, training, mask)

    return x  # (batch_size, input_seq_len, d_model)
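
For reference, the positional_encoding helper called above is not shown in this excerpt. Below is a minimal sketch of the sinusoidal scheme from the paper, written to mirror the tutorial; treat it as an illustration rather than the exact tutorial code:

import numpy as np
import tensorflow as tf

def positional_encoding(position, d_model):
  # angle for each (position, dimension) pair: pos / 10000^(2i / d_model)
  pos = np.arange(position)[:, np.newaxis]      # (position, 1)
  i = np.arange(d_model)[np.newaxis, :]         # (1, d_model)
  angle_rads = pos / np.power(10000, (2 * (i // 2)) / np.float32(d_model))

  # sine on even dimensions, cosine on odd dimensions
  angle_rads[:, 0::2] = np.sin(angle_rads[:, 0::2])
  angle_rads[:, 1::2] = np.cos(angle_rads[:, 1::2])

  return tf.cast(angle_rads[np.newaxis, ...], dtype=tf.float32)  # (1, position, d_model)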

We can see that x *= tf.math.sqrt(tf.cast(self.d_model, tf.float32)) is applied before x += self.pos_encoding[:, :seq_len, :].

So why is the embedding vector multiplied by a constant before the positional encoding is added in the Transformer model?

asked Jul 08 '19 by giser_yugang


People also ask

What embeddings are used in Transformers?

Word2Vec and GloVe are based on static word embeddings, while Transformers are based on dynamic (contextual) word embeddings. The embeddings are trained from scratch.

Why do we add positional encoding in the Transformer model?

Self-attention on its own is insensitive to word order, so positional information is added to the model explicitly to retain information about the order of words in a sentence. Positional encoding is the scheme through which knowledge of the order of objects in a sequence is maintained.

What are positional embeddings?

Position embeddings (PEs) are crucial in Transformer-based architectures for capturing word order; without them, the representation is bag-of-words. Fully learnable absolute position embeddings (APEs) were first proposed by Gehring et al. (2017) to capture word position in convolutional seq2seq architectures.
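
For contrast with the sinusoidal encoding used in the tutorial, a fully learnable absolute position embedding is just an ordinary embedding over position indices. A minimal Keras sketch (the class name is illustrative, not taken from any library):

import tensorflow as tf

class LearnedPositionEmbedding(tf.keras.layers.Layer):
  """Adds a trainable position embedding to token embeddings."""
  def __init__(self, max_len, d_model):
    super().__init__()
    self.pos_emb = tf.keras.layers.Embedding(max_len, d_model)

  def call(self, x):                    # x: (batch, seq_len, d_model)
    seq_len = tf.shape(x)[1]
    positions = tf.range(seq_len)       # (seq_len,)
    return x + self.pos_emb(positions)  # broadcasts over the batch dimension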

Which embeddings does spaCy use?

spaCy lets you share a single transformer or other token-to-vector (“tok2vec”) embedding layer between multiple components. You can even update the shared layer, performing multi-task learning.

What is an embedding model?

An embedding model factorizes the input into a vector, and that vector is used to predict the next movie. This means that movies with similar vectors are movies that are commonly watched after similar movies.
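
A toy Keras version of that idea, where each movie id is embedded and the averaged watch history is used to score the next movie (all sizes and layer choices here are purely illustrative):

import tensorflow as tf

num_movies, d = 20000, 32
model = tf.keras.Sequential([
  tf.keras.layers.Embedding(num_movies, d),                  # movie id -> vector
  tf.keras.layers.GlobalAveragePooling1D(),                  # average the watch-history vectors
  tf.keras.layers.Dense(num_movies, activation="softmax"),   # score the next movie
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")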

How does the transformer combine two encodings?

The Transformer combines these two encodings by adding them. The Transformer has two Embedding layers. The input sequence is fed to the first Embedding layer, known as the Input Embedding. The target sequence is fed to the second Embedding layer after shifting the targets right by one position and inserting a Start token in the first position.
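
The "shift right" step can be illustrated in a few lines; START_ID and the token ids below are placeholders:

import tensorflow as tf

START_ID = 1                                   # placeholder id for the Start token
target = tf.constant([[7, 8, 9, 10]])          # (batch, seq_len) target token ids

# prepend the Start token and drop the last target token
decoder_input = tf.concat(
    [tf.ones_like(target[:, :1]) * START_ID, target[:, :-1]], axis=1)
# decoder_input == [[1, 7, 8, 9]]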

How does the transformer model work?

The Transformer model is auto-regressive: it makes predictions one part at a time and uses its output so far to decide what to do next. During training, teacher forcing is used.
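
At inference time, the auto-regressive loop looks roughly like the sketch below. Here transformer, start_id, and end_id stand in for a trained model that takes (encoder input, decoder input) and returns logits of shape (batch, target_len, vocab_size), plus its special token ids, so adapt the call to your own model:

import tensorflow as tf

def greedy_decode(transformer, encoder_input, start_id, end_id, max_len=40):
  # the output sequence starts with the Start token and grows one token per step
  output = tf.constant([[start_id]], dtype=tf.int64)
  for _ in range(max_len):
    logits = transformer([encoder_input, output], training=False)
    next_id = tf.argmax(logits[:, -1, :], axis=-1)    # prediction at the last position
    output = tf.concat([output, next_id[:, tf.newaxis]], axis=-1)
    if int(next_id[0]) == end_id:                     # stop at the End token
      break
  return output[:, 1:]                                # drop the Start token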

What is the batch size of the embedding vector?

Let's say the true output is a three-word sentence, the embedding vector dimension is five, and our vocabulary size is twenty. Lastly, our batch size is 1 for ease of explanation. After passing these words through the embedding layer, we have a [1, 3, 5] (batch size, number of words, embedding dimension) matrix.
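
Those shapes are easy to check directly with the numbers from the example:

import tensorflow as tf

vocab_size, d_model = 20, 5
embedding = tf.keras.layers.Embedding(vocab_size, d_model)
tokens = tf.constant([[4, 11, 3]])     # batch of 1 sentence with 3 words
print(embedding(tokens).shape)         # (1, 3, 5)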


1 Answer

Looking around, I found this argument:

The reason we increase the embedding values before the addition is to make the positional encoding relatively smaller. This means the original meaning in the embedding vector won’t be lost when we add them together.
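
This matches Section 3.4 of the paper, which states that the embedding weights are multiplied by sqrt(d_model). A quick way to see the effect is to compare typical magnitudes at initialization, reusing the positional_encoding sketch above; the exact numbers depend on the default uniform initializer of tf.keras.layers.Embedding, so treat them as rough orders of magnitude:

import tensorflow as tf

d_model = 512
emb_layer = tf.keras.layers.Embedding(1000, d_model)
emb = emb_layer(tf.constant([[1, 2, 3]]))                  # (1, 3, 512)
pos = positional_encoding(50, d_model)[:, :3, :]           # (1, 3, 512), values in [-1, 1]

scale = tf.math.sqrt(tf.cast(d_model, tf.float32))         # ~22.6
print(float(tf.reduce_mean(tf.abs(emb))))                  # small at initialization (~0.025)
print(float(tf.reduce_mean(tf.abs(emb * scale))))          # same order as the sinusoids after scaling
print(float(tf.reduce_mean(tf.abs(pos))))                  # O(1) sinusoid values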

answered Oct 21 '22 by Vladimir Araujo