I am learning to apply the Transformer model proposed in Attention Is All You Need, following the TensorFlow official tutorial Transformer model for language understanding.
As the section Positional encoding says:
Since this model doesn't contain any recurrence or convolution, positional encoding is added to give the model some information about the relative position of the words in the sentence.
The positional encoding vector is added to the embedding vector.
My understanding is that the positional encoding vector is added directly to the embedding vector. But when I looked at the code, I found that the embedding vector is multiplied by a constant before the addition.
The code in the section Encoder is as follows:
class Encoder(tf.keras.layers.Layer):
  def __init__(self, num_layers, d_model, num_heads, dff, input_vocab_size,
               rate=0.1):
    super(Encoder, self).__init__()

    self.d_model = d_model
    self.num_layers = num_layers

    self.embedding = tf.keras.layers.Embedding(input_vocab_size, d_model)
    self.pos_encoding = positional_encoding(input_vocab_size, self.d_model)

    self.enc_layers = [EncoderLayer(d_model, num_heads, dff, rate)
                       for _ in range(num_layers)]

    self.dropout = tf.keras.layers.Dropout(rate)

  def call(self, x, training, mask):
    seq_len = tf.shape(x)[1]

    # adding embedding and position encoding.
    x = self.embedding(x)  # (batch_size, input_seq_len, d_model)
    x *= tf.math.sqrt(tf.cast(self.d_model, tf.float32))
    x += self.pos_encoding[:, :seq_len, :]

    x = self.dropout(x, training=training)

    for i in range(self.num_layers):
      x = self.enc_layers[i](x, training, mask)

    return x  # (batch_size, input_seq_len, d_model)
We can see x *= tf.math.sqrt(tf.cast(self.d_model, tf.float32)) before x += self.pos_encoding[:, :seq_len, :].
So why is the embedding vector multiplied by a constant before adding the positional encoding in the Transformer model?
Word2Vec and GloVe produce static word embeddings, while Transformers use dynamic, contextual word embeddings that are trained from scratch.
Hence, positional information is added to the model explicitly to retain the information about the order of words in a sentence. Positional encoding is the scheme through which the knowledge of the order of objects in a sequence is maintained.
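For reference, here is a minimal sketch of a sinusoidal positional_encoding helper along the lines of what the tutorial's Encoder calls (the exact tutorial code may be organized slightly differently):

import numpy as np
import tensorflow as tf

def positional_encoding(position, d_model):
  # Each dimension i uses the rate 1 / 10000^(2*(i//2) / d_model).
  angle_rads = (np.arange(position)[:, np.newaxis] /
                np.power(10000, (2 * (np.arange(d_model)[np.newaxis, :] // 2)) / np.float32(d_model)))
  # Sine on the even indices (2i), cosine on the odd indices (2i + 1).
  angle_rads[:, 0::2] = np.sin(angle_rads[:, 0::2])
  angle_rads[:, 1::2] = np.cos(angle_rads[:, 1::2])
  return tf.cast(angle_rads[np.newaxis, ...], dtype=tf.float32)  # (1, position, d_model)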
Position embeddings (PEs) are crucial in Transformer-based architectures for capturing word order; without them, the representation is a bag of words. Fully learnable absolute position embeddings (APEs) were first proposed by Gehring et al. (2017) to capture word position in convolutional seq2seq architectures.
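By contrast with the sinusoidal scheme above, a fully learnable absolute position embedding is just a second embedding table indexed by position. A small illustrative sketch (my own, not the tutorial's code; the sizes are made up):

import tensorflow as tf

max_len, d_model = 128, 512
pos_embedding = tf.keras.layers.Embedding(max_len, d_model)  # one learned vector per position

tokens = tf.constant([[12, 7, 99]])            # (batch_size=1, seq_len=3), hypothetical ids
positions = tf.range(tf.shape(tokens)[1])      # [0, 1, 2]
pos_vectors = pos_embedding(positions)         # (3, d_model), trained with the rest of the model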
The Transformer combines these two encodings, the word embedding and the positional encoding, by adding them. The Transformer has two embedding layers. The input sequence is fed to the first embedding layer, known as the Input Embedding. The target sequence is fed to the second embedding layer after shifting the targets right by one position and inserting a Start token in the first position.
The Transformer model is auto-regressive: it makes predictions one part at a time and uses its output so far to decide what to do next. During training, we use teacher forcing.
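A hedged sketch of that target-shifting step (the token ids are made up; in practice the tokenizer supplies the start and end tokens):

import tensorflow as tf

# Target sequence already wrapped in start/end tokens: [START, w1, w2, w3, END]
tar = tf.constant([[1, 5, 9, 4, 2]])   # (batch_size=1, seq_len=5), hypothetical ids

tar_inp = tar[:, :-1]    # decoder input:    [START, w1, w2, w3]
tar_real = tar[:, 1:]    # expected output:  [w1, w2, w3, END]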
Let's say the true output is a three-word sentence, the embedding dimension is five, and our vocabulary size is twenty. Lastly, our batch size is 1 for ease of explanation. After passing these words through the embedding layer, we have a [1, 3, 5] (batch size, number of words, embedding dimension) matrix.
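For instance (a toy sketch with made-up token ids):

import tensorflow as tf

embedding = tf.keras.layers.Embedding(input_dim=20, output_dim=5)  # vocab 20, d_model 5
sentence = tf.constant([[3, 17, 8]])   # (batch_size=1, number_of_words=3)
print(embedding(sentence).shape)       # (1, 3, 5)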
Looking around, I found this argument:
The reason we increase the embedding values before the addition is to make the positional encoding relatively smaller. This means the original meaning in the embedding vector won’t be lost when we add them together.
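A rough way to see this numerically (an illustrative sketch, not from the tutorial; the exact numbers depend on the initializer):

import tensorflow as tf

d_model = 512
embedding = tf.keras.layers.Embedding(8500, d_model)  # default 'uniform' init, roughly U(-0.05, 0.05)

x = embedding(tf.constant([[1, 2, 3]]))               # (1, 3, 512)
print(float(tf.math.reduce_std(x)))                   # ~0.03: tiny next to the sinusoids in [-1, 1]

x_scaled = x * tf.math.sqrt(tf.cast(d_model, tf.float32))
print(float(tf.math.reduce_std(x_scaled)))            # ~0.65: roughly sqrt(512) times larger, so
                                                      # adding the positional encoding no longer
                                                      # drowns out the word embedding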