I am learning to apply the Transformer model proposed in Attention Is All You Need, following the TensorFlow official tutorial Transformer model for language understanding.
As the section Positional encoding says:
Since this model doesn't contain any recurrence or convolution, positional encoding is added to give the model some information about the relative position of the words in the sentence.
The positional encoding vector is added to the embedding vector.
My understanding is that the positional encoding vector is added directly to the embedding vector. But when I looked at the code, I found that the embedding vector is multiplied by a constant before the addition.
The code in the section Encoder is as follows:
class Encoder(tf.keras.layers.Layer):
  def __init__(self, num_layers, d_model, num_heads, dff, input_vocab_size,
               rate=0.1):
    super(Encoder, self).__init__()

    self.d_model = d_model
    self.num_layers = num_layers

    self.embedding = tf.keras.layers.Embedding(input_vocab_size, d_model)
    self.pos_encoding = positional_encoding(input_vocab_size, self.d_model)

    self.enc_layers = [EncoderLayer(d_model, num_heads, dff, rate)
                       for _ in range(num_layers)]

    self.dropout = tf.keras.layers.Dropout(rate)

  def call(self, x, training, mask):
    seq_len = tf.shape(x)[1]

    # adding embedding and position encoding.
    x = self.embedding(x)  # (batch_size, input_seq_len, d_model)
    x *= tf.math.sqrt(tf.cast(self.d_model, tf.float32))
    x += self.pos_encoding[:, :seq_len, :]

    x = self.dropout(x, training=training)

    for i in range(self.num_layers):
      x = self.enc_layers[i](x, training, mask)

    return x  # (batch_size, input_seq_len, d_model)
We can see x *= tf.math.sqrt(tf.cast(self.d_model, tf.float32)) before x += self.pos_encoding[:, :seq_len, :].
So why is the embedding vector multiplied by a constant before adding the positional encoding in the Transformer model?
Word2Vec and GloVe produce static word embeddings, while Transformers use dynamic, contextual word embeddings that are trained from scratch.
Hence, positional information is added to the model explicitly to retain the information about the order of words in a sentence. Positional encoding is the scheme through which the knowledge of the order of objects in a sequence is maintained.
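For reference, here is a minimal sketch of a sinusoidal positional_encoding helper along the lines of what the tutorial's Encoder calls (the exact tutorial code may be organized slightly differently):

import numpy as np
import tensorflow as tf

def positional_encoding(position, d_model):
  # Each dimension i uses the rate 1 / 10000^(2*(i//2) / d_model).
  angle_rads = (np.arange(position)[:, np.newaxis] /
                np.power(10000, (2 * (np.arange(d_model)[np.newaxis, :] // 2)) / np.float32(d_model)))
  # Sine on the even indices (2i), cosine on the odd indices (2i + 1).
  angle_rads[:, 0::2] = np.sin(angle_rads[:, 0::2])
  angle_rads[:, 1::2] = np.cos(angle_rads[:, 1::2])
  return tf.cast(angle_rads[np.newaxis, ...], dtype=tf.float32)  # (1, position, d_model)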
Position embeddings (PEs) are crucial in Transformer-based architectures for capturing word order; without them, the representation is a bag of words. Fully learnable absolute position embeddings (APEs) were first proposed by Gehring et al. (2017) to capture word position in convolutional seq2seq architectures.
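By contrast with the sinusoidal scheme above, a fully learnable absolute position embedding is just a second embedding table indexed by position. A small illustrative sketch (my own, not the tutorial's code; the sizes are made up):

import tensorflow as tf

max_len, d_model = 128, 512
pos_embedding = tf.keras.layers.Embedding(max_len, d_model)  # one learned vector per position

tokens = tf.constant([[12, 7, 99]])            # (batch_size=1, seq_len=3), hypothetical ids
positions = tf.range(tf.shape(tokens)[1])      # [0, 1, 2]
pos_vectors = pos_embedding(positions)         # (3, d_model), trained with the rest of the model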
The Transformer combines these two encodings, the word embedding and the positional encoding, by adding them. The Transformer has two embedding layers. The input sequence is fed to the first embedding layer, known as the Input Embedding. The target sequence is fed to the second embedding layer after shifting the targets right by one position and inserting a Start token in the first position.
The Transformer model is auto-regressive: it makes predictions one part at a time and uses its output so far to decide what to do next. During training, we use teacher forcing.
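A hedged sketch of that target-shifting step (the token ids are made up; in practice the tokenizer supplies the start and end tokens):

import tensorflow as tf

# Target sequence already wrapped in start/end tokens: [START, w1, w2, w3, END]
tar = tf.constant([[1, 5, 9, 4, 2]])   # (batch_size=1, seq_len=5), hypothetical ids

tar_inp = tar[:, :-1]    # decoder input:    [START, w1, w2, w3]
tar_real = tar[:, 1:]    # expected output:  [w1, w2, w3, END]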
Let's say the true output is a three-word sentence, the embedding dimension is five, and our vocabulary size is twenty. Lastly, our batch size is 1 for ease of explanation. After passing these words through the embedding layer, we have a [1, 3, 5] (batch size, number of words, embedding dimension) matrix.
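For instance (a toy sketch with made-up token ids):

import tensorflow as tf

embedding = tf.keras.layers.Embedding(input_dim=20, output_dim=5)  # vocab 20, d_model 5
sentence = tf.constant([[3, 17, 8]])   # (batch_size=1, number_of_words=3)
print(embedding(sentence).shape)       # (1, 3, 5)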
Looking around, I found this argument:
The reason we increase the embedding values before the addition is to make the positional encoding relatively smaller. This means the original meaning in the embedding vector won’t be lost when we add them together.
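A rough way to see this numerically (an illustrative sketch, not from the tutorial; the exact numbers depend on the initializer):

import tensorflow as tf

d_model = 512
embedding = tf.keras.layers.Embedding(8500, d_model)  # default 'uniform' init, roughly U(-0.05, 0.05)

x = embedding(tf.constant([[1, 2, 3]]))               # (1, 3, 512)
print(float(tf.math.reduce_std(x)))                   # ~0.03: tiny next to the sinusoids in [-1, 1]

x_scaled = x * tf.math.sqrt(tf.cast(d_model, tf.float32))
print(float(tf.math.reduce_std(x_scaled)))            # ~0.65: roughly sqrt(512) times larger, so
                                                      # adding the positional encoding no longer
                                                      # drowns out the word embedding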