In Attention Is All You Need, the authors implement a positional embedding (which adds information about where a word is in a sequence). For this, they use a sinusoidal embedding:
PE(pos,2i) = sin(pos/10000**(2*i/hidden_units))
PE(pos,2i+1) = cos(pos/10000**(2*i/hidden_units))
where pos is the position and i is the dimension. This should result in an embedding matrix of shape [max_length, embedding_size]: given a position in the sequence, it returns the vector PE[position, :].
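Just to pin down the indexing, here is the naive, unvectorized transcription of these two equations as I read them (the function name and loop structure are my own, and I assume hidden_units is even):

import numpy as np

def positional_encoding(max_length, hidden_units):
    # One row per position, one column per embedding dimension
    PE = np.zeros((max_length, hidden_units))
    for pos in range(max_length):
        for i in range(hidden_units // 2):
            angle = pos / 10000 ** (2 * i / hidden_units)
            PE[pos, 2 * i] = np.sin(angle)      # even dimensions: sine
            PE[pos, 2 * i + 1] = np.cos(angle)  # odd dimensions: cosine
    return PE  # shape [max_length, hidden_units]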
I found Kyubyong's implementation, but I do not fully understand it.
I then tried to implement it in a vectorized way in numpy:
import numpy as np
import matplotlib.pyplot as plt

hidden_units = 100  # Dimension of the embedding
vocab_size = 10     # Maximum sentence length
# Matrix of [[0, 1, ..., 99], [0, 1, ..., 99], ...]
i = np.tile(np.expand_dims(range(hidden_units), 0), [vocab_size, 1])
# Matrix of [[0, ..., 0], [1, ..., 1], ...]
pos = np.tile(np.expand_dims(range(vocab_size), 1), [1, hidden_units])
# Apply the intermediate functions
pos = np.multiply(pos, 1/10000.0)
i = np.multiply(i, 2.0/hidden_units)
matrix = np.power(pos, i)
# Apply the sine function to the even columns
matrix[:, 0::2] = np.sin(matrix[:, 0::2])  # even
# Apply the cosine function to the odd columns
matrix[:, 1::2] = np.cos(matrix[:, 1::2])  # odd
# Plot
im = plt.imshow(matrix, cmap='hot', aspect='auto')
I don't understand how this matrix can encode information about the position of the inputs. Could someone first tell me whether this is the right way to compute it, and second, what the rationale behind it is?
Thank you.
I found the answer in a PyTorch implementation:
# keep dim 0 for padding token position encoding zero vector
position_enc = np.array([
    [pos / np.power(10000, 2*i/d_pos_vec) for i in range(d_pos_vec)]
    if pos != 0 else np.zeros(d_pos_vec) for pos in range(n_position)])
position_enc[1:, 0::2] = np.sin(position_enc[1:, 0::2]) # dim 2i
position_enc[1:, 1::2] = np.cos(position_enc[1:, 1::2]) # dim 2i+1
return torch.from_numpy(position_enc).type(torch.FloatTensor)
where d_pos_vec is the embedding dimension and n_position the max sequence length.
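To make the snippet self-contained, here is one way to wrap it in a standalone function (the name get_position_encoding and the example sizes are mine, not from the repository):

import numpy as np
import torch

def get_position_encoding(n_position, d_pos_vec):
    # Same logic as the snippet above; row 0 is kept as the zero vector for the padding token
    position_enc = np.array([
        [pos / np.power(10000, 2*i/d_pos_vec) for i in range(d_pos_vec)]
        if pos != 0 else np.zeros(d_pos_vec) for pos in range(n_position)])
    position_enc[1:, 0::2] = np.sin(position_enc[1:, 0::2])  # dim 2i
    position_enc[1:, 1::2] = np.cos(position_enc[1:, 1::2])  # dim 2i+1
    return torch.from_numpy(position_enc).type(torch.FloatTensor)

pe = get_position_encoding(n_position=50, d_pos_vec=512)
print(pe.shape)  # torch.Size([50, 512]): one row per position
print(pe[0])     # all zeros: the padding-token row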
EDIT:
In the paper, the authors say that this representation of the embedding matrix allows "the model to extrapolate to sequence lengths longer than the ones encountered during training".
The only difference between two positions is the pos variable. Check the image below for a graphical representation.
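To make the extrapolation point concrete, here is a small sketch of my own using the plain formula from the paper (unlike the snippet above, it pairs each sine/cosine at the same frequency and has no special zero row for padding): the table built for a longer sequence contains the table built for a shorter one as its first rows, because each row depends only on pos.

import numpy as np

def sinusoid_table(n_position, d_model):
    pos = np.arange(n_position)[:, None]   # shape [n_position, 1]
    i = np.arange(d_model)[None, :]        # shape [1, d_model]
    # Each pair of dimensions (2i, 2i+1) shares the same frequency
    angles = pos / np.power(10000, 2 * (i // 2) / d_model)
    table = np.zeros((n_position, d_model))
    table[:, 0::2] = np.sin(angles[:, 0::2])  # dim 2i
    table[:, 1::2] = np.cos(angles[:, 1::2])  # dim 2i+1
    return table

short = sinusoid_table(n_position=10, d_model=8)   # "training" length
long = sinusoid_table(n_position=50, d_model=8)    # longer "test" length
print(np.allclose(short, long[:10]))  # True: each row depends only on pos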