I don't understand the position embedding in the paper Convolutional Sequence to Sequence Learning. Can anyone help me?
From what I understand, for each word to translate, the input contains both the word itself and its position in the input sequence (say, 0, 1, ..., m).
Now, encoding this position as a single cell holding the raw value pos (in 0..m) would not perform well, for the same reason we use one-hot vectors rather than raw indices to encode words. So the position is instead spread across several input cells, with a one-hot representation (or something similar; one could also imagine a binary representation of the position being used).
Then an embedding layer is used (just as it is for word encodings) to transform this sparse, discrete representation into a continuous one.
The paper chooses to give the word embedding and the position embedding the same dimension, and to simply sum the two.
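If it helps, here is a minimal sketch of that idea in PyTorch. The sizes, variable names, and the use of nn.Embedding are my own illustration, not taken from the paper's code; the point is just that both lookups produce vectors of the same dimension, so they can be added element-wise:

```python
import torch
import torch.nn as nn

# Hypothetical sizes for illustration only (not from the paper).
vocab_size = 10000     # number of words in the vocabulary
max_positions = 100    # maximum sequence length m
embed_dim = 512        # shared dimension for word and position embeddings

word_embed = nn.Embedding(vocab_size, embed_dim)
pos_embed = nn.Embedding(max_positions, embed_dim)

# A toy batch of token indices, shape (batch, seq_len).
tokens = torch.tensor([[4, 17, 256, 9]])
positions = torch.arange(tokens.size(1)).unsqueeze(0)  # [[0, 1, 2, 3]]

# Both lookups return (batch, seq_len, embed_dim); because the
# dimensions match, the two embeddings can simply be summed.
x = word_embed(tokens) + pos_embed(positions)
print(x.shape)  # torch.Size([1, 4, 512])
```

Under this view, the position embedding table is learned during training just like the word embedding table; the only difference is that it is indexed by position rather than by word identity.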