 

What are the inputs to the transformer encoder and decoder in BERT?

I was reading the BERT paper and was not clear regarding the inputs to the transformer encoder and decoder.

For learning masked language model (Cloze task), the paper says that 15% of the tokens are masked and the network is trained to predict the masked tokens. Since this is the case, what are the inputs to the transformer encoder and decoder?

BERT input representation (from the paper)

Is the input to the transformer encoder this input representation (see image above)? If so, what is the decoder input?

Further, how is the output loss computed? Is it a softmax over only the masked positions? And is the same linear layer used for all masked tokens?

asked Feb 24 '20 by mysticsasuke




1 Answer

Ah, but you see, BERT does not include a Transformer decoder. It is only the encoder part, with a classifier added on top.
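To make that concrete, here is a minimal PyTorch sketch (my own illustration, not the reference implementation; the class name and sizes are made up): an encoder stack whose final hidden states feed a single linear classifier over the vocabulary.

```python
import torch
import torch.nn as nn

class TinyBertMLM(nn.Module):
    """Encoder-only model with a token-level classifier, BERT-style (illustrative sizes)."""
    def __init__(self, vocab_size=30522, hidden=256, layers=4, heads=4, max_len=512):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, hidden)
        self.pos_emb = nn.Embedding(max_len, hidden)
        self.seg_emb = nn.Embedding(2, hidden)          # segment A / B embeddings
        enc_layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)
        self.mlm_head = nn.Linear(hidden, vocab_size)   # the "classifier added on top"

    def forward(self, token_ids, segment_ids):
        # BERT's input representation: sum of token, position, and segment embeddings
        pos = torch.arange(token_ids.size(1), device=token_ids.device).unsqueeze(0)
        x = self.tok_emb(token_ids) + self.pos_emb(pos) + self.seg_emb(segment_ids)
        h = self.encoder(x)                             # no decoder anywhere
        return self.mlm_head(h)                         # logits over the vocab at every position
```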

For masked word prediction, the classifier acts as a decoder of sorts, trying to reconstruct the true identities of the masked words. Non-masked tokens are not included in the classification task and do not affect the loss.
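A common way to express "non-masked positions do not affect the loss" in code is to set their labels to an ignore index, so cross-entropy is computed only where tokens were actually masked. A hedged sketch, assuming `model`, `token_ids`, `segment_ids`, and `labels` from the example above:

```python
import torch.nn.functional as F

# labels: original token ids at the masked positions, -100 everywhere else
# (-100 is the default ignore_index of F.cross_entropy)
logits = model(token_ids, segment_ids)          # (batch, seq_len, vocab)
mlm_loss = F.cross_entropy(
    logits.view(-1, logits.size(-1)),           # flatten to (batch*seq_len, vocab)
    labels.view(-1),                            # flatten to (batch*seq_len,)
    ignore_index=-100,                          # non-masked positions contribute nothing
)
```

Note that the same linear layer produces logits at every position; the masking only determines which positions enter the loss.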

BERT is also trained on next sentence prediction: deciding whether, for a pair of sentences, the second one really follows the first or not.
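That objective is just a second, binary classifier on the final hidden state of the [CLS] token. Another rough sketch, where `h` (the encoder output) and `is_next_labels` are assumed inputs:

```python
import torch.nn as nn
import torch.nn.functional as F

nsp_head = nn.Linear(256, 2)                    # IsNext vs NotNext (hidden size from the sketch above)

# h: final encoder hidden states, shape (batch, seq_len, hidden);
# position 0 is the [CLS] token, whose vector summarizes the sentence pair
cls_vec = h[:, 0, :]
nsp_logits = nsp_head(cls_vec)
nsp_loss = F.cross_entropy(nsp_logits, is_next_labels)   # is_next_labels: (batch,) of 0/1
```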

I do not remember how the two losses are weighted.

I hope this draws a clearer picture.

answered Oct 29 '22 by user2182857