I understand that WordPiece is used to break text into tokens. And I understand that, somewhere in BERT, the model maps tokens into token embeddings that represent the meaning of the tokens. But what I don't understand is where this happens. Does this all happen in the transformer encoder? Or is there a separate "pre-encoder" that maps the tokens into embeddings before the transformer encoder starts updating the embeddings using its attention scheme?
Inside BERT, as in most other deep learning NLP models, the conversion from token IDs to vectors is done with an embedding layer; in PyTorch, this is the torch.nn.Embedding module.
The embedding layer works as a lookup table: it holds a table of vectors, and indexing that table with a token index gives you that token's vector. The layer receives a tensor of integer token indices and outputs a tensor of the corresponding vectors.
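Here is a minimal PyTorch sketch of that lookup behaviour (the sizes are made up for illustration; BERT-base uses a vocabulary of roughly 30,000 tokens and 768-dimensional vectors):

```python
import torch
import torch.nn as nn

# Toy table: a vocabulary of 10 tokens, each mapped to a 4-dimensional vector.
embedding = nn.Embedding(num_embeddings=10, embedding_dim=4)

# A batch of token indices, as a tokenizer would produce them.
token_ids = torch.tensor([[1, 5, 7]])   # shape: (batch=1, seq_len=3)

vectors = embedding(token_ids)          # shape: (1, 3, 4)

# The layer really is just indexing into its weight table:
assert torch.equal(vectors[0, 0], embedding.weight[1])
```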
The lookup table inside the embedding layer is updated during training like any other trainable parameter in the model.
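A short sketch of that, again with made-up sizes: only the rows of the table that were actually looked up receive gradients and get updated by the optimizer step.

```python
import torch
import torch.nn as nn

embedding = nn.Embedding(10, 4)
optimizer = torch.optim.SGD(embedding.parameters(), lr=0.1)

token_ids = torch.tensor([[1, 5, 7]])
loss = embedding(token_ids).sum()   # dummy loss, just to produce gradients
loss.backward()

before = embedding.weight[1].clone()
optimizer.step()                    # rows 1, 5 and 7 of the table change
assert not torch.equal(before, embedding.weight[1])
```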
The embedding layer is not a separate "pre-encoder"; it is an integral part of the model, trained jointly with the rest of it.
Note that the embedding layer is coupled to the tokenizer used to convert the input text into token indices: the indices generated by the tokenizer are used to index the embedding layer's lookup table, so the tokenizer's vocabulary and the table's size must match.
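You can see this coupling in pretrained checkpoints, for instance with the Hugging Face transformers library (not mentioned above, just one convenient way to inspect BERT):

```python
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

# The tokenizer's vocabulary size matches the number of rows
# in the model's token-embedding table (30522 for bert-base-uncased).
word_embeddings = model.embeddings.word_embeddings   # a torch.nn.Embedding
print(tokenizer.vocab_size, word_embeddings.num_embeddings)
```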
BERT first splits the sentence into tokens using a tokenizer trained with the WordPiece algorithm. Then the first layer, an embedding layer, converts each token ID into a vector; this layer is a lookup table that assigns a fixed embedding to each token. These embeddings then pass through the encoder layers, where the attention mechanism produces a final embedding for each token that takes the context of the sentence into account. Therefore, in the first layer a word with several meanings gets the same embedding regardless of context, but its final embedding differs depending on the word's meaning in the sentence.
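To make the static-versus-contextual distinction concrete, here is a sketch using the Hugging Face transformers library (the sentences and the choice of the word "bank" are my own illustration): the lookup-table vector for "bank" is identical in both sentences, while the encoder output differs.

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

bank_id = tokenizer.convert_tokens_to_ids("bank")

# First-layer lookup: one fixed vector per token, independent of context.
static_vec = model.embeddings.word_embeddings.weight[bank_id]

def contextual_vec(text):
    # Return the last encoder layer's vector for the "bank" token.
    inputs = tokenizer(text, return_tensors="pt")
    idx = inputs["input_ids"][0].tolist().index(bank_id)
    with torch.no_grad():
        out = model(**inputs)
    return out.last_hidden_state[0, idx]

v1 = contextual_vec("I sat on the river bank.")
v2 = contextual_vec("I deposited money at the bank.")

# The contextual vectors diverge because the surrounding words differ.
print(torch.nn.functional.cosine_similarity(v1, v2, dim=0).item())
```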