In the paper describing BERT, there is this paragraph about WordPiece embeddings:
We use WordPiece embeddings (Wu et al., 2016) with a 30,000 token vocabulary. The first token of every sequence is always a special classification token ([CLS]). The final hidden state corresponding to this token is used as the aggregate sequence representation for classification tasks. Sentence pairs are packed together into a single sequence. We differentiate the sentences in two ways. First, we separate them with a special token ([SEP]). Second, we add a learned embedding to every token indicating whether it belongs to sentence A or sentence B. As shown in Figure 1, we denote input embedding as E, the final hidden vector of the special [CLS] token as C ∈ R^H, and the final hidden vector for the i-th input token as T_i ∈ R^H. For a given token, its input representation is constructed by summing the corresponding token, segment, and position embeddings. A visualization of this construction can be seen in Figure 2.
As I understand it, WordPiece splits words into word pieces (e.g. "swimming" → "swim", "##ing"), but it does not generate embeddings itself. I did not find anything in the paper or in other sources about how those token embeddings are generated. Are they pretrained before the actual pre-training? How? Or are they randomly initialized?
The model must know whether a particular token belongs to sentence A or sentence B in BERT. This is achieved with a segment embedding: one learned vector for sentence A and another for sentence B. The segment embedding layer therefore contains just two vector representations.
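As a minimal sketch (using the Hugging Face transformers library, which is not mentioned above but implements the same scheme), you can see the A/B markers and the two segment vectors directly; BERT then sums the token, segment, and position embeddings internally:

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

# Encode a sentence pair: token_type_ids marks sentence A with 0 and sentence B with 1
enc = tokenizer("The cat sat on the mat.", "It fell asleep.", return_tensors="pt")
print(enc["token_type_ids"])  # e.g. tensor([[0, 0, ..., 0, 1, 1, ..., 1]])

# The segment-embedding table holds exactly two learned vectors (one per segment)
print(model.embeddings.token_type_embeddings.weight.shape)  # torch.Size([2, 768])

# Inside model.embeddings, the token, segment, and position embeddings are
# summed (and then layer-normalized) to form the input representation
with torch.no_grad():
    summed = model.embeddings(input_ids=enc["input_ids"],
                              token_type_ids=enc["token_type_ids"])
print(summed.shape)  # (batch, sequence_length, hidden_size)
```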
As mentioned earlier, BERT embeddings are used by many products across the Internet in a variety of recommendation tasks. We convert any user-generated content into embeddings to obtain a numerical vector representation of these inputs.
Sentence-BERT uses a Siamese-network-like architecture that takes two sentences as input. Each sentence is passed through a BERT model and a pooling layer to generate its embedding. The embeddings of the two sentences are then used to calculate their cosine similarity.
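A short sketch with the sentence-transformers library (the model name here is just an illustrative choice, not something prescribed above):

```python
from sentence_transformers import SentenceTransformer, util

# Any pretrained Sentence-BERT-style model works here; this name is an example
model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = ["A man is playing a guitar.", "Someone is performing music."]
embeddings = model.encode(sentences, convert_to_tensor=True)

# Cosine similarity between the two sentence embeddings
score = util.cos_sim(embeddings[0], embeddings[1])
print(float(score))
```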
To generate embeddings, you can choose either an autoencoder or a predictor. The default choice is an autoencoder; choose a predictor instead when specific features in your dataset determine similarity. For completeness, let's look at both cases.
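As a rough illustration of the autoencoder case only (a toy PyTorch sketch with made-up dimensions and random data, not a reference implementation): the network is trained to reconstruct its input, and the bottleneck activations are used as the embeddings. A predictor would be trained the same way but asked to predict only the features that determine similarity.

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    """Toy autoencoder: the bottleneck output serves as the embedding."""
    def __init__(self, n_features=20, embedding_dim=4):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_features, 8), nn.ReLU(), nn.Linear(8, embedding_dim))
        self.decoder = nn.Sequential(
            nn.Linear(embedding_dim, 8), nn.ReLU(), nn.Linear(8, n_features))

    def forward(self, x):
        z = self.encoder(x)            # the embedding
        return self.decoder(z), z

model = AutoEncoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.randn(64, 20)                # dummy feature vectors

for _ in range(200):
    reconstruction, _ = model(x)
    loss = nn.functional.mse_loss(reconstruction, x)  # reconstruct every input feature
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

embeddings = model.encoder(x)          # vectors you can compare for similarity
```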
The word pieces are trained separately, so that the most frequent words remain whole and the less frequent words eventually get split down to characters.
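For example, with the Hugging Face tokenizers library you can train a WordPiece vocabulary from scratch (the corpus file name below is a placeholder):

```python
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import WordPieceTrainer

tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

trainer = WordPieceTrainer(
    vocab_size=30000,  # same order of magnitude as BERT's vocabulary
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
)
tokenizer.train(["corpus.txt"], trainer)  # "corpus.txt" is a hypothetical training file

# Frequent words come out as a single piece; rare words are split into
# "##"-prefixed sub-pieces, down to characters in the extreme case.
print(tokenizer.encode("swimming").tokens)
```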
The embeddings are initialized randomly and trained jointly with the rest of BERT; there is no separate pre-training step for them. Back-propagation goes through all the layers down to the embeddings, which get updated just like any other parameters in the network.
Note that only the embeddings of tokens which are actually present in the training batch get updated, and the rest remain unchanged. This is also a reason why you need a relatively small word-piece vocabulary, so that all embeddings get updated frequently enough during training.
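A toy PyTorch sketch (not BERT itself) that shows both points: the embedding table is a randomly initialized parameter trained by back-propagation, and only the rows of tokens that appear in the batch receive non-zero gradients:

```python
import torch
import torch.nn as nn

vocab_size, hidden_size = 10, 4
embedding = nn.Embedding(vocab_size, hidden_size)  # randomly initialized, like BERT's token embeddings
classifier = nn.Linear(hidden_size, 2)

token_ids = torch.tensor([[1, 3, 3, 7]])           # only tokens 1, 3 and 7 occur in this batch
logits = classifier(embedding(token_ids).mean(dim=1))
loss = nn.functional.cross_entropy(logits, torch.tensor([0]))
loss.backward()

# Gradients flow all the way back into the embedding table, but only the rows
# for tokens present in the batch are non-zero; the other rows stay unchanged.
print(embedding.weight.grad.abs().sum(dim=1))      # non-zero at indices 1, 3, 7 only
```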