 

How are the TokenEmbeddings in BERT created?

In the paper describing BERT, there is this paragraph about WordPiece Embeddings.

We use WordPiece embeddings (Wu et al., 2016) with a 30,000 token vocabulary. The first token of every sequence is always a special classification token ([CLS]). The final hidden state corresponding to this token is used as the aggregate sequence representation for classification tasks. Sentence pairs are packed together into a single sequence. We differentiate the sentences in two ways. First, we separate them with a special token ([SEP]). Second, we add a learned embedding to every token indicating whether it belongs to sentence A or sentence B. As shown in Figure 1, we denote the input embedding as E, the final hidden vector of the special [CLS] token as C ∈ R^H, and the final hidden vector for the i-th input token as T_i ∈ R^H. For a given token, its input representation is constructed by summing the corresponding token, segment, and position embeddings. A visualization of this construction can be seen in Figure 2.

[Fig. 2 from the paper]
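For intuition, here is a minimal sketch (PyTorch) of that "sum of token, segment, and position embeddings" construction. The sizes below are illustrative rather than the exact BERT configuration, and the LayerNorm and dropout that BERT applies after the sum are omitted:

```python
import torch
import torch.nn as nn

class BertInputEmbeddings(nn.Module):
    def __init__(self, vocab_size=30000, hidden_size=768, max_position=512):
        super().__init__()
        self.token_embeddings = nn.Embedding(vocab_size, hidden_size)       # one vector per wordpiece
        self.segment_embeddings = nn.Embedding(2, hidden_size)              # sentence A vs. sentence B
        self.position_embeddings = nn.Embedding(max_position, hidden_size)  # learned positions

    def forward(self, token_ids, segment_ids):
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        positions = positions.unsqueeze(0).expand_as(token_ids)
        # Input representation = token + segment + position embedding, summed element-wise
        return (self.token_embeddings(token_ids)
                + self.segment_embeddings(segment_ids)
                + self.position_embeddings(positions))

# Example: one sequence of 6 wordpiece ids, first 5 from sentence A, last 1 from sentence B
emb = BertInputEmbeddings()
token_ids = torch.tensor([[101, 1045, 2066, 2147, 102, 102]])  # ids are illustrative
segment_ids = torch.tensor([[0, 0, 0, 0, 0, 1]])
print(emb(token_ids, segment_ids).shape)  # torch.Size([1, 6, 768])
```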

As I understand it, WordPiece splits words into word pieces (e.g., "I like swimming" becomes something like "I like swim ##ing"), but it does not generate embeddings. I could not find anything in the paper or in other sources about how those token embeddings are generated. Are they pretrained before the actual pre-training? How? Or are they randomly initialized?

asked Sep 16 '19 by chefhose



1 Answer

The wordpiece vocabulary is trained separately, such that the most frequent words remain whole and the less frequent words eventually get split down to characters.
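To illustrate the effect of such a vocabulary, here is a toy greedy longest-match-first segmentation in the spirit of WordPiece. The vocabulary below is hand-made for the example; the real vocabulary is learned from corpus statistics before BERT training, and the real algorithm differs in detail:

```python
def wordpiece_tokenize(word, vocab):
    """Greedily match the longest vocabulary piece, marking continuations with '##'."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end] if start == 0 else "##" + word[start:end]
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:          # cannot be segmented with this vocabulary
            return ["[UNK]"]
        pieces.append(piece)
        start = end
    return pieces

vocab = {"swim", "like", "i", "##ming", "##ing", "##m"}
print(wordpiece_tokenize("swimming", vocab))  # ['swim', '##ming']  (rarer word is split)
print(wordpiece_tokenize("like", vocab))      # ['like']            (frequent word stays whole)
```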

The embeddings are randomly initialized and trained jointly with the rest of BERT. Back-propagation goes through all the layers down to the embedding table, which gets updated just like any other parameter in the network.

Note that only the embeddings of tokens that are actually present in the training batch get updated; the rest remain unchanged. This is also a reason why you need a relatively small wordpiece vocabulary, so that all embeddings get updated frequently enough during training. A small sketch illustrating both points follows below.
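Here is a small PyTorch sketch (toy sizes, not BERT's) of the two claims above: the embedding table is an ordinary trainable parameter reached by back-propagation, and after a backward pass only the rows of tokens that appeared in the batch have a nonzero gradient:

```python
import torch
import torch.nn as nn

emb = nn.Embedding(num_embeddings=10, embedding_dim=4)  # randomly initialized, like any other layer
batch_ids = torch.tensor([[1, 3, 3, 7]])                # only ids 1, 3 and 7 occur in this batch

out = emb(batch_ids)          # lookup of shape (1, 4, 4), part of the computation graph
loss = out.sum()              # stand-in for the real pre-training loss
loss.backward()               # back-propagation reaches the embedding weights

nonzero_rows = emb.weight.grad.abs().sum(dim=1).nonzero().flatten()
print(nonzero_rows)           # tensor([1, 3, 7]) -> rows for unused tokens keep a zero gradient
```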

answered Nov 15 '22 by Jindřich