 

Understanding BERT vocab [unusedxxx] tokens:

I am trying to understand the BERT vocabulary. It has nearly 1,000 [unusedX] tokens, and I don't follow what these tokens are used for. I understand the other special tokens like [SEP] and [CLS], but what is [unused] for?

Thanks!

asked Jun 18 '20 by user12769533


People also ask

What are tokens in BERT?

BERT uses the special tokens [CLS] and [SEP] to structure its input. Every sequence starts with [CLS], and a [SEP] token is inserted at the end of a single input; for sentence pairs, [SEP] also separates the two segments.
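
For illustration, here is a minimal sketch using the HuggingFace transformers library (the model name and sentences are just placeholders) showing where [CLS] and [SEP] land for a single input and for a sentence pair:

    from transformers import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

    # Single input: [CLS] tokens ... [SEP]
    single = tokenizer("BERT uses special tokens.")
    print(tokenizer.convert_ids_to_tokens(single["input_ids"]))
    # ['[CLS]', 'bert', 'uses', 'special', 'tokens', '.', '[SEP]']

    # Sentence pair: [CLS] first ... [SEP] second ... [SEP]
    pair = tokenizer("First sentence.", "Second sentence.")
    print(tokenizer.convert_ids_to_tokens(pair["input_ids"]))
    # ['[CLS]', 'first', 'sentence', '.', '[SEP]', 'second', 'sentence', '.', '[SEP]']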

How are words tokenized in BERT?

BERT uses what is called a WordPiece tokenizer. It splits words either into their full forms (one word becomes one token) or into word pieces, where one word is broken into multiple tokens. This is useful, for example, when a word appears in multiple inflected forms.
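
As a quick illustration (a sketch assuming the HuggingFace transformers library and the bert-base-uncased checkpoint; the example words are arbitrary):

    from transformers import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

    # A common word stays whole; a rarer word is split into pieces,
    # with "##" marking a piece that continues the previous token.
    print(tokenizer.tokenize("playing"))     # ['playing']
    print(tokenizer.tokenize("embeddings"))  # ['em', '##bed', '##ding', '##s']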

What are special tokens?

Special tokens are called special because they are not derived from your input. They are added for a certain purpose and are independent of the specific input.
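
You can list which tokens a given tokenizer treats as special; a minimal sketch, again assuming the HuggingFace transformers library:

    from transformers import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    print(tokenizer.all_special_tokens)
    # ['[UNK]', '[SEP]', '[PAD]', '[CLS]', '[MASK]']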

What is Head_mask in BERT?

head_mask: Mask to nullify selected heads of the self-attention modules.
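
For example, a sketch assuming the HuggingFace BertModel, whose forward() accepts a head_mask of shape (num_layers, num_heads), with 1.0 keeping a head and 0.0 zeroing it out:

    import torch
    from transformers import BertModel, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertModel.from_pretrained("bert-base-uncased")

    inputs = tokenizer("Heads can be masked individually.", return_tensors="pt")

    # Keep every head except the first head of the first layer.
    head_mask = torch.ones(model.config.num_hidden_layers,
                           model.config.num_attention_heads)
    head_mask[0, 0] = 0.0

    outputs = model(**inputs, head_mask=head_mask)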


1 Answer

A quick search reveals the use of these tokens, specifically in a discussion on the original BERT implementation and in a HuggingFace thread.

Unused tokens are helpful if you want to introduce specific words into your fine-tuning or further pre-training procedure; they let you reserve vocabulary slots for words that are relevant only in your context, and avoid the subword splitting those words would otherwise undergo with BERT's original vocabulary. To quote from the first discussion:

Just replace the "[unusedX]" tokens with your vocabulary. Since these were not used they are effectively randomly initialized.
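
As a concrete sketch of that approach (the file paths and domain words are hypothetical, and this assumes the HuggingFace BertTokenizer, whose vocabulary is a plain-text vocab.txt with one token per line):

    from transformers import BertTokenizer

    # Hypothetical domain words we want to keep as single tokens.
    domain_terms = ["electrocardiogram", "myocarditis"]

    # Save the stock vocabulary so we can edit vocab.txt on disk.
    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    tokenizer.save_pretrained("./bert-custom")

    vocab_path = "./bert-custom/vocab.txt"
    with open(vocab_path, encoding="utf-8") as f:
        vocab = f.read().splitlines()

    # Overwrite the first [unusedX] slots with the new words; the
    # vocabulary size (and hence the embedding matrix) is unchanged.
    unused = [i for i, tok in enumerate(vocab) if tok.startswith("[unused")]
    for slot, word in zip(unused, domain_terms):
        vocab[slot] = word

    with open(vocab_path, "w", encoding="utf-8") as f:
        f.write("\n".join(vocab) + "\n")

    # Reload: the new words now map to the formerly unused ids and are
    # no longer split into subwords.
    tokenizer = BertTokenizer.from_pretrained("./bert-custom")
    print(tokenizer.tokenize("electrocardiogram"))  # ['electrocardiogram']

Since the embedding rows behind those slots were never trained, the new tokens start from effectively random embeddings and have to be learned during your fine-tuning or further pre-training.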

answered Oct 16 '22 by dennlinger