I am trying to understand BERT vocab here. It has 1000 [unusedxxx] tokens. I don't follow the usage of these tokens. I understand other special tokens like [SEP], [CLS], but what is [unused] used for?
Thanks!
Input-output format: BERT uses the special tokens [CLS] and [SEP] to structure its input. [CLS] is prepended to every sequence, and a [SEP] token marks the end of each segment, so even a single-sentence input ends with [SEP].
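As a minimal sketch of that format (the function name is mine, not from any library), a single sentence gets one trailing [SEP], and a sentence pair gets a [SEP] after each segment:

```python
def format_input(tokens_a, tokens_b=None):
    """Wrap pre-tokenized text in BERT's [CLS]/[SEP] frame.

    A single segment yields [CLS] a... [SEP]; a pair yields
    [CLS] a... [SEP] b... [SEP].
    """
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"]
    if tokens_b:
        tokens += tokens_b + ["[SEP]"]
    return tokens

# Example:
# format_input(["hello", "world"])
#   -> ["[CLS]", "hello", "world", "[SEP]"]
```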
BERT uses what is called a WordPiece tokenizer. It works by splitting words either into full forms (one word becomes one token) or into word pieces, where one word is broken into multiple tokens. This is useful for handling multiple forms of a word: "play", "playing", and "played" can all share the piece "play".
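The splitting works by greedy longest-match against the vocabulary, with continuation pieces prefixed by "##". Here is a toy sketch of that algorithm; the tiny vocabulary is made up for illustration (real BERT ships a ~30k-entry vocab.txt):

```python
# Illustrative vocabulary; real WordPiece vocabularies are much larger.
VOCAB = {"play", "##ing", "##ed", "##er", "the", "[UNK]"}

def wordpiece(word, vocab=VOCAB):
    """Greedily split one word into the longest pieces found in `vocab`."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        cur = None
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry the ## prefix
            if piece in vocab:
                cur = piece
                break
            end -= 1
        if cur is None:
            return ["[UNK]"]  # nothing matched: fall back to the unknown token
        pieces.append(cur)
        start = end
    return pieces

# wordpiece("playing") -> ["play", "##ing"]
# wordpiece("play")    -> ["play"]
```

Because "playing" is not in the vocabulary but "play" and "##ing" are, the word is split rather than mapped to [UNK].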
Special tokens are called special because they are not derived from your input. They are added for a certain purpose and are independent of the specific input.
A quick search reveals the use of these tokens, specifically in the discussion on the original BERT implementation's issue tracker, and in this HuggingFace thread.
Unused tokens are helpful if you want to introduce specific words into your fine-tuning or further pre-training procedure: they let you handle words that are relevant only in your domain exactly as you want, avoiding the subword splitting that would occur with BERT's original vocabulary. To quote from the first discussion:
Just replace the "[unusedX]" tokens with your vocabulary. Since these were not used they are effectively randomly initialized.
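A minimal sketch of what that replacement looks like on a vocab.txt-style list (the function and the sample entries are illustrative; in the real file the [unused0]..[unused993] slots sit near the top). The key property is that the vocabulary size and all existing token ids stay unchanged, unlike `tokenizer.add_tokens`, which grows the embedding matrix:

```python
def replace_unused(vocab_lines, new_word):
    """Overwrite the first free [unusedX] slot with `new_word`.

    Returns the token id (line index) now assigned to `new_word`.
    """
    for i, tok in enumerate(vocab_lines):
        if tok.startswith("[unused"):
            vocab_lines[i] = new_word
            return i
    raise ValueError("no [unusedX] slot left")

# Example with a toy vocab:
# vocab = ["[PAD]", "[unused0]", "[unused1]", "[CLS]"]
# replace_unused(vocab, "covid19")  -> 1, and vocab[1] == "covid19"
```

After editing vocab.txt this way and reloading the tokenizer, the new word is tokenized as a single piece, and its (effectively randomly initialized) embedding gets trained during fine-tuning.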