I am confused by the behavior of the RoBERTa tokenizer in Hugging Face.
>>> from transformers import RobertaTokenizer
>>> tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
>>> tokenizer.tokenize("The tiger is ___ (big) than the dog.")
['The', 'Ġtiger', 'Ġis', 'Ġ___', 'Ġ(', 'big', ')', 'Ġthan', 'Ġthe', 'Ġdog', '.']
>>> tokenizer.tokenize("The tiger is ___ ( big ) than the dog.")
['The', 'Ġtiger', 'Ġis', 'Ġ___', 'Ġ(', 'Ġbig', 'Ġ)', 'Ġthan', 'Ġthe', 'Ġdog', '.']
>>> tokenizer.encode("The tiger is ___ (big) than the dog.")
[0, 20, 23921, 16, 2165, 36, 8527, 43, 87, 5, 2335, 4, 2]
>>> tokenizer.encode("The tiger is ___ ( big ) than the dog.")
[0, 20, 23921, 16, 2165, 36, 380, 4839, 87, 5, 2335, 4, 2]
>>>
Question: (big) and ( big ) tokenize differently, and therefore produce different token ids as well. Which one should I use? Does this mean I should pre-tokenize the input into ( big ) first and then run RobertaTokenizer? Or does it not really matter?
Secondly, it seems that BertTokenizer has no such ambiguity:
>>> from transformers import BertTokenizer
>>> tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
>>> tokenizer.tokenize("The tiger is ___ (big) than the dog.")
['the', 'tiger', 'is', '_', '_', '_', '(', 'big', ')', 'than', 'the', 'dog', '.']
>>> tokenizer.tokenize("The tiger is ___ ( big ) than the dog.")
['the', 'tiger', 'is', '_', '_', '_', '(', 'big', ')', 'than', 'the', 'dog', '.']
>>>
BertTokenizer gives the same WordPiece result in both cases.
Any thoughts to help me better understand RobertaTokenizer, which I know uses Byte-Pair Encoding?
Hugging Face's Transformers is designed so that you are not supposed to do any pre-tokenization.
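As a quick check (a minimal sketch, assuming roberta-base; calling the tokenizer directly adds special tokens by default, so the ids match the encode output above), you can just pass the raw string:
>>> from transformers import RobertaTokenizer
>>> tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
>>> # raw string in, ids out; no manual pre-tokenization needed
>>> tokenizer("The tiger is ___ (big) than the dog.")['input_ids']
[0, 20, 23921, 16, 2165, 36, 8527, 43, 87, 5, 2335, 4, 2]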
RoBERTa uses a byte-level BPE (the same tokenizer as GPT-2), whose pre-tokenization is lossless. I.e., when you have a tokenized text, you should always be able to say exactly how the text looked before tokenization. The Ġ (analogous to ▁, the special Unicode underscore used by SentencePiece) says that there should be a space at that position when you detokenize. As a consequence, big and Ġbig end up as different tokens. Of course, in this particular context it does not make much sense, because it is obviously still the same word, but this is the price you pay for lossless tokenization, and it is also how RoBERTa was trained.
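You can see the losslessness directly (a minimal sketch, assuming the slow RobertaTokenizer, whose convert_tokens_to_string undoes the byte-level encoding):
>>> from transformers import RobertaTokenizer
>>> tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
>>> tokens = tokenizer.tokenize("The tiger is ___ (big) than the dog.")
>>> # the Ġ markers are decoded back into spaces, recovering the exact input
>>> tokenizer.convert_tokens_to_string(tokens)
'The tiger is ___ (big) than the dog.'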
BERT uses WordPiece, which does not suffer from this problem. On the other hand, the mapping between the original string and the tokenized text is not as straightforward (which might be inconvenient, e.g., when you want to highlight something in user-generated text).
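To illustrate the lossiness (again a sketch, assuming bert-base-uncased): both spacings collapse to the same tokens, so the exact original string cannot be recovered:
>>> from transformers import BertTokenizer
>>> tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
>>> tokens = tokenizer.tokenize("The tiger is ___ (big) than the dog.")
>>> # the casing and the original spacing around the parentheses are gone
>>> tokenizer.convert_tokens_to_string(tokens)
'the tiger is _ _ _ ( big ) than the dog .'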