
Do I need to pre-tokenize the text first before using HuggingFace's RobertaTokenizer? (Different understanding)

I feel confused when using the RoBERTa tokenizer from Hugging Face.

>>> tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
>>> x = tokenizer.tokenize("The tiger is ___ (big) than the dog.")
['The', 'Ġtiger', 'Ġis', 'Ġ___', 'Ġ(', 'big', ')', 'Ġthan', 'Ġthe', 'Ġdog', '.']
>>> x = tokenizer.tokenize("The tiger is ___ ( big ) than the dog.")
['The', 'Ġtiger', 'Ġis', 'Ġ___', 'Ġ(', 'Ġbig', 'Ġ)', 'Ġthan', 'Ġthe', 'Ġdog', '.']
>>> x = tokenizer.encode("The tiger is ___ (big) than the dog.")
[0, 20, 23921, 16, 2165, 36, 8527, 43, 87, 5, 2335, 4, 2]
>>> x = tokenizer.encode("The tiger is ___ ( big ) than the dog.")
[0, 20, 23921, 16, 2165, 36, 380, 4839, 87, 5, 2335, 4, 2]
>>>

Question: (big) and ( big ) have different tokenization results, which leads to different token IDs as well. Which one should I use? Does it mean that I should pre-tokenize the input first to turn it into ( big ) and then use RobertaTokenizer? Or does it not really matter?

Secondly, it seems that BertTokenizer has no such issue:

>>> tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
>>> x = tokenizer.tokenize("The tiger is ___ (big) than the dog.")
['the', 'tiger', 'is', '_', '_', '_', '(', 'big', ')', 'than', 'the', 'dog', '.']
>>> x = tokenizer.tokenize("The tiger is ___ ( big ) than the dog.")
['the', 'tiger', 'is', '_', '_', '_', '(', 'big', ')', 'than', 'the', 'dog', '.']
>>>

BertTokenizer gives me the same result in both cases, using WordPiece tokens.

Any thoughts to help me better understand RobertaTokenizer, which I know uses Byte-Pair Encoding?

Allan-J asked Jun 17 '20 06:06


People also ask

Why do words have to be tokenized before doing sentiment analysis?

Tokenization breaks raw text into smaller pieces, such as words or sentences, called tokens. These tokens help in understanding the context and in developing models for NLP. Tokenization helps in interpreting the meaning of the text by analyzing the sequence of words.

What is tokenization in text preprocessing?

Tokenization is the process of dividing text into a set of meaningful pieces. These pieces are called tokens. For example, we can divide a chunk of text into words, or we can divide it into sentences.

What does the tokenizer do when set to true?

If set to True, the tokenizer assumes the input is already split into words (for instance, by splitting it on whitespace) which it will tokenize. This is useful for NER or token classification.
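
For example (a minimal sketch, not part of the original page; it assumes one of the fast tokenizers shipped with transformers):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")

# Input that has already been split into words, e.g. by whitespace splitting.
words = ["The", "tiger", "is", "big", "."]
enc = tokenizer(words, is_split_into_words=True)
print(enc.tokens())    # subword tokens, e.g. ['<s>', 'The', 'Ġtiger', ...]
print(enc.word_ids())  # index of the source word for each token (None for special tokens)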

How are tokens added by the tokenizer mapped to words?

For a fast tokenizer, this returns a list mapping each token to its actual word in the initial sentence. Special tokens added by the tokenizer are mapped to None, and other tokens are mapped to the index of their corresponding word (several tokens will be mapped to the same word index if they are parts of that word).
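
As a small illustration (not part of the original page; it uses a fast tokenizer, whose encodings expose word_ids()):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
enc = tokenizer("The tiger is big.")
print(enc.tokens())    # e.g. ['[CLS]', 'the', 'tiger', 'is', 'big', '.', '[SEP]']
print(enc.word_ids())  # e.g. [None, 0, 1, 2, 3, 4, None]; special tokens map to None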

Why is it so difficult to perform tokenization?

It is difficult to perform because the process of reading and understanding language is far more complex than it seems at first glance. Tokenization is the process of splitting a string of text into a list of tokens.

Is Hugging Face's RoBERTa model BERT-like?

We'll train a RoBERTa model, which is BERT-like with a couple of changes (check the documentation for more details). In summary: "It builds on BERT and modifies key hyperparameters, removing the next-sentence pretraining objective and training with much larger mini-batches and learning rates" (Hugging Face documentation on RoBERTa).


1 Answer

Hugging Face's Transformers library is designed so that you are not supposed to do any pre-tokenization.
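
For example (a minimal sketch, not part of the original answer), you simply pass the raw string and let the tokenizer do its own splitting:

from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained('roberta-base')

# Pass the raw, untokenized sentence from the question directly.
ids = tokenizer("The tiger is ___ (big) than the dog.")["input_ids"]
print(tokenizer.decode(ids))  # e.g. '<s>The tiger is ___ (big) than the dog.</s>'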

RoBERTa uses a byte-level BPE tokenizer (the same one as GPT-2), and its pre-tokenization is lossless. I.e., when you have tokenized text, you should always be able to say what the text looked like before tokenization. The Ġ (which plays the same role as ▁, the weird Unicode underscore used by SentencePiece) says that there should be a space when you detokenize. As a consequence, big and Ġbig end up as different tokens. Of course, in this particular context it does not make much sense, because it is obviously still the same word, but this is the price you pay for lossless tokenization, and it is also how RoBERTa was trained.
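
A small sketch of that losslessness (not part of the original answer): converting the two tokenizations from the question back to strings recovers their exact original spacing:

from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
print(tokenizer.convert_tokens_to_string(['Ġ(', 'big', ')']))    # ' (big)'
print(tokenizer.convert_tokens_to_string(['Ġ(', 'Ġbig', 'Ġ)']))  # ' ( big )'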

BERT uses WordPiece, which does not suffer from this problem. On the other hand, the mapping between the original string and the tokenized text is not as straightforward (which might be inconvenient, e.g., when you want to highlight something in user-generated text).
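
If you do need that mapping (e.g., for highlighting), the fast tokenizers can return character offsets; a hedged sketch, not part of the original answer:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
text = "The tiger is big."
enc = tokenizer(text, return_offsets_mapping=True)
for token, (start, end) in zip(enc.tokens(), enc["offset_mapping"]):
    print(token, repr(text[start:end]))  # special tokens get the empty span (0, 0)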

Jindřich answered Jan 03 '23 08:01