
Do I need to pre-tokenize the text first before using HuggingFace's RobertaTokenizer? (Different understanding)

I feel confused when using the RoBERTa tokenizer from Hugging Face.

>>> tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
>>> x = tokenizer.tokenize("The tiger is ___ (big) than the dog.")
['The', 'Ġtiger', 'Ġis', 'Ġ___', 'Ġ(', 'big', ')', 'Ġthan', 'Ġthe', 'Ġdog', '.']
>>> x = tokenizer.tokenize("The tiger is ___ ( big ) than the dog.")
['The', 'Ġtiger', 'Ġis', 'Ġ___', 'Ġ(', 'Ġbig', 'Ġ)', 'Ġthan', 'Ġthe', 'Ġdog', '.']
>>> x = tokenizer.encode("The tiger is ___ (big) than the dog.")
[0, 20, 23921, 16, 2165, 36, 8527, 43, 87, 5, 2335, 4, 2]
>>> x = tokenizer.encode("The tiger is ___ ( big ) than the dog.")
[0, 20, 23921, 16, 2165, 36, 380, 4839, 87, 5, 2335, 4, 2]
>>>

Question: (big) and ( big ) have different tokenization results, which leads to different token IDs as well. Which one should I use? Does it mean that I should pre-tokenize the input first to turn it into ( big ) and then use RobertaTokenizer? Or does it not really matter?

Secondly, it seems that BertTokenizer has no such issue:

>>> tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
>>> x = tokenizer.tokenize("The tiger is ___ (big) than the dog.")
['the', 'tiger', 'is', '_', '_', '_', '(', 'big', ')', 'than', 'the', 'dog', '.']
>>> x = tokenizer.tokenize("The tiger is ___ ( big ) than the dog.")
['the', 'tiger', 'is', '_', '_', '_', '(', 'big', ')', 'than', 'the', 'dog', '.']
>>>

BertTokenizer gives me the same result in both cases, using WordPiece tokens.

Any thoughts to help me better understand RobertaTokenizer, which I know uses Byte-Pair Encoding?

Allan-J asked Jun 17 '20 06:06


People also ask

Why do words have to be tokenized before doing sentiment analysis?

Tokenization breaks raw text into smaller pieces, such as words or sentences, called tokens. These tokens help in understanding the context and in developing models for NLP. Tokenization helps in interpreting the meaning of the text by analyzing the sequence of words.

What is tokenization in text preprocessing?

Tokenization is the process of dividing text into a set of meaningful pieces. These pieces are called tokens. For example, we can divide a chunk of text into words, or we can divide it into sentences.

What does the tokenizer do when set to true?

If set to True, the tokenizer assumes the input is already split into words (for instance, by splitting it on whitespace) which it will tokenize. This is useful for NER or token classification.
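
For example (a minimal sketch, not part of the original page; it assumes one of the fast tokenizers shipped with transformers):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")

# Input that has already been split into words, e.g. by whitespace splitting.
words = ["The", "tiger", "is", "big", "."]
enc = tokenizer(words, is_split_into_words=True)
print(enc.tokens())    # subword tokens, e.g. ['<s>', 'The', 'Ġtiger', ...]
print(enc.word_ids())  # index of the source word for each token (None for special tokens)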

How are tokens added by the tokenizer mapped to words?

For a fast tokenizer, this returns a list mapping each token to its actual word in the initial sentence. Special tokens added by the tokenizer are mapped to None, and other tokens are mapped to the index of their corresponding word (several tokens will be mapped to the same word index if they are parts of that word).
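
As a small illustration (not part of the original page; it uses a fast tokenizer, whose encodings expose word_ids()):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
enc = tokenizer("The tiger is big.")
print(enc.tokens())    # e.g. ['[CLS]', 'the', 'tiger', 'is', 'big', '.', '[SEP]']
print(enc.word_ids())  # e.g. [None, 0, 1, 2, 3, 4, None]; special tokens map to None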

Why is it so difficult to perform tokenization?

It is difficult to perform because the process of reading and understanding language is far more complex than it seems at first glance. Tokenization is the process of splitting a string of text into a list of tokens.

Is Hugging Face's RoBERTa model BERT-like?

We'll train a RoBERTa model, which is BERT-like with a couple of changes (check the documentation for more details). In summary: "It builds on BERT and modifies key hyperparameters, removing the next-sentence pretraining objective and training with much larger mini-batches and learning rates" (Hugging Face documentation on RoBERTa).


1 Answer

Hugging Face's Transformers library is designed so that you are not supposed to do any pre-tokenization.
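
For example (a minimal sketch, not part of the original answer), you simply pass the raw string and let the tokenizer do its own splitting:

from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained('roberta-base')

# Pass the raw, untokenized sentence from the question directly.
ids = tokenizer("The tiger is ___ (big) than the dog.")["input_ids"]
print(tokenizer.decode(ids))  # e.g. '<s>The tiger is ___ (big) than the dog.</s>'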

RoBERTa uses a byte-level BPE tokenizer (the same one as GPT-2), and its pre-tokenization is lossless. I.e., when you have tokenized text, you should always be able to say what the text looked like before tokenization. The Ġ (which plays the same role as ▁, the weird Unicode underscore used by SentencePiece) says that there should be a space when you detokenize. As a consequence, big and Ġbig end up as different tokens. Of course, in this particular context it does not make much sense, because it is obviously still the same word, but this is the price you pay for lossless tokenization, and it is also how RoBERTa was trained.
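
A small sketch of that losslessness (not part of the original answer): converting the two tokenizations from the question back to strings recovers their exact original spacing:

from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
print(tokenizer.convert_tokens_to_string(['Ġ(', 'big', ')']))    # ' (big)'
print(tokenizer.convert_tokens_to_string(['Ġ(', 'Ġbig', 'Ġ)']))  # ' ( big )'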

BERT uses WordPiece, which does not suffer from this problem. On the other hand, the mapping between the original string and the tokenized text is not as straightforward (which might be inconvenient, e.g., when you want to highlight something in user-generated text).
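
If you do need that mapping (e.g., for highlighting), the fast tokenizers can return character offsets; a hedged sketch, not part of the original answer:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
text = "The tiger is big."
enc = tokenizer(text, return_offsets_mapping=True)
for token, (start, end) in zip(enc.tokens(), enc["offset_mapping"]):
    print(token, repr(text[start:end]))  # special tokens get the empty span (0, 0)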

Jindřich answered Jan 03 '23 08:01