
Difficulty in understanding the tokenizer used in Roberta model

from transformers import AutoModel, AutoTokenizer

tokenizer1 = AutoTokenizer.from_pretrained("roberta-base")
tokenizer2 = AutoTokenizer.from_pretrained("bert-base-cased")

sequence = "A Titan RTX has 24GB of VRAM"
print(tokenizer1.tokenize(sequence))
print(tokenizer2.tokenize(sequence))

Output:

['A', 'ĠTitan', 'ĠRTX', 'Ġhas', 'Ġ24', 'GB', 'Ġof', 'ĠVR', 'AM']

['A', 'Titan', 'R', '##T', '##X', 'has', '24', '##GB', 'of', 'V', '##RA', '##M']

The BERT model uses the WordPiece tokenizer. Any word that does not occur in the WordPiece vocabulary is broken down greedily into subwords. For example, 'RTX' is broken into 'R', '##T' and '##X', where '##' indicates it is a subtoken.
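As far as I understand, the greedy splitting works roughly like the longest-match-first lookup sketched below (with a made-up toy vocabulary, not BERT's real one):

# Toy sketch of greedy longest-match-first subword splitting (WordPiece-style).
# The vocabulary below is invented for illustration; BERT's real vocabulary has ~30k entries.
def wordpiece_tokenize(word, vocab):
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece_found = None
        # Try the longest possible substring first, shrinking until a match is found.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces are marked with ##
            if piece in vocab:
                piece_found = piece
                break
            end -= 1
        if piece_found is None:
            return ["[UNK]"]  # nothing matched, the whole word becomes unknown
        tokens.append(piece_found)
        start = end
    return tokens

toy_vocab = {"R", "##T", "##X", "24", "##GB"}
print(wordpiece_tokenize("RTX", toy_vocab))   # ['R', '##T', '##X']
print(wordpiece_tokenize("24GB", toy_vocab))  # ['24', '##GB']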

RoBERTa uses a BPE tokenizer, but I'm unable to understand:

a) how does the BPE tokenizer work?

b) what does the Ġ represent in each of the tokens?

asked Apr 10 '20 by Mr. NLP

People also ask

What tokenizer does RoBERTa use?

RoBERTa has the same architecture as BERT, but uses a byte-level BPE as a tokenizer (same as GPT-2) and uses a different pretraining scheme.
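Since RoBERTa reuses GPT-2's byte-level BPE vocabulary and merges, tokenizing the same string with both tokenizers should give identical splits. A quick check, using the same example sentence as above:

from transformers import AutoTokenizer

# RoBERTa reuses GPT-2's byte-level BPE vocabulary and merges, so both
# tokenizers should produce the same splits for plain text.
roberta_tok = AutoTokenizer.from_pretrained("roberta-base")
gpt2_tok = AutoTokenizer.from_pretrained("gpt2")

sequence = "A Titan RTX has 24GB of VRAM"
print(roberta_tok.tokenize(sequence))
print(gpt2_tok.tokenize(sequence))
# Both are expected to print:
# ['A', 'ĠTitan', 'ĠRTX', 'Ġhas', 'Ġ24', 'GB', 'Ġof', 'ĠVR', 'AM']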

How does the RoBERTa tokenizer work?

A tokenizer breaks a string of characters, usually sentences of text, into tokens and maps each token to an integer representation, often by looking for whitespace (tabs, spaces, newlines). It usually splits a sentence into words, but there are other options such as subwords.
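With the huggingface tokenizers, the two steps (splitting into subword strings and mapping them to integer IDs) can be inspected separately. A small sketch using roberta-base; the actual ID values depend on its vocabulary:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")

sequence = "A Titan RTX has 24GB of VRAM"
tokens = tokenizer.tokenize(sequence)          # subword strings
ids = tokenizer.convert_tokens_to_ids(tokens)  # integer indices into the vocabulary
print(tokens)
print(ids)
# Calling the tokenizer directly does both steps and adds the special <s> ... </s> tokens:
print(tokenizer(sequence)["input_ids"])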

What are some of the challenges with tokenizing strings into words?

One of the biggest challenges in tokenization is determining word boundaries. In English, word boundaries are usually marked by spaces and sentence boundaries by punctuation, but this is not the case in all languages.




1 Answer

This question is extremely broad, so I'm trying to give an answer that focuses on the main problem at hand. If you need other questions answered, please open a separate question focusing on one issue at a time; see the help/on-topic rules for Stack Overflow.

Essentially, as you've correctly identified, BPE is central to any tokenization in modern deep networks. I highly recommend reading the original BPE paper by Sennrich et al., which also highlights a bit more of the history of BPE.
In any case, the tokenizers for the huggingface models are pretrained, meaning they are generated beforehand from the model's training data. Common implementations such as SentencePiece also give a bit better understanding of it, but essentially the task is framed as a constrained optimization problem: you specify a maximum number k of allowed vocabulary entries (the constraint), and the algorithm then tries to keep as many words intact as possible without exceeding k.
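To make the training idea more concrete, here is a toy sketch in the spirit of the pseudocode from the Sennrich et al. paper. It is not what huggingface runs internally, just an illustration of learning merges from pair frequencies, with num_merges playing the role of the budget k:

import re
from collections import Counter

# Toy BPE training loop: words are sequences of symbols, and each iteration
# merges the most frequent adjacent symbol pair into a new vocabulary symbol.
def get_pair_counts(corpus):
    pairs = Counter()
    for word, freq in corpus.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs

def merge_pair(pair, corpus):
    # Only merge the pair where it appears as two whole, adjacent symbols.
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in corpus.items()}

# Word frequencies, with characters separated by spaces and an end-of-word marker.
corpus = {"l o w </w>": 5, "l o w e r </w>": 2, "n e w e s t </w>": 6, "w i d e s t </w>": 3}

num_merges = 5  # plays the role of the vocabulary-size budget k
for _ in range(num_merges):
    pairs = get_pair_counts(corpus)
    best = max(pairs, key=pairs.get)
    corpus = merge_pair(best, corpus)
    print("merged", best)
print(corpus)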

If the vocabulary budget cannot hold every word, smaller units are used to approximate the full vocabulary, which results in the splits observed in the example you gave. RoBERTa uses a variant called "byte-level BPE"; the best explanation is probably given in this study by Wang et al. The main benefit is that it results in a smaller vocabulary while maintaining the quality of splits, from what I understand.
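One practical consequence of working on bytes is that there is no out-of-vocabulary token at all: any character can be decomposed into byte pieces that are guaranteed to be in the vocabulary. A quick way to see the difference (my own illustration) is to tokenize a character that BERT's WordPiece vocabulary does not contain:

from transformers import AutoTokenizer

roberta_tok = AutoTokenizer.from_pretrained("roberta-base")
bert_tok = AutoTokenizer.from_pretrained("bert-base-cased")

text = "A Titan RTX 🚀"
# Byte-level BPE falls back to byte-level pieces, so the emoji still gets a
# (possibly odd-looking) token sequence; WordPiece should map it to [UNK].
print(roberta_tok.tokenize(text))
print(bert_tok.tokenize(text))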

The second part of your question is easier to explain: while BERT marks subword continuations (with ##), RoBERTa's tokenizer instead marks the start of a new word (i.e., a token preceded by a space) with a specific Unicode character (in this case \u0120, the character Ġ, a G with a dot above). The best reason I could find for this was this thread, which argues that it basically avoids the use of whitespace in training.
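You can verify this directly: the character in front of those tokens is literally \u0120, and convert_tokens_to_string puts the whitespace back when going from tokens to text:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")

tokens = tokenizer.tokenize("A Titan RTX has 24GB of VRAM")
print(tokens[1])                  # 'ĠTitan'
print(tokens[1][0] == "\u0120")   # True: the leading Ġ marks "this token was preceded by a space"
print(tokenizer.convert_tokens_to_string(tokens))  # 'A Titan RTX has 24GB of VRAM'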

answered Sep 19 '22 by dennlinger