from transformers import AutoModel, AutoTokenizer
tokenizer1 = AutoTokenizer.from_pretrained("roberta-base")
tokenizer2 = AutoTokenizer.from_pretrained("bert-base-cased")
sequence = "A Titan RTX has 24GB of VRAM"
print(tokenizer1.tokenize(sequence))
print(tokenizer2.tokenize(sequence))
Output:
['A', 'ĠTitan', 'ĠRTX', 'Ġhas', 'Ġ24', 'GB', 'Ġof', 'ĠVR', 'AM']
['A', 'Titan', 'R', '##T', '##X', 'has', '24', '##GB', 'of', 'V', '##RA', '##M']
The BERT model uses a WordPiece tokenizer. Any word that does not occur in the WordPiece vocabulary is broken down into subwords greedily. For example, 'RTX' is broken into 'R', '##T' and '##X', where ## indicates that the piece is a subtoken continuing the previous token.
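As a quick sanity check, the ## pieces can be glued back onto the first piece with convert_tokens_to_string (a standard method on Hugging Face tokenizers); a minimal sketch, reusing tokenizer2 from the snippet above:
tokens = tokenizer2.tokenize("RTX")
print(tokens)                                       # ['R', '##T', '##X'], as in the output above
print(tokenizer2.convert_tokens_to_string(tokens))  # the ## pieces merge back into 'RTX'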
RoBERTa uses a BPE tokenizer, but I'm unable to understand:
a) how the BPE tokenizer works, and
b) what the Ġ represents in each of the tokens.
RoBERTa has the same architecture as BERT, but uses byte-level BPE as its tokenizer (the same as GPT-2) and uses a different pretraining scheme.
A tokenizer breaks a string of characters, usually sentences of text, into tokens and maps each token to an integer ID, typically by looking for whitespace (tabs, spaces, newlines). It usually splits a sentence into words, but there are many other options, such as subwords.
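To make the token-to-integer step concrete, here is a minimal sketch using the BERT tokenizer from the question; tokenize, convert_tokens_to_ids and encode are standard Hugging Face tokenizer methods (the exact IDs depend on the vocabulary):
tokens = tokenizer2.tokenize("A Titan RTX has 24GB of VRAM")
print(tokens)                                    # the subword strings
print(tokenizer2.convert_tokens_to_ids(tokens))  # the integer IDs for those strings
print(tokenizer2.encode("A Titan RTX has 24GB of VRAM"))  # same IDs, plus the [CLS]/[SEP] special tokens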
One of the biggest challenges in tokenization is finding word boundaries. In English a word boundary is usually marked by a space, and punctuation marks define sentence boundaries, but this is not the case in all languages.
This question is extremely broad, so I'm trying to give an answer that focuses on the main problem at hand. If you feel the need to have other questions answered, please open another question focusing on one question at a time; see the [help/on-topic] rules for Stack Overflow.
Essentially, as you've correctly identified, BPE is central to tokenization in modern deep networks. I highly recommend reading the original BPE paper by Sennrich et al., in which they also highlight a bit more of the history of BPE.
In any case, the tokenizers for any of the Hugging Face models are pretrained, meaning that they are generated beforehand from the training data. Common implementations such as SentencePiece also give a bit better understanding of it, but essentially the task is framed as a constrained optimization problem: you specify a maximum number k of allowed vocabulary entries (the constraint), and the algorithm then tries to keep as many words intact as possible without exceeding k.
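As a rough illustration of that constrained setup, here is a minimal sketch that trains a BPE vocabulary with the Hugging Face tokenizers library; the vocab_size argument plays the role of k, and corpus.txt is just a placeholder for whatever training text you have:
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

# start from an empty BPE model with an explicit unknown token
bpe_tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
bpe_tokenizer.pre_tokenizer = Whitespace()

# vocab_size is the constraint k: merges of frequent symbol pairs are added
# until the vocabulary reaches this size
trainer = BpeTrainer(vocab_size=1000, special_tokens=["[UNK]"])
bpe_tokenizer.train(files=["corpus.txt"], trainer=trainer)  # corpus.txt is a placeholder

print(bpe_tokenizer.encode("A Titan RTX has 24GB of VRAM").tokens)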
If there are not enough entries to cover the whole vocabulary, smaller units are used to approximate it, which results in the splits observed in the example you gave. RoBERTa uses a variant called "byte-level BPE"; the best explanation is probably given in this study by Wang et al. The main benefit, from what I understand, is that it results in a smaller vocabulary while maintaining the quality of the splits.
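One practical consequence you can check yourself: because byte-level BPE can always fall back to raw bytes, roberta-base does not need an unknown token for unseen characters, whereas a WordPiece vocabulary like bert-base-cased may fall back to [UNK]. A small sketch, reusing tokenizer1 (RoBERTa) and tokenizer2 (BERT) from your snippet (the exact splits depend on the vocabularies):
print(tokenizer2.tokenize("🤖"))  # likely ['[UNK]'] -- the emoji is not in the WordPiece vocabulary
print(tokenizer1.tokenize("🤖"))  # byte-level BPE splits it into byte-level pieces instead of [UNK]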
The second part of your question is easier to explain: while BERT highlights the merging of a token with the previous one (with ##), RoBERTa's tokenizer instead highlights the start of a new token with a specific Unicode character (in this case \u0120, the Ġ, a G with a dot above). The best reason I could find for this was this thread, which argues that it basically avoids the use of whitespace in training.
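You can verify this by replacing \u0120 with a plain space, or by letting the tokenizer reassemble the string; a minimal sketch, again reusing tokenizer1 from your snippet:
tokens = tokenizer1.tokenize("A Titan RTX has 24GB of VRAM")
print([t.replace("\u0120", " ") for t in tokens])   # Ġ is just an encoded leading space
print(tokenizer1.convert_tokens_to_string(tokens))  # decodes back to the original sentence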