
BertWordPieceTokenizer vs BertTokenizer from HuggingFace

I have the following pieces of code and am trying to understand the difference between BertWordPieceTokenizer and BertTokenizer.

BertWordPieceTokenizer (Rust-based)

from tokenizers import BertWordPieceTokenizer

sequence = "Hello, y'all! How are you Tokenizer 😁 ?"
tokenizer = BertWordPieceTokenizer("bert-base-uncased-vocab.txt")
tokenized_sequence = tokenizer.encode(sequence)
print(tokenized_sequence)
>>>Encoding(num_tokens=15, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])

print(tokenized_sequence.tokens)
>>>['[CLS]', 'hello', ',', 'y', "'", 'all', '!', 'how', 'are', 'you', 'token', '##izer', '[UNK]', '?', '[SEP]']

BertTokenizer

from transformers import BertTokenizer
tokenizer = BertTokenizer("bert-base-cased-vocab.txt")
tokenized_sequence = tokenizer.encode(sequence)
print(tokenized_sequence)
#Output: [19082, 117, 194, 112, 1155, 106, 1293, 1132, 1128, 22559, 17260, 100, 136]
  1. Why does encode work differently in the two? In BertWordPieceTokenizer it gives an Encoding object, while in BertTokenizer it gives the ids of the vocab.
  2. What is the fundamental difference between BertWordPieceTokenizer and BertTokenizer? As I understand it, BertTokenizer also uses WordPiece under the hood.

Thanks

asked Dec 18 '22 by HopeKing

1 Answer

They should produce the same output when you use the same vocabulary (in your example you used bert-base-uncased-vocab.txt for one and bert-base-cased-vocab.txt for the other, which is why the outputs differ). The main difference is that the tokenizers from the tokenizers package are faster than the tokenizers from transformers, because they are implemented in Rust.
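To illustrate the speed difference, here is a minimal timing sketch. The exact numbers depend on your machine and library versions, and it assumes bert-base-uncased-vocab.txt is present in the working directory:

import time
from tokenizers import BertWordPieceTokenizer
from transformers import BertTokenizer

sequence = "Hello, y'all! How are you Tokenizer 😁 ?"
fast = BertWordPieceTokenizer("bert-base-uncased-vocab.txt")  # Rust-backed
slow = BertTokenizer("bert-base-uncased-vocab.txt")           # pure Python

start = time.perf_counter()
for _ in range(1_000):
    fast.encode(sequence)
print("tokenizers:  ", time.perf_counter() - start)

start = time.perf_counter()
for _ in range(1_000):
    slow.encode(sequence)
print("transformers:", time.perf_counter() - start)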

When you modify your example so that both use the same vocabulary, you will see that they produce the same ids; the tokenizers version returns an Encoding object with additional attributes, while the transformers tokenizer produces only a list of ids:

from tokenizers import BertWordPieceTokenizer

sequence = "Hello, y'all! How are you Tokenizer 😁 ?"
tokenizerBW = BertWordPieceTokenizer("/content/bert-base-uncased-vocab.txt")
tokenized_sequenceBW = tokenizerBW.encode(sequence)
print(tokenized_sequenceBW)
print(type(tokenized_sequenceBW))
print(tokenized_sequenceBW.ids)

Output:

Encoding(num_tokens=15, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])
<class 'Encoding'>
[101, 7592, 1010, 1061, 1005, 2035, 999, 2129, 2024, 2017, 19204, 17629, 100, 1029, 102]

from transformers import BertTokenizer

tokenizerBT = BertTokenizer("/content/bert-base-uncased-vocab.txt")
tokenized_sequenceBT = tokenizerBT.encode(sequence)
print(tokenized_sequenceBT)
print(type(tokenized_sequenceBT))

Output:

[101, 7592, 1010, 1061, 1005, 2035, 999, 2129, 2024, 2017, 19204, 17629, 100, 1029, 102]
<class 'list'>
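To verify that these ids correspond to the same tokens as the BertWordPieceTokenizer output, you can map them back with convert_ids_to_tokens; a quick sketch:

from transformers import BertTokenizer

sequence = "Hello, y'all! How are you Tokenizer 😁 ?"
tokenizerBT = BertTokenizer("/content/bert-base-uncased-vocab.txt")
ids = tokenizerBT.encode(sequence)
# Map the ids back to token strings to compare with tokenized_sequenceBW.tokens
print(tokenizerBT.convert_ids_to_tokens(ids))
# ['[CLS]', 'hello', ',', 'y', "'", 'all', '!', 'how', 'are', 'you', 'token', '##izer', '[UNK]', '?', '[SEP]']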

You mentioned in the comments that your question is more about why the produced output differs. As far as I can tell, this was a design decision made by the developers, and there is no specific reason for it.

It is also not the case that BertWordPieceTokenizer from tokenizers is a drop-in replacement for BertTokenizer from transformers: transformers still uses a wrapper to make it compatible with the transformers tokenizer API. There is a BertTokenizerFast class with a "clean up" method _convert_encoding that makes the BertWordPieceTokenizer fully compatible. Therefore, you have to compare the BertTokenizer example above with the following:

from transformers import BertTokenizerFast

sequence = "Hello, y'all! How are you Tokenizer 😁 ?"
tokenizerBW = BertTokenizerFast.from_pretrained("bert-base-uncased")
tokenized_sequenceBW = tokenizerBW.encode(sequence)
print(tokenized_sequenceBW)
print(type(tokenized_sequenceBW))

Output:

[101, 7592, 1010, 1061, 1005, 2035, 999, 2129, 2024, 2017, 19204, 17629, 100, 1029, 102]
<class 'list'>
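If you still need the extra attributes of the Encoding object (offsets, special tokens mask, etc.) through the transformers API, the fast tokenizer can expose them when called directly; a small sketch (return_offsets_mapping is only supported by the fast, Rust-backed tokenizers):

from transformers import BertTokenizerFast

sequence = "Hello, y'all! How are you Tokenizer 😁 ?"
tokenizerBW = BertTokenizerFast.from_pretrained("bert-base-uncased")
# Calling the tokenizer directly returns a dict-like BatchEncoding,
# not a plain list of ids
encoded = tokenizerBW(sequence, return_offsets_mapping=True,
                      return_special_tokens_mask=True)
print(encoded["input_ids"])
print(encoded["offset_mapping"])       # (start, end) character span per token
print(encoded["special_tokens_mask"])  # 1 for [CLS]/[SEP], 0 otherwise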

From my perspective, they built the tokenizers library independently from the transformers library, with the objective of being fast and useful.

answered Jan 06 '23 by cronoik