According to this link, target_vocab_size is described as:
int, approximate size of the vocabulary to create.
This statement is pretty ambiguous to me. As far as I understand, the encoder maps each subword to a unique ID. What happens if the corpus contains a vocabulary larger than target_vocab_size?
The documentation says:
Encoding is fully invertible because all out-of-vocab wordpieces are byte-encoded
This means unknown word pieces will be encoded one character at a time. It's best understood through an example. Let's suppose you build a SubwordTextEncoder
using a very large corpus of English text such that most of the common words are in vocabulary.
import tensorflow_datasets as tfds

# corpus_sentences: an iterable of strings (the training corpus)
vocab_size = 10000
tokenizer = tfds.features.text.SubwordTextEncoder.build_from_corpus(
    corpus_sentences, vocab_size)
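Note that the target size is only approximate; the resulting vocabulary will not contain exactly 10000 entries. As a quick sanity check (a small sketch, using the tokenizer built above), you can inspect the encoder after building it:

print(tokenizer.vocab_size)      # close to, but usually not exactly, 10000
print(tokenizer.subwords[:10])   # a few of the learned subword pieces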
Let's say you try to tokenize the following sentence.
tokenizer.encode("good badwords badxyz")
It will be tokenized into a list of integer IDs. Common pieces such as "good" and "bad" map to single in-vocabulary subwords, but since the word piece "xyz" is not in the vocabulary, it is tokenized as individual characters.
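You can see this for yourself by decoding each token ID on its own (a small sketch, assuming the tokenizer built above):

ids = tokenizer.encode("good badwords badxyz")

# Decoding each ID individually reveals the subword pieces;
# the out-of-vocabulary tail "xyz" shows up as single characters/bytes.
for token_id in ids:
    print(token_id, "->", tokenizer.decode([token_id]))

# Because out-of-vocab pieces are byte-encoded, the round trip is lossless.
assert tokenizer.decode(ids) == "good badwords badxyz"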