Here is an example of sequence classification, using a model to determine whether two sequences are paraphrases of each other. The two snippets below give different results. Can you help me explain why tokenizer.encode and tokenizer.encode_plus produce different results?
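For context, the snippets assume a setup along the following lines; the checkpoint name and the three example sentences are assumptions modeled on the usual Hugging Face paraphrase (MRPC) example, not something shown in the snippets themselves:

from transformers import AutoTokenizer, AutoModelForSequenceClassification

# A BERT checkpoint fine-tuned for paraphrase classification (assumed)
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased-finetuned-mrpc")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased-finetuned-mrpc")

sequence_0 = "The company HuggingFace is based in New York City"
sequence_1 = "Apples are especially bad for your health"
sequence_2 = "HuggingFace's headquarters are situated in Manhattan"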
Example 1 (with .encode_plus()):
paraphrase = tokenizer.encode_plus(sequence_0, sequence_2, return_tensors="pt")
not_paraphrase = tokenizer.encode_plus(sequence_0, sequence_1, return_tensors="pt")
paraphrase_classification_logits = model(**paraphrase)[0]
not_paraphrase_classification_logits = model(**not_paraphrase)[0]
Example 2 (with .encode()):
paraphrase = tokenizer.encode(sequence_0, sequence_2, return_tensors="pt")
not_paraphrase = tokenizer.encode(sequence_0, sequence_1, return_tensors="pt")
paraphrase_classification_logits = model(paraphrase)[0]
not_paraphrase_classification_logits = model(not_paraphrase)[0]
A tokenizer is in charge of preparing the inputs for a model. The library contains tokenizers for all the models. Most of the tokenizers are available in two flavors: a full python implementation and a “Fast” implementation based on the Rust library 🤗 Tokenizers.
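As a small illustration of the two flavors (the checkpoint name here is just an assumption), both can be loaded explicitly or via AutoTokenizer:

from transformers import AutoTokenizer, BertTokenizer, BertTokenizerFast

# Pure Python implementation
slow_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# "Fast" implementation backed by the Rust 🤗 Tokenizers library
fast_tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

# AutoTokenizer picks the fast variant by default when one is available;
# use_fast=False forces the pure Python implementation
auto_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=False)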
In NLP, tokenization is the process of splitting raw text into smaller units called "tokens" (words, subwords, or characters), which are then mapped to integer IDs that the model can consume.
add_special_tokens (bool, optional, defaults to True) — Whether or not to encode the sequences with the special tokens relative to their model.
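For example, with the BERT tokenizer assumed above, the effect of add_special_tokens looks roughly like this (exact tokens depend on the checkpoint):

ids_with = tokenizer.encode("Hello world", add_special_tokens=True)
ids_without = tokenizer.encode("Hello world", add_special_tokens=False)

print(tokenizer.convert_ids_to_tokens(ids_with))     # e.g. ['[CLS]', 'Hello', 'world', '[SEP]']
print(tokenizer.convert_ids_to_tokens(ids_without))  # e.g. ['Hello', 'world']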
The main difference stems from the additional information that encode_plus provides. If you read the documentation of the respective functions, the difference becomes clear. For encode():
Converts a string in a sequence of ids (integer), using the tokenizer and vocabulary. Same as doing
self.convert_tokens_to_ids(self.tokenize(text)).
and the description of encode_plus():
Returns a dictionary containing the encoded sequence or sequence pair and additional information: the mask for sequence classification and the overflowing elements if a
max_length is specified.
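Concretely, with a BERT tokenizer the two calls return different structures (a minimal sketch using the setup assumed above):

# encode returns only the token ids (a list, or a tensor with return_tensors="pt")
ids = tokenizer.encode(sequence_0, sequence_2)
print(type(ids))        # <class 'list'>

# encode_plus returns a dictionary with the ids plus additional model inputs
encoding = tokenizer.encode_plus(sequence_0, sequence_2)
print(encoding.keys())  # dict_keys(['input_ids', 'token_type_ids', 'attention_mask']) for BERT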
Depending on your specified model and input sentences, the difference lies in the additionally encoded information, specifically the token type IDs (segment IDs) and the attention mask. Since you are feeding in two sentences at a time, BERT (and likely other model variants) expects token type IDs that tell it which tokens belong to the first sequence and which to the second, while the attention mask marks real tokens as opposed to padding. Because encode_plus provides this information but encode returns only the input IDs (so the model falls back to defaults such as all-zero token type IDs), you get different output results.
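A minimal sketch of what that means in practice, assuming the BERT MRPC setup above: passing only the input IDs makes the model fall back to all-zero token type IDs, while supplying the missing inputs by hand should reproduce the encode_plus result.

import torch

encoding = tokenizer.encode_plus(sequence_0, sequence_2, return_tensors="pt")

# encode_plus path: input_ids, token_type_ids and attention_mask all reach the model
logits_plus = model(**encoding)[0]

# encode path: only the input_ids are passed; token_type_ids default to zeros
logits_ids_only = model(encoding["input_ids"])[0]

# Passing the extra inputs explicitly should match the encode_plus result
logits_manual = model(
    encoding["input_ids"],
    attention_mask=encoding["attention_mask"],
    token_type_ids=encoding["token_type_ids"],
)[0]

print(torch.allclose(logits_plus, logits_manual))    # expected: True
print(torch.allclose(logits_plus, logits_ids_only))  # expected: False for a sentence pair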