What's the difference between tokenizer.encode and tokenizer.encode_plus in Hugging Face?

Tags:

huggingface-transformers

Here is an example of sequence classification with a model that determines whether two sequences are paraphrases of each other. The two examples below give different results. Can you help explain why tokenizer.encode and tokenizer.encode_plus give different results?

Example 1 (with .encode_plus()):

paraphrase = tokenizer.encode_plus(sequence_0, sequence_2, return_tensors="pt")
not_paraphrase = tokenizer.encode_plus(sequence_0, sequence_1, return_tensors="pt")

paraphrase_classification_logits = model(**paraphrase)[0]
not_paraphrase_classification_logits = model(**not_paraphrase)[0]

Example 2 (with .encode()):

paraphrase = tokenizer.encode(sequence_0, sequence_2, return_tensors="pt")
not_paraphrase = tokenizer.encode(sequence_0, sequence_1, return_tensors="pt")

paraphrase_classification_logits = model(paraphrase)[0]
not_paraphrase_classification_logits = model(not_paraphrase)[0]


asked May 10 '20 07:05

andy

1 Answer

The main difference stems from the additional information that encode_plus provides. The documentation describes the two functions slightly differently. For encode():

Converts a string in a sequence of ids (integer), using the tokenizer and vocabulary. Same as doing self.convert_tokens_to_ids(self.tokenize(text)).

and the description of encode_plus():

Returns a dictionary containing the encoded sequence or sequence pair and additional information: the mask for sequence classification and the overflowing elements if a max_length is specified.

Depending on your model and input sentences, the difference lies in the additionally encoded information, specifically the segment mask (token_type_ids). Since you are feeding in two sentences at a time, BERT (and likely other model variants) expects some form of masking that allows the model to tell the two sequences apart. encode_plus provides this information and encode does not, so model(**paraphrase) receives the mask while model(paraphrase) receives only the input ids, and you get different outputs.
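To make the difference concrete, here is a minimal sketch in plain Python that mimics the shape of the two return values for a sentence pair. It does not call the transformers library: the vocabulary and the helpers toy_ids, toy_encode and toy_encode_plus are hypothetical stand-ins, and the layout ([CLS] A [SEP] B [SEP], with segment ids 0 for the first part and 1 for the second) is assumed to follow the usual BERT convention.

```python
# Toy illustration only: a hand-rolled stand-in for a BERT-style tokenizer.
# The vocabulary and all helper names are hypothetical, not library API.
TOY_VOCAB = {"[PAD]": 0, "[CLS]": 1, "[SEP]": 2, "[UNK]": 3,
             "the": 4, "cat": 5, "sat": 6, "a": 7, "dog": 8, "ran": 9}


def toy_ids(text):
    """Split on whitespace and map each token to its id ([UNK] if missing)."""
    return [TOY_VOCAB.get(tok, TOY_VOCAB["[UNK]"]) for tok in text.lower().split()]


def toy_encode(text_a, text_b):
    """Mimic encode() on a pair: return only the ids, special tokens included."""
    cls, sep = TOY_VOCAB["[CLS]"], TOY_VOCAB["[SEP]"]
    return [cls] + toy_ids(text_a) + [sep] + toy_ids(text_b) + [sep]


def toy_encode_plus(text_a, text_b):
    """Mimic encode_plus() on a pair: the ids plus the two masks."""
    input_ids = toy_encode(text_a, text_b)
    len_a = len(toy_ids(text_a)) + 2   # [CLS] + first sequence + [SEP]
    len_b = len(input_ids) - len_a     # second sequence + final [SEP]
    return {
        "input_ids": input_ids,
        "token_type_ids": [0] * len_a + [1] * len_b,  # which segment a token belongs to
        "attention_mask": [1] * len(input_ids),       # no padding in this example
    }


ids_only = toy_encode("the cat sat", "a dog ran")
full = toy_encode_plus("the cat sat", "a dog ran")

print(ids_only)                 # [1, 4, 5, 6, 2, 7, 8, 9, 2]
print(full["token_type_ids"])   # [0, 0, 0, 0, 0, 1, 1, 1, 1]
```

Both calls produce the same input ids; only the dictionary also says where the second sequence starts. As far as I know, when a BERT model is given input ids without token_type_ids it falls back to all zeros, so in Example 2 the whole pair is treated as a single segment, which would explain the different logits.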


answered Sep 20 '22 17:09

dennlinger
