 

BertTokenizer - when encoding and decoding sequences extra spaces appear

When using Transformers from HuggingFace, I am facing a problem with the encode and decode methods.

I have the following string:

test_string = 'text with percentage%'

Then I am running the following code:

import torch
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-cased')

test_string = 'text with percentage%'

# encode converts a string into a sequence of ids (integers), using the tokenizer and vocabulary.
input_ids = tokenizer.encode(test_string)
output = tokenizer.decode(input_ids)

And the output looks like this:

'text with percentage %'

Note the extra space before the %. I have tried extra arguments such as clean_up_tokenization_spaces, but that option addresses something different.
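
You can see where the space comes from by inspecting the tokens themselves (a quick check with the same bert-base-cased tokenizer): the basic tokenizer splits punctuation into its own token, and decode() rejoins whole-word tokens with a single space.

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-cased')

# The punctuation splitter turns 'percentage%' into two separate tokens,
# and decode() joins whole-word tokens with a single space:
print(tokenizer.tokenize('text with percentage%'))
# ['text', 'with', 'percentage', '%']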

What should I use when encoding and decoding to get exactly the same text before and after? This also happens with other special characters.

asked Nov 21 '19 by Henryk Borzymowski

People also ask

What is hugging face tokenizer?

A tokenizer is in charge of preparing the inputs for a model. The library contains tokenizers for all the models. Most of the tokenizers are available in two flavors: a full python implementation and a “Fast” implementation based on the Rust library 🤗 Tokenizers.
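
For BERT, the two flavors can be loaded like this (a minimal sketch; both classes live in transformers):

from transformers import BertTokenizer, BertTokenizerFast

slow_tokenizer = BertTokenizer.from_pretrained('bert-base-cased')      # pure Python
fast_tokenizer = BertTokenizerFast.from_pretrained('bert-base-cased')  # Rust-backed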

What is a special token?

Special tokens are called special because they are not derived from your input. They are added for a certain purpose and are independent of the specific input.
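
For instance, encode() adds BERT's [CLS] and [SEP] special tokens by default, which is easy to verify:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
ids = tokenizer.encode('text with percentage%')
print(tokenizer.convert_ids_to_tokens(ids))
# ['[CLS]', 'text', 'with', 'percentage', '%', '[SEP]']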

What is the output of tokenizer?

The output is a list of tuples, with each tuple containing one word and its span in the original sentence (which is used to determine the final offsets of our Encoding).
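
That description refers to the pre-tokenization step in the 🤗 Tokenizers library; a small illustration using its Whitespace pre-tokenizer:

from tokenizers.pre_tokenizers import Whitespace

# Each tuple is (word, (start, end)) with character offsets into the input
pre_tokenizer = Whitespace()
print(pre_tokenizer.pre_tokenize_str('text with percentage%'))
# [('text', (0, 4)), ('with', (5, 9)), ('percentage', (10, 20)), ('%', (20, 21))]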

What is BatchEncoding?

BatchEncoding holds the output of the tokenizer's encoding methods (__call__, encode_plus and batch_encode_plus) and is derived from a Python dictionary.
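
A quick sketch of what that looks like in practice (dict-style access works, and fast tokenizers add convenience methods such as word_ids()):

from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained('bert-base-cased')
encoding = tokenizer('text with percentage%')

print(encoding['input_ids'])   # dict-style access, like a plain dict
print(encoding.word_ids())     # extra method available on fast-tokenizer output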




2 Answers

If you are trying to use BERT for token classification in order to find a span in your original string, then one workaround is to use BertTokenizerFast with the option return_offsets_mapping=True.

from transformers import BertTokenizerFast

test_string = 'text with percentage%'

tokenizer = BertTokenizerFast.from_pretrained('bert-base-cased')
tokens = tokenizer(test_string, return_offsets_mapping=True)
input_ids = tokens.data["input_ids"]

# some_model is a placeholder for a token-classification model that
# returns the start and end token indices of the predicted span
span_start_index, span_stop_index = some_model(input_ids)

Then once you get the token classification results, you can do something like

offsets = tokens.encodings[0].offsets
predicted_span = test_string[offsets[span_start_index][0]:offsets[span_stop_index][1]]
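
To make the idea concrete (hypothetical indices: suppose the model predicted token positions 3 and 4, which for this input are 'percentage' and '%'), the offsets map straight back into the raw string with no spurious space:

predicted_span = test_string[offsets[3][0]:offsets[4][1]]
print(predicted_span)  # 'percentage%'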
answered Oct 18 '22 by vermouth


According to https://github.com/huggingface/transformers/pull/1274, they're working on it. Hopefully there will be a solution sometime next week.

answered Oct 18 '22 by Anjie Guo