When using Transformers from Hugging Face, I am facing a problem with the encode and decode methods.
I have the following string:
test_string = 'text with percentage%'
Then I am running the following code:
import torch
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
test_string = 'text with percentage%'
# encode() converts a string into a sequence of ids (integers), using the tokenizer and vocabulary.
input_ids = tokenizer.encode(test_string)
output = tokenizer.decode(input_ids)
And the output looks like this:
'text with percentage %'
Note the extra space before the %. I have tried extra arguments like clean_up_tokenization_spaces, but that option is for something different.
What should I use when encoding and decoding to get exactly the same text back? This also happens with other special characters.
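For reference, the extra space seems to come from the way BertTokenizer splits punctuation into standalone tokens and decode() then joins tokens with single spaces. A minimal check (the commented output is what I would expect from bert-base-cased, so treat it as illustrative):
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-cased')

# Punctuation such as % is split off into its own token:
print(tokenizer.tokenize('text with percentage%'))
# expected: ['text', 'with', 'percentage', '%']

# decode() rebuilds the string by joining tokens with single spaces,
# so 'percentage%' round-trips as 'percentage %'. The
# clean_up_tokenization_spaces option only undoes a fixed list of
# patterns (' .', ' ,', ' ?', ...) that does not include ' %'.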
For context, from the Hugging Face documentation: a tokenizer is in charge of preparing the inputs for a model. The library contains tokenizers for all the models. Most of the tokenizers are available in two flavors: a full Python implementation and a “Fast” implementation based on the Rust library 🤗 Tokenizers.
Special tokens are called special because they are not derived from your input. They are added for a certain purpose and are independent of the specific input.
The output is a list of tuples, with each tuple containing one word and its span in the original sentence (which is used to determine the final offsets of our Encoding).
BatchEncoding holds the output of the tokenizer's encoding methods (__call__, encode_plus and batch_encode_plus) and is derived from a Python dictionary.
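To make those offsets concrete, here is a small sketch; the exact token splits depend on the bert-base-cased vocabulary, so the commented output is what I would expect rather than guaranteed:
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained('bert-base-cased')
enc = tokenizer('text with percentage%', return_offsets_mapping=True)

print(enc['offset_mapping'])
# expected something like:
# [(0, 0), (0, 4), (5, 9), (10, 20), (20, 21), (0, 0)]
# The (0, 0) pairs belong to the [CLS]/[SEP] special tokens; every other
# pair is a token's character span in the original string, e.g. (20, 21)
# is the '%'.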
If you are trying to use BERT for token classification in order to find a span in your original string, then one workaround is to use BertTokenizerFast with the option return_offsets_mapping=True:
from transformers import BertTokenizerFast

test_string = 'text with percentage%'
tokenizer = BertTokenizerFast.from_pretrained('bert-base-cased')
tokens = tokenizer(test_string, return_offsets_mapping=True)
input_ids = tokens["input_ids"]

# some_model is a placeholder for your token classification model; it is
# assumed to return the start and stop token indices of the predicted span.
span_start_index, span_stop_index = some_model(input_ids)
Then, once you get the token classification results, you can do something like:
start_char = tokens.encodings[0].offsets[span_start_index][0]
end_char = tokens.encodings[0].offsets[span_stop_index][1]
predicted_span = test_string[start_char:end_char]
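Because the span is sliced straight out of the original string, punctuation comes back untouched. For example, with hypothetical model output pointing at the 'percentage' and '%' tokens (indices match the expected tokenization sketched above):
# Hypothetical model output: the span covers 'percentage' and '%'.
span_start_index, span_stop_index = 3, 4

start_char = tokens.encodings[0].offsets[span_start_index][0]  # expected 10
end_char = tokens.encodings[0].offsets[span_stop_index][1]     # expected 21
print(test_string[start_char:end_char])
# expected: 'percentage%' -- no extra space before the %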
According to https://github.com/huggingface/transformers/pull/1274, they're working on it. Hopefully there will be a solution sometime next week.