 

BertTokenizer - when encoding and decoding sequences extra spaces appear

When using Transformers from HuggingFace, I am facing a problem with the encode and decode methods.

I have the following string:

test_string = 'text with percentage%'

Then I am running the following code:

import torch
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-cased')

test_string = 'text with percentage%'

# encode converts a string into a sequence of ids (integers), using the tokenizer and vocabulary.
input_ids = tokenizer.encode(test_string)
output = tokenizer.decode(input_ids)

And the output looks like this:

'text with percentage %'

Note the extra space before the %. I have tried extra arguments such as clean_up_tokenization_spaces, but that option addresses something different.
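
You can see where the space comes from by inspecting the tokens themselves (a quick check with the same bert-base-cased tokenizer): the basic tokenizer splits punctuation into its own token, and decode() rejoins whole-word tokens with a single space.

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-cased')

# The punctuation splitter turns 'percentage%' into two separate tokens,
# and decode() joins whole-word tokens with a single space:
print(tokenizer.tokenize('text with percentage%'))
# ['text', 'with', 'percentage', '%']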

What should I use when encoding and decoding to get exactly the same text before and after? This also happens with other special characters.

asked Nov 21 '19 by Henryk Borzymowski

People also ask

What is hugging face tokenizer?

A tokenizer is in charge of preparing the inputs for a model. The library contains tokenizers for all the models. Most of the tokenizers are available in two flavors: a full python implementation and a “Fast” implementation based on the Rust library 🤗 Tokenizers.
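
For BERT, the two flavors can be loaded like this (a minimal sketch; both classes live in transformers):

from transformers import BertTokenizer, BertTokenizerFast

slow_tokenizer = BertTokenizer.from_pretrained('bert-base-cased')      # pure Python
fast_tokenizer = BertTokenizerFast.from_pretrained('bert-base-cased')  # Rust-backed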

What is a special token?

Special tokens are called special because they are not derived from your input. They are added for a certain purpose and are independent of the specific input.
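
For instance, encode() adds BERT's [CLS] and [SEP] special tokens by default, which is easy to verify:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
ids = tokenizer.encode('text with percentage%')
print(tokenizer.convert_ids_to_tokens(ids))
# ['[CLS]', 'text', 'with', 'percentage', '%', '[SEP]']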

What is the output of tokenizer?

The output is a list of tuples, with each tuple containing one word and its span in the original sentence (which is used to determine the final offsets of our Encoding).
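
That description refers to the pre-tokenization step in the 🤗 Tokenizers library; a small illustration using its Whitespace pre-tokenizer:

from tokenizers.pre_tokenizers import Whitespace

# Each tuple is (word, (start, end)) with character offsets into the input
pre_tokenizer = Whitespace()
print(pre_tokenizer.pre_tokenize_str('text with percentage%'))
# [('text', (0, 4)), ('with', (5, 9)), ('percentage', (10, 20)), ('%', (20, 21))]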

What is BatchEncoding?

BatchEncoding holds the output of the tokenizer's encoding methods (__call__, encode_plus and batch_encode_plus) and is derived from a Python dictionary.
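
A quick sketch of what that looks like in practice (dict-style access works, and fast tokenizers add convenience methods such as word_ids()):

from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained('bert-base-cased')
encoding = tokenizer('text with percentage%')

print(encoding['input_ids'])   # dict-style access, like a plain dict
print(encoding.word_ids())     # extra method available on fast-tokenizer output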




2 Answers

If you are trying to use BERT for token classification in order to find a span in your original string, then one workaround is to use BertTokenizerFast with the option return_offsets_mapping=True.

from transformers import BertTokenizerFast

test_string = 'text with percentage%'

tokenizer = BertTokenizerFast.from_pretrained('bert-base-cased')
tokens = tokenizer(test_string, return_offsets_mapping=True)
input_ids = tokens.data["input_ids"]

# some_model is a placeholder for a token-classification model that
# returns the start and end token indices of the predicted span
span_start_index, span_stop_index = some_model(input_ids)

Then once you get the token classification results, you can do something like

offsets = tokens.encodings[0].offsets
predicted_span = test_string[offsets[span_start_index][0]:offsets[span_stop_index][1]]
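
To make the idea concrete (hypothetical indices: suppose the model predicted token positions 3 and 4, which for this input are 'percentage' and '%'), the offsets map straight back into the raw string with no spurious space:

predicted_span = test_string[offsets[3][0]:offsets[4][1]]
print(predicted_span)  # 'percentage%'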
answered Oct 18 '22 by vermouth


According to https://github.com/huggingface/transformers/pull/1274, they're working on it. Hopefully there will be a solution sometime next week.

answered Oct 18 '22 by Anjie Guo