Is there a way to know the mapping from the tokens back to the original words in the tokenizer.decode()
function?
For example:
from transformers import RobertaTokenizer
tokenizer = RobertaTokenizer.from_pretrained('roberta-large', do_lower_case=True)
text = "This is a tokenization example"
tokenized = tokenizer.tokenize(text)
## ['this', 'Ġis', 'Ġa', 'Ġtoken', 'ization', 'Ġexample']
encoded = tokenizer.encode_plus(text)
## encoded['input_ids'] = [0, 42, 16, 10, 19233, 1938, 1246, 2]
decoded = tokenizer.decode(encoded['input_ids'])
## '<s> this is a tokenization example</s>'
The objective is a function that maps each token in the decoded sequence back to the correct input word. Here the desired output would be desired_output = [[1], [2], [3], [4, 5], [6]], since This corresponds to id 42 at index 1, while token and ization correspond to ids [19233, 1938], which are at indexes 4 and 5 of the input_ids array.
Tokenizers are one of the core components of the NLP pipeline. They serve one purpose: to translate text into data that can be processed by the model. Models can only process numbers, so tokenizers need to convert our text inputs to numerical data.
Training the tokenizer: start with all the characters present in the training corpus as tokens. Identify the most common pair of tokens and merge it into one token. Repeat until the vocabulary (i.e., the number of tokens) has reached the size we want.
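As an illustration, here is a minimal sketch of that training loop using the Hugging Face tokenizers library; the toy corpus and vocab_size below are placeholders, not something from the question:

from tokenizers import Tokenizer, models, trainers, pre_tokenizers

# Toy corpus just for demonstration
corpus = ["This is a tokenization example", "Tokenizers translate text into ids"]

# Byte-pair-encoding model, trained by repeatedly merging the most frequent pair
bpe_tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
bpe_tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.BpeTrainer(vocab_size=100, special_tokens=["[UNK]"])
bpe_tokenizer.train_from_iterator(corpus, trainer=trainer)

print(bpe_tokenizer.encode("tokenization").tokens)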
For each sub-token returned by the tokenizer, the offset mapping gives us a tuple indicating the sub-token's start position and end position relative to the original word it was split from. That means that if the first position in the tuple is anything other than 0, we will set its corresponding label to -100.
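A hedged sketch of how that check is commonly used to align per-word labels with sub-tokens (the bert-base-uncased model and the word_labels values are my own illustrative choices; -100 is the usual ignore index for the loss):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
words = ["This", "is", "a", "tokenization", "example"]
word_labels = [0, 0, 0, 1, 0]  # hypothetical per-word labels

enc = tokenizer(words, is_split_into_words=True, return_offsets_mapping=True)
labels = []
for word_id, (start, end) in zip(enc.word_ids(), enc['offset_mapping']):
    if word_id is None or start != 0:
        # special token, or a continuation sub-token -> ignore in the loss
        labels.append(-100)
    else:
        labels.append(word_labels[word_id])
print(labels)
## e.g. [-100, 0, 0, 0, 1, -100, 0, -100]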
One such subword tokenization technique that is commonly used and can be applied to many other NLP models is called WordPiece. Given text, WordPiece first pre-tokenizes the text into words (by splitting on punctuation and whitespace) and then tokenizes each word into subword units, called wordpieces.
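For example, BERT's WordPiece tokenizer splits a word that is not in its vocabulary into wordpieces, marking continuation pieces with ##:

from transformers import BertTokenizer

bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
print(bert_tokenizer.tokenize("This is a tokenization example"))
## e.g. ['this', 'is', 'a', 'token', '##ization', 'example']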
As far as I know there is no built-in method for that, but you can create one yourself:
from transformers import RobertaTokenizer
tokenizer = RobertaTokenizer.from_pretrained('roberta-large', do_lower_case=True)
example = "This is a tokenization example"
print({x : tokenizer.encode(x, add_special_tokens=False, add_prefix_space=True) for x in example.split()})
Output:
{'This': [42], 'is': [16], 'a': [10], 'tokenization': [19233, 1938], 'example': [1246]}
To get exactly your desired output, you have to work with a list comprehension:
# start index is 1 because the number of special tokens is fixed for each model
# (but be aware of the difference between single-sentence and pairwise-sentence input)
idx = 1
enc = [tokenizer.encode(x, add_special_tokens=False, add_prefix_space=True) for x in example.split()]
desired_output = []
for token in enc:
    tokenoutput = []
    for ids in token:
        tokenoutput.append(idx)
        idx += 1
    desired_output.append(tokenoutput)
print(desired_output)
Output:
[[1], [2], [3], [4, 5], [6]]
If you use the fast tokenizers, i.e. the Rust-backed versions from the tokenizers library, the encoding contains a word_ids method that can be used to map sub-words back to their original word. What constitutes a word vs. a subword depends on the tokenizer: a word is something generated by the pre-tokenization stage (i.e. split by whitespace), while a subword is generated by the actual model (BPE or Unigram, for example).
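For example (assuming a fast tokenizer; the exact tokens and ids may differ), the raw word_ids() output looks like this:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('roberta-large')
encoded = tokenizer("This is a tokenization example")
print(encoded.tokens())
## e.g. ['<s>', 'This', 'Ġis', 'Ġa', 'Ġtoken', 'ization', 'Ġexample', '</s>']
print(encoded.word_ids())
## e.g. [None, 0, 1, 2, 3, 3, 4, None]  (None marks special tokens)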
The code below should work in general, even if the pre-tokenization performs additional splitting. For example, I created my own custom step that splits based on PascalCase - the words here are Pascal and Case - and the accepted answer won't work in this case, since it assumes words are whitespace-delimited. (A rough sketch of such a pre-tokenization step is shown after the code.)
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('roberta-large', do_lower_case=True)
example = "This is a tokenization example"

encoded = tokenizer(example)
desired_output = []
for word_id in encoded.word_ids():
    if word_id is not None:
        start, end = encoded.word_to_tokens(word_id)
        tokens = list(range(start, end))  # every token index belonging to this word
        if len(desired_output) == 0 or desired_output[-1] != tokens:
            desired_output.append(tokens)
print(desired_output)
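As a side note, here is a rough sketch of one way to add such a PascalCase pre-tokenization step; the regex, the Split behavior, and the example sentence are my own assumptions, not the actual custom step described above:

from tokenizers import Regex, pre_tokenizers
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('roberta-large')

# Run a PascalCase split before RoBERTa's own byte-level pre-tokenizer,
# so "PascalCase" is pre-tokenized into the two words "Pascal" and "Case".
tokenizer.backend_tokenizer.pre_tokenizer = pre_tokenizers.Sequence([
    pre_tokenizers.Split(Regex(r"[A-Z][a-z]+"), behavior="isolated"),
    tokenizer.backend_tokenizer.pre_tokenizer,
])

encoded = tokenizer("PascalCase is fun")
print(encoded.word_ids())  # sub-tokens now map to Pascal / Case / is / fun separately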