Is there a way to know the mapping from the tokens back to the original words in the tokenizer.decode()
function?
For example:
from transformers import RobertaTokenizer
tokenizer = RobertaTokenizer.from_pretrained('roberta-large', do_lower_case=True)
text = "This is a tokenization example"
tokenized = tokenizer.tokenize(text)
## ['this', 'Ġis', 'Ġa', 'Ġtoken', 'ization', 'Ġexample']
encoded = tokenizer.encode_plus(text)
## encoded['input_ids'] = [0, 42, 16, 10, 19233, 1938, 1246, 2]
decoded = tokenizer.decode(encoded['input_ids'])
## '<s> this is a tokenization example</s>'
The objective is a function that maps each token in the decoded sequence back to the correct input word. Here the desired output would be desired_output = [[1], [2], [3], [4, 5], [6]], since This corresponds to id 42 at index 1, while token and ization correspond to ids [19233, 1938], which are at indexes 4 and 5 of the input_ids array.
Tokenizers are one of the core components of the NLP pipeline. They serve one purpose: to translate text into data that can be processed by the model. Models can only process numbers, so tokenizers need to convert our text inputs to numerical data.
Training the tokenizer: start with all the characters present in the training corpus as tokens. Identify the most common pair of tokens and merge it into one token. Repeat until the vocabulary (i.e., the number of tokens) has reached the size we want.
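As an illustration, here is a minimal sketch of that training loop using the Hugging Face tokenizers library; the toy corpus and vocab_size below are placeholders, not something from the question:

from tokenizers import Tokenizer, models, trainers, pre_tokenizers

# Toy corpus just for demonstration
corpus = ["This is a tokenization example", "Tokenizers translate text into ids"]

# Byte-pair-encoding model, trained by repeatedly merging the most frequent pair
bpe_tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
bpe_tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.BpeTrainer(vocab_size=100, special_tokens=["[UNK]"])
bpe_tokenizer.train_from_iterator(corpus, trainer=trainer)

print(bpe_tokenizer.encode("tokenization").tokens)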
For each sub-token returned by the tokenizer, the offset mapping gives us a tuple indicating the sub-token's start position and end position relative to the original word it was split from. That means that if the first position in the tuple is anything other than 0, we will set its corresponding label to -100.
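A hedged sketch of how that check is commonly used to align per-word labels with sub-tokens (the bert-base-uncased model and the word_labels values are my own illustrative choices; -100 is the usual ignore index for the loss):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
words = ["This", "is", "a", "tokenization", "example"]
word_labels = [0, 0, 0, 1, 0]  # hypothetical per-word labels

enc = tokenizer(words, is_split_into_words=True, return_offsets_mapping=True)
labels = []
for word_id, (start, end) in zip(enc.word_ids(), enc['offset_mapping']):
    if word_id is None or start != 0:
        # special token, or a continuation sub-token -> ignore in the loss
        labels.append(-100)
    else:
        labels.append(word_labels[word_id])
print(labels)
## e.g. [-100, 0, 0, 0, 1, -100, 0, -100]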
One such subword tokenization technique that is commonly used and can be applied to many other NLP models is called WordPiece. Given text, WordPiece first pre-tokenizes the text into words (by splitting on punctuation and whitespace) and then tokenizes each word into subword units, called wordpieces.
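For example, BERT's WordPiece tokenizer splits a word that is not in its vocabulary into wordpieces, marking continuation pieces with ##:

from transformers import BertTokenizer

bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
print(bert_tokenizer.tokenize("This is a tokenization example"))
## e.g. ['this', 'is', 'a', 'token', '##ization', 'example']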
As far as I know there is no built-in method for that, but you can create one yourself:
from transformers import RobertaTokenizer
tokenizer = RobertaTokenizer.from_pretrained('roberta-large', do_lower_case=True)
example = "This is a tokenization example"
print({x : tokenizer.encode(x, add_special_tokens=False, add_prefix_space=True) for x in example.split()})
Output:
{'This': [42], 'is': [16], 'a': [10], 'tokenization': [19233, 1938], 'example': [1246]}
To get exactly your desired output, you have to work with a list comprehension:
# start index is 1 because the number of special tokens is fixed for each model
# (but be aware of the difference between single-sentence and pairwise-sentence input)
idx = 1
enc = [tokenizer.encode(x, add_special_tokens=False, add_prefix_space=True) for x in example.split()]
desired_output = []
for token in enc:
    tokenoutput = []
    for ids in token:
        tokenoutput.append(idx)
        idx += 1
    desired_output.append(tokenoutput)
print(desired_output)
Output:
[[1], [2], [3], [4, 5], [6]]
If you use the fast tokenizers, i.e. the Rust-backed versions from the tokenizers library, the encoding contains a word_ids method that can be used to map sub-words back to their original word. What constitutes a word vs. a subword depends on the tokenizer: a word is something generated by the pre-tokenization stage (i.e. split by whitespace), while a subword is generated by the actual model (BPE or Unigram, for example).
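For example (assuming a fast tokenizer; the exact tokens and ids may differ), the raw word_ids() output looks like this:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('roberta-large')
encoded = tokenizer("This is a tokenization example")
print(encoded.tokens())
## e.g. ['<s>', 'This', 'Ġis', 'Ġa', 'Ġtoken', 'ization', 'Ġexample', '</s>']
print(encoded.word_ids())
## e.g. [None, 0, 1, 2, 3, 3, 4, None]  (None marks special tokens)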
The code below should work in general, even if the pre-tokenization performs additional splitting. For example, I created my own custom step that splits based on PascalCase - the words here are Pascal and Case - and the accepted answer won't work in this case, since it assumes words are whitespace-delimited. (A rough sketch of such a pre-tokenization step is shown after the code.)
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('roberta-large', do_lower_case=True)
example = "This is a tokenization example"

encoded = tokenizer(example)
desired_output = []
for word_id in encoded.word_ids():
    if word_id is not None:
        start, end = encoded.word_to_tokens(word_id)
        tokens = list(range(start, end))  # every token index belonging to this word
        if len(desired_output) == 0 or desired_output[-1] != tokens:
            desired_output.append(tokens)
print(desired_output)
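As a side note, here is a rough sketch of one way to add such a PascalCase pre-tokenization step; the regex, the Split behavior, and the example sentence are my own assumptions, not the actual custom step described above:

from tokenizers import Regex, pre_tokenizers
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('roberta-large')

# Run a PascalCase split before RoBERTa's own byte-level pre-tokenizer,
# so "PascalCase" is pre-tokenized into the two words "Pascal" and "Case".
tokenizer.backend_tokenizer.pre_tokenizer = pre_tokenizers.Sequence([
    pre_tokenizers.Split(Regex(r"[A-Z][a-z]+"), behavior="isolated"),
    tokenizer.backend_tokenizer.pre_tokenizer,
])

encoded = tokenizer("PascalCase is fun")
print(encoded.word_ids())  # sub-tokens now map to Pascal / Case / is / fun separately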