 

Tokens-to-words mapping in the tokenizer decode step (Hugging Face)?

Is there a way to know the mapping from the tokens back to the original words in the tokenizer.decode() function?
For example:

from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained('roberta-large', do_lower_case=True)

str = "This is a tokenization example"
tokenized = tokenizer.tokenize(str) 
## ['this', 'Ġis', 'Ġa', 'Ġtoken', 'ization', 'Ġexample']

encoded = tokenizer.encode_plus(str) 
## encoded['input_ids']=[0, 42, 16, 10, 19233, 1938, 1246, 2]

decoded = tokenizer.decode(encoded['input_ids']) 
## '<s> this is a tokenization example</s>'

The objective is to have a function that maps each token in the decode step back to the correct input word. Here it would be:
desired_output = [[1], [2], [3], [4, 5], [6]]
since This corresponds to id 42, while token and ization correspond to ids [19233, 1938], which sit at indexes 4 and 5 of the input_ids array.

asked Jun 11 '20 by DsCpp


People also ask

What does Huggingface tokenizer do?

Tokenizers are one of the core components of the NLP pipeline. They serve one purpose: to translate text into data that can be processed by the model. Models can only process numbers, so tokenizers need to convert our text inputs to numerical data.
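
As a minimal sketch of that translation step (assuming the same 'roberta-large' checkpoint used in the question), a fast tokenizer call looks roughly like this:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('roberta-large')

# The tokenizer turns a string into integer ids the model can consume,
# and decode() turns the ids back into text (including special tokens).
ids = tokenizer("This is a tokenization example")["input_ids"]
print(ids)                    # a list of integer ids wrapped in <s> ... </s>
print(tokenizer.decode(ids))  # roughly '<s>This is a tokenization example</s>'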

How do you train a tokenizer Huggingface?

Training the tokenizer: start with all the characters present in the training corpus as tokens. Identify the most common pair of tokens and merge it into one token. Repeat until the vocabulary (i.e., the number of tokens) has reached the size we want.
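
As a hedged sketch of that loop, using the standalone tokenizers library (corpus.txt below is just a placeholder for your own training corpus):

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

# Start from an empty BPE model; the trainer repeatedly merges the most
# frequent pair of tokens until the target vocabulary size is reached.
bpe_tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
bpe_tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(vocab_size=5000, special_tokens=["[UNK]", "[PAD]"])
bpe_tokenizer.train(["corpus.txt"], trainer=trainer)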

What is tokenizer offset mapping?

For each sub-token returned by the tokenizer, the offset mapping gives us a tuple indicating the sub-token's start position and end position relative to the original token it was split from. That means that if the first position in the tuple is anything other than 0 , we will set its corresponding label to -100 .
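
A small illustration, assuming a fast (Rust-backed) tokenizer, since return_offsets_mapping is only available there; encoding a single word keeps the offsets relative to that word:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('roberta-large')

enc = tokenizer("tokenization", return_offsets_mapping=True, add_special_tokens=False)
print(enc["offset_mapping"])  # character spans, e.g. [(0, 5), (5, 12)] for 'token' + 'ization'

# Sub-tokens whose span does not start at 0 are continuations of a word;
# in token classification these are the ones typically labelled -100.
labels = [0 if start == 0 else -100 for start, end in enc["offset_mapping"]]
print(labels)                 # e.g. [0, -100]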

What is word piece tokenizer?

One such subword tokenization technique that is commonly used and can be applied to many other NLP models is called WordPiece. Given text, WordPiece first pre-tokenizes the text into words (by splitting on punctuation and whitespaces) and then tokenizes each word into subword units, called wordpieces.
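
For instance, BERT checkpoints use WordPiece and mark continuation pieces with a '##' prefix (a quick sketch using bert-base-uncased):

from transformers import AutoTokenizer

bert_tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
print(bert_tokenizer.tokenize("This is a tokenization example"))
## expected to look like ['this', 'is', 'a', 'token', '##ization', 'example']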




2 Answers

As far as I know there is no built-in method for that, but you can create one yourself:

from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained('roberta-large', do_lower_case=True)

example = "This is a tokenization example"

print({x : tokenizer.encode(x, add_special_tokens=False, add_prefix_space=True) for x in example.split()})

Output:

{'This': [42], 'is': [16], 'a': [10], 'tokenization': [19233, 1938], 'example': [1246]}

To get exactly your desired output, encode each word with a list comprehension and keep a running token index:

# start at index 1 because the number of leading special tokens is fixed for each model
# (but be aware of the difference between single-sentence and sentence-pair input)
idx = 1

enc = [tokenizer.encode(x, add_special_tokens=False, add_prefix_space=True) for x in example.split()]

desired_output = []

for token in enc:
    tokenoutput = []
    for ids in token:
        tokenoutput.append(idx)
        idx += 1
    desired_output.append(tokenoutput)

print(desired_output)

Output:

[[1], [2], [3], [4, 5], [6]]
answered Nov 13 '22 by cronoik


If you use the fast tokenizers, i.e. the Rust-backed versions from the tokenizers library, the encoding contains a word_ids method that can be used to map sub-words back to their original words. What constitutes a word vs. a subword depends on the tokenizer: a word is something generated by the pre-tokenization stage (e.g. splitting on whitespace), whereas a subword is generated by the actual model (BPE or Unigram, for example).

The code below should work in general, even if the pre-tokenization performs additional splitting. For example, I created my own custom pre-tokenization step that splits on PascalCase; the words there are Pascal and Case, so the accepted answer won't work in that case, since it assumes words are whitespace-delimited.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('roberta-large', do_lower_case=True)

example = "This is a tokenization example"

encoded = tokenizer(example)

desired_output = []
for word_id in encoded.word_ids():
    if word_id is not None:
        start, end = encoded.word_to_tokens(word_id)
        # collect every token index covered by this word (also handles words
        # split into more than two sub-tokens)
        tokens = list(range(start, end))
        if len(desired_output) == 0 or desired_output[-1] != tokens:
            desired_output.append(tokens)
desired_output
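
On the example sentence from the question this should again give [[1], [2], [3], [4, 5], [6]], but unlike the accepted answer it also works when the pre-tokenizer splits on more than whitespace.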
answered Nov 13 '22 by David Waterworth