
Get probability of multi-token word in MASK position

It is relatively easy to get a token's probability according to a language model, as the snippet below shows. You take the model's output, restrict yourself to the output at the masked position, and then look up the probability of your requested token in that output vector. However, this only works for single-token words, i.e. words that are themselves in the tokenizer's vocabulary. When a word is not in the vocabulary, the tokenizer splits it into pieces that it does know (see the bottom of the example). But since the input sentence contains only one masked position and the requested word consists of more tokens than that, how can we get its probability? Ultimately, I am looking for a solution that works regardless of the number of subword units a word has.

In the code below I have added many comments explaining what is going on, and I have included the output of the print statements as comments. You'll see that predicting tokens such as 'love' and 'hate' is straightforward because they are in the tokenizer's vocabulary. 'reprimand' is not, though, so it cannot be predicted in a single masked position: it consists of three subword units. So how can we get the probability of 'reprimand' in the masked position?

from transformers import BertTokenizer, BertForMaskedLM
import torch

# init model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased')
model.eval()
# init softmax to get probabilities later on
sm = torch.nn.Softmax(dim=0)
torch.set_grad_enabled(False)

# set sentence with MASK token, convert to token_ids
sentence = f"I {tokenizer.mask_token} you"
token_ids = tokenizer.encode(sentence, return_tensors='pt')
print(token_ids)
# tensor([[ 101, 1045,  103, 2017,  102]])
# get the position of the masked token
masked_position = (token_ids.squeeze() == tokenizer.mask_token_id).nonzero().item()

# forward pass
output = model(token_ids)
# output[0] holds the prediction logits (not hidden states);
# after dropping the batch dimension the shape is (seq_len, vocab_size)
logits = output[0].squeeze(0)
# keep only the logits at the masked position: one score per vocabulary item
mask_logits = logits[masked_position]
# convert to probabilities (softmax over the vocabulary),
# giving a probability for each item in the vocabulary
probs = sm(mask_logits)

# get probability of token 'hate'
hate_id = tokenizer.convert_tokens_to_ids('hate')
print('hate probability', probs[hate_id].item())
# hate probability 0.008057191967964172

# get probability of token 'love'
love_id = tokenizer.convert_tokens_to_ids('love')
print('love probability', probs[love_id].item())
# love probability 0.6704086065292358

# get probability of token 'reprimand' (?)
reprimand_id = tokenizer.convert_tokens_to_ids('reprimand')
# reprimand is not in the vocabulary, so it needs to be split into subword units
print(tokenizer.convert_ids_to_tokens(reprimand_id))
# [UNK]

reprimand_id = tokenizer.encode('reprimand', add_special_tokens=False)
print(tokenizer.convert_ids_to_tokens(reprimand_id))
# ['rep', '##rim', '##and']
# but how do we now get the probability of a multi-token word in a single-token position?
Bram Vanroy asked Dec 21 '19 09:12




2 Answers

Since the word is not in the vocabulary, BERT has no probability for it as a single unit, so there is no use in masking it before tokenization.

And you can't get its probability by applying the chain rule; see the response by J. Devlin. To illustrate, take a more general example: try to estimate the probability of some bigram at position i. While you can estimate the probability of each word given the rest of the sentence and its position,

P(w_i | w_0, w_1, ..., w_{i-1}, w_{i+1}, ..., w_N),

P(w_{i+1} | w_0, w_1, ..., w_i, w_{i+2}, ..., w_N),

there is no way to get the probability of the bigram

P(w_i, w_{i+1} | w_0, w_1, ..., w_{i-1}, w_{i+2}, ..., w_N)

because BERT does not store such information.
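To see why the two single-word conditionals are not enough, here is a toy sketch. The joint distribution is made up purely for illustration (it is not produced by BERT), and the helper names are hypothetical. It shows that multiplying the two conditionals can give a very different number from the true bigram probability.

```python
# Toy joint distribution over two adjacent positions.
# The numbers are invented for illustration only; they do not come from BERT.
joint = {
    ("new", "york"): 0.5,
    ("los", "angeles"): 0.5,
}

def conditional_first(first, second):
    """P(w_i = first | w_{i+1} = second) under the toy joint."""
    denom = sum(p for (a, b), p in joint.items() if b == second)
    return joint.get((first, second), 0.0) / denom

def conditional_second(first, second):
    """P(w_{i+1} = second | w_i = first) under the toy joint."""
    denom = sum(p for (a, b), p in joint.items() if a == first)
    return joint.get((first, second), 0.0) / denom

# Each word is fully determined by its neighbour, so both conditionals are 1.0 ...
product = conditional_first("new", "york") * conditional_second("new", "york")
print(product)                 # 1.0
# ... but the bigram itself only has probability 0.5.
print(joint[("new", "york")])  # 0.5
```

A masked language model gives you conditionals of the first kind, each with the other word visible, which is why their product is not the bigram probability.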

Having said all that, you can get a very rough estimate of the probability of your OOV word by multiplying the probabilities of seeing its parts. So you will get

P("reprimand"|...) ~= P("rep"|...)*P("##rim"|...)*P("##and"|...)

Since your subwords are not regular words but a special kind of token, this is not entirely wrong, because the dependency between them is implicit.
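A minimal sketch of that rough estimate, assuming you have already run the model on a sentence with one [MASK] per subword and taken the softmax over the vocabulary at each masked position. The function name and the toy numbers are hypothetical; plain Python lists stand in for the probability rows (rows from a tensor converted with .tolist() should work the same way). The product is computed in log space to avoid underflow with many pieces.

```python
import math

def subword_product_estimate(probs_per_mask, piece_ids):
    """Multiply P(piece_k | context) over the mask positions, in log space.

    probs_per_mask[k] is the probability distribution over the vocabulary
    at the k-th mask position; piece_ids[k] is the vocabulary id of the
    k-th subword piece.
    """
    if len(probs_per_mask) != len(piece_ids):
        raise ValueError("need exactly one mask position per subword piece")
    log_prob = sum(
        math.log(dist[idx]) for dist, idx in zip(probs_per_mask, piece_ids)
    )
    return math.exp(log_prob)

# Made-up distributions over a 4-item vocabulary, for illustration only.
toy_probs = [
    [0.1, 0.6, 0.2, 0.1],
    [0.25, 0.25, 0.25, 0.25],
    [0.5, 0.1, 0.1, 0.3],
]
estimate = subword_product_estimate(toy_probs, [1, 2, 0])
print(estimate)  # roughly 0.075, i.e. 0.6 * 0.25 * 0.5
```

Keep in mind that this treats the pieces as independent given the context, so the result is a score for ranking candidates rather than a true probability.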

igrinis answered Oct 24 '22 20:10


Instead of sentence = f"I {tokenizer.mask_token} you", predict on "I [MASK] [MASK] you" and "I [MASK] [MASK] [MASK] you", then filter the results: drop chains made of whole-word tokens so that you keep only chains of subwords that fit together into a single word. Of course, you will get better results if you provide more than two words of surrounding context.
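The two string-handling steps can be sketched as follows. This is only an illustration: the helper names are hypothetical, and the filter assumes BERT's WordPiece convention that continuation pieces start with "##". Running the model on each sentence and collecting candidate token chains is left out.

```python
def masked_sentences(template, mask_token="[MASK]", max_pieces=3):
    """Fill the '{}' slot in `template` with 1..max_pieces mask tokens."""
    return [
        template.format(" ".join([mask_token] * n))
        for n in range(1, max_pieces + 1)
    ]

def is_single_word_chain(tokens):
    """True if `tokens` reads as one word: a word-initial piece followed
    only by '##' continuation pieces."""
    if not tokens or tokens[0].startswith("##"):
        return False
    return all(tok.startswith("##") for tok in tokens[1:])

print(masked_sentences("I {} you"))
# ['I [MASK] you', 'I [MASK] [MASK] you', 'I [MASK] [MASK] [MASK] you']
print(is_single_word_chain(["rep", "##rim", "##and"]))  # True
print(is_single_word_chain(["really", "love"]))         # False
```

In practice you would pass tokenizer.mask_token instead of the hard-coded default.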

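But before you embark on that, double-check your softmax dimension. With dim=0, softmax normalizes along the first dimension, so on a 2-D tensor of shape (tokens, vocabulary) it normalizes each vocabulary column across the token rows instead of giving each token its own distribution over the vocabulary. On the 1-D vector for a single masked position, as in your snippet, dim=0 is fine, but once you score several masked positions at once you want dim=1 (or dim=-1):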

In [1]: import torch
In [2]: m = torch.nn.Softmax(dim=1)
   ...: input = torch.randn(2, 3)
   ...: input
Out[2]:
tensor([[ 1.5542,  0.3776, -0.8047],
        [-0.3856,  1.1327, -0.1252]])

In [3]: m(input)
Out[3]:
tensor([[0.7128, 0.2198, 0.0674],
        [0.1457, 0.6652, 0.1891]])

In [4]: soft = torch.nn.Softmax(dim=0)
   ...: soft(input)
Out[4]:
tensor([[0.8743, 0.3197, 0.3364],
        [0.1257, 0.6803, 0.6636]])
Todd Cook answered Oct 24 '22 20:10