I want to use BertForMaskedLM or BertModel to calculate perplexity of a sentence, so I write code like this:
import numpy as np
import torch
import torch.nn as nn
from transformers import BertTokenizer, BertForMaskedLM
# Load pre-trained model (weights)
with torch.no_grad():
    model = BertForMaskedLM.from_pretrained('hfl/chinese-bert-wwm-ext')
    model.eval()
    # Load pre-trained model tokenizer (vocabulary)
    tokenizer = BertTokenizer.from_pretrained('hfl/chinese-bert-wwm-ext')
    sentence = "我不会忘记和你一起奋斗的时光。"
    tokenize_input = tokenizer.tokenize(sentence)
    tensor_input = torch.tensor([tokenizer.convert_tokens_to_ids(tokenize_input)])
    sen_len = len(tokenize_input)
    sentence_loss = 0.
    for i, word in enumerate(tokenize_input):
        # add mask to i-th character of the sentence
        tokenize_input[i] = '[MASK]'
        mask_input = torch.tensor([tokenizer.convert_tokens_to_ids(tokenize_input)])
        output = model(mask_input)
        prediction_scores = output[0]
        softmax = nn.Softmax(dim=0)
        ps = softmax(prediction_scores[0, i]).log()
        word_loss = ps[tensor_input[0, i]]
        sentence_loss += word_loss.item()
        tokenize_input[i] = word
    ppl = np.exp(-sentence_loss / sen_len)
    print(ppl)
I think this code is right, but I also noticed BertForMaskedLM's parameter masked_lm_labels, so could I use this parameter to calculate the PPL of a sentence more easily?
I know the input_ids argument is the masked input and the masked_lm_labels argument is the desired output, but I can't understand the actual meaning of the loss it returns. The relevant code in the library looks like this:
if masked_lm_labels is not None:
    loss_fct = CrossEntropyLoss()  # -100 index = padding token
    masked_lm_loss = loss_fct(prediction_scores.view(-1, self.config.vocab_size),
                              masked_lm_labels.view(-1))
    outputs = (masked_lm_loss,) + outputs
In a unigram model, the probability of a sentence s = w_1 ... w_n is p(s) = ∏_{i=1}^{n} p(w_i), where p(w_i) is the probability that the word w_i occurs. Perplexity is the multiplicative inverse of the probability assigned to the test set by the language model, normalized by the number of words in the test set: PPL = p(s)^(-1/n). A language model that predicts the test set well assigns it a high probability, and therefore has a low perplexity.
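As a toy illustration (the probabilities below are made up, not produced by any real model), perplexity can be computed directly from per-word probabilities in log space:
import numpy as np

# made-up per-word probabilities for a 4-word sentence
word_probs = [0.1, 0.2, 0.05, 0.3]
log_prob = np.sum(np.log(word_probs))      # log p(s) = sum of log p(w_i)
ppl = np.exp(-log_prob / len(word_probs))  # exp of the average negative log-likelihood
print(ppl)                                 # ~7.6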
BERT is an open-source language representation model for natural language processing (NLP). It is designed to help computers understand the meaning of ambiguous language in text by using the surrounding text to establish context.
Yes, you can use the parameter labels (or masked_lm_labels; the parameter name varies across versions of Hugging Face transformers) to specify the masked token positions, and use -100 to ignore the tokens that you don't want to include in the loss computation.
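The -100 convention works because CrossEntropyLoss ignores any target equal to its ignore_index, which defaults to -100. A minimal sketch with made-up logits showing that the ignored positions drop out of the loss:
import torch
import torch.nn as nn

loss_fct = nn.CrossEntropyLoss()          # ignore_index defaults to -100
logits = torch.randn(3, 10)               # 3 token positions, toy vocab of size 10
labels = torch.tensor([-100, 4, -100])    # only the middle position counts
print(torch.allclose(loss_fct(logits, labels),
                     loss_fct(logits[1:2], labels[1:2])))  # True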
For example,
sentence='我爱你'
from transformers import BertTokenizer, BertForMaskedLM
import torch
import numpy as np
tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')
model = BertForMaskedLM.from_pretrained('bert-base-chinese')
tensor_input = tokenizer.encode(sentence, return_tensors='pt')
# tensor([[ 101, 2769, 4263, 872, 102]])
repeat_input = tensor_input.repeat(tensor_input.size(-1)-2, 1)
# tensor([[ 101, 2769, 4263, 872, 102],
# [ 101, 2769, 4263, 872, 102],
# [ 101, 2769, 4263, 872, 102]])
mask = torch.ones(tensor_input.size(-1) - 1).diag(1)[:-2]
# tensor([[0., 1., 0., 0., 0.],
# [0., 0., 1., 0., 0.],
# [0., 0., 0., 1., 0.]])
masked_input = repeat_input.masked_fill(mask == 1, 103)
# tensor([[ 101, 103, 4263, 872, 102],
# [ 101, 2769, 103, 872, 102],
# [ 101, 2769, 4263, 103, 102]])
labels = repeat_input.masked_fill( masked_input != 103, -100)
# tensor([[-100, 2769, -100, -100, -100],
# [-100, -100, 4263, -100, -100],
# [-100, -100, -100, 872, -100]])
# recent transformers versions take `labels` and return an output object;
# very old versions took `masked_lm_labels` and returned a tuple
loss = model(masked_input, labels=labels).loss
score = np.exp(loss.item())
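To see what that loss actually is: since every position labelled -100 is ignored, it is just the mean cross-entropy over the three masked positions, which you can reproduce by hand from the logits (a small sanity check, assuming a recent transformers version where the output exposes .logits):
import torch.nn.functional as F
with torch.no_grad():
    out = model(masked_input, labels=labels)
# cross_entropy also ignores targets of -100, so only the masked positions contribute
manual_loss = F.cross_entropy(out.logits.view(-1, out.logits.size(-1)), labels.view(-1))
print(torch.allclose(out.loss, manual_loss))  # True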
The function:
def score(model, tokenizer, sentence, mask_token_id=103):
    tensor_input = tokenizer.encode(sentence, return_tensors='pt')
    repeat_input = tensor_input.repeat(tensor_input.size(-1) - 2, 1)
    mask = torch.ones(tensor_input.size(-1) - 1).diag(1)[:-2]
    masked_input = repeat_input.masked_fill(mask == 1, mask_token_id)
    labels = repeat_input.masked_fill(masked_input != mask_token_id, -100)
    loss = model(masked_input, labels=labels).loss
    return np.exp(loss.item())
score(model, tokenizer, '我爱你') # returns 45.63794545581973
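Since scoring is inference only, it is worth putting the model in eval mode and disabling gradient tracking around the call:
model.eval()
with torch.no_grad():
    print(score(model, tokenizer, '我不会忘记和你一起奋斗的时光。'))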