 

How do I use BertForMaskedLM or BertModel to calculate perplexity of a sentence?

I want to use BertForMaskedLM or BertModel to calculate the perplexity of a sentence, so I wrote code like this:

import numpy as np
import torch
import torch.nn as nn
from transformers import BertTokenizer, BertForMaskedLM
# Load pre-trained model (weights)
with torch.no_grad():
    model = BertForMaskedLM.from_pretrained('hfl/chinese-bert-wwm-ext')
    model.eval()
    # Load pre-trained model tokenizer (vocabulary)
    tokenizer = BertTokenizer.from_pretrained('hfl/chinese-bert-wwm-ext')
    sentence = "我不会忘记和你一起奋斗的时光。"
    tokenize_input = tokenizer.tokenize(sentence)
    tensor_input = torch.tensor([tokenizer.convert_tokens_to_ids(tokenize_input)])
    sen_len = len(tokenize_input)
    sentence_loss = 0.

    for i, word in enumerate(tokenize_input):
        # add mask to i-th character of the sentence
        tokenize_input[i] = '[MASK]'
        mask_input = torch.tensor([tokenizer.convert_tokens_to_ids(tokenize_input)])

        output = model(mask_input)

        prediction_scores = output[0]
        softmax = nn.Softmax(dim=0)
        ps = softmax(prediction_scores[0, i]).log()
        word_loss = ps[tensor_input[0, i]]
        sentence_loss += word_loss.item()

        tokenize_input[i] = word
    ppl = np.exp(-sentence_loss/sen_len)
    print(ppl)

I think this code is right, but I also noticed BertForMaskedLM's masked_lm_labels parameter. Could I use this parameter to calculate the PPL of a sentence more easily? I know the input_ids argument is the masked input and the masked_lm_labels argument is the desired output, but I can't understand the actual meaning of the loss it returns. The relevant source code looks like this:

if masked_lm_labels is not None:
    loss_fct = CrossEntropyLoss()  # -100 index = padding token
    masked_lm_loss = loss_fct(prediction_scores.view(-1, self.config.vocab_size),
                              masked_lm_labels.view(-1))
    outputs = (masked_lm_loss,) + outputs
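
From the source, CrossEntropyLoss uses ignore_index=-100 by default, so (if I understand correctly) the loss is averaged only over positions whose label is not -100, i.e. only over the masked tokens. A toy sketch with made-up scores (not the model's actual output):

import torch
import torch.nn as nn

vocab_size = 5
prediction_scores = torch.randn(1, 3, vocab_size)   # fake scores for a 3-token input
masked_lm_labels = torch.tensor([[-100, 2, -100]])  # only position 1 counts

loss_fct = nn.CrossEntropyLoss()                    # ignore_index defaults to -100
masked_lm_loss = loss_fct(prediction_scores.view(-1, vocab_size),
                          masked_lm_labels.view(-1))

# Same value by hand: negative log-softmax score of token 2 at position 1.
manual = -prediction_scores[0, 1].log_softmax(dim=-1)[2]
print(masked_lm_loss.item(), manual.item())         # the two numbers match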
asked Jul 22 '20 by Kaim hong

People also ask

How do you calculate perplexity in a sentence?

As you said in your question, the probability of a sentence under a unigram model is p(s) = ∏_{i=1}^{n} p(w_i), where p(w_i) is the probability that the word w_i occurs. The perplexity is this probability inverted and normalized by the number of words, PPL(s) = p(s)^(-1/n); applying the same formula to a whole corpus gives the perplexity of the corpus.
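
A toy sketch with made-up unigram probabilities (not taken from any corpus) shows how the numbers combine:

import numpy as np

p_w = [0.1, 0.05, 0.2]                    # assumed p(w_i) for a three-word sentence
p_s = np.prod(p_w)                        # p(s) = product of the unigram probabilities
n = len(p_w)
ppl = p_s ** (-1 / n)                     # perplexity = p(s)^(-1/n)
ppl_log = np.exp(-np.mean(np.log(p_w)))   # equivalent log-space form
print(ppl, ppl_log)                       # both are 10.0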

What is the perplexity of a language model?

Perplexity is the multiplicative inverse of the probability the language model assigns to the test set, normalized by the number of words in the test set. If a language model predicts unseen sentences from the test set well, i.e. it assigns them a high probability, then it has a low perplexity and is considered more accurate.
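
As a toy sketch with an assumed test-set probability and word count (both made up):

p_test_set = 1e-6                    # probability assigned to the whole test set
N = 4                                # number of words in the test set
ppl = (1 / p_test_set) ** (1 / N)    # inverse probability, normalized by N
print(ppl)                           # ≈ 31.62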

Is Bert a language model?

BERT is an open source machine learning framework for natural language processing (NLP). BERT is designed to help computers understand the meaning of ambiguous language in text by using surrounding text to establish context.


1 Answer

Yes, you can use the labels parameter (called masked_lm_labels in older versions of huggingface transformers) to specify the target tokens at the masked positions, and use -100 for the tokens that you don't want to include in the loss computation. For example,

from transformers import BertTokenizer, BertForMaskedLM
import torch
import numpy as np

tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')
model = BertForMaskedLM.from_pretrained('bert-base-chinese')

sentence = '我爱你'

tensor_input = tokenizer.encode(sentence, return_tensors='pt')
# tensor([[ 101, 2769, 4263,  872,  102]])

repeat_input = tensor_input.repeat(tensor_input.size(-1)-2, 1)
# tensor([[ 101, 2769, 4263,  872,  102],
#         [ 101, 2769, 4263,  872,  102],
#         [ 101, 2769, 4263,  872,  102]])

mask = torch.ones(tensor_input.size(-1) - 1).diag(1)[:-2]
# tensor([[0., 1., 0., 0., 0.],
#         [0., 0., 1., 0., 0.],
#         [0., 0., 0., 1., 0.]])

masked_input = repeat_input.masked_fill(mask == 1, 103)  # 103 is the [MASK] token id in the bert-base-chinese vocab
# tensor([[ 101,  103, 4263,  872,  102],
#         [ 101, 2769,  103,  872,  102],
#         [ 101, 2769, 4263,  103,  102]])

labels = repeat_input.masked_fill(masked_input != 103, -100)
# tensor([[-100, 2769, -100, -100, -100],
#         [-100, -100, 4263, -100, -100],
#         [-100, -100, -100,  872, -100]])

loss, _ = model(masked_input, masked_lm_labels=labels)

score = np.exp(loss.item())

The function:

def score(model, tokenizer, sentence, mask_token_id=103):
    tensor_input = tokenizer.encode(sentence, return_tensors='pt')
    repeat_input = tensor_input.repeat(tensor_input.size(-1) - 2, 1)
    mask = torch.ones(tensor_input.size(-1) - 1).diag(1)[:-2]
    masked_input = repeat_input.masked_fill(mask == 1, mask_token_id)
    labels = repeat_input.masked_fill(masked_input != mask_token_id, -100)
    loss, _ = model(masked_input, masked_lm_labels=labels)
    return np.exp(loss.item())

score(model, tokenizer, '我爱你') # returns 45.63794545581973
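
In more recent versions of transformers the argument is labels rather than masked_lm_labels, and the forward pass returns an output object instead of a tuple, so an equivalent sketch for the newer API (an assumption, check the version you have installed) would look roughly like this:

def score_new_api(model, tokenizer, sentence):
    tensor_input = tokenizer.encode(sentence, return_tensors='pt')
    repeat_input = tensor_input.repeat(tensor_input.size(-1) - 2, 1)
    mask = torch.ones(tensor_input.size(-1) - 1).diag(1)[:-2]
    masked_input = repeat_input.masked_fill(mask == 1, tokenizer.mask_token_id)
    labels = repeat_input.masked_fill(masked_input != tokenizer.mask_token_id, -100)
    with torch.no_grad():                        # scoring only, no gradients needed
        loss = model(masked_input, labels=labels).loss
    return np.exp(loss.item())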
answered Nov 08 '22 by emily