I want to use BertForMaskedLM or BertModel to calculate perplexity of a sentence, so I write code like this:
import numpy as np
import torch
import torch.nn as nn
from transformers import BertTokenizer, BertForMaskedLM
# Load pre-trained model (weights)
with torch.no_grad():
    model = BertForMaskedLM.from_pretrained('hfl/chinese-bert-wwm-ext')
    model.eval()
    # Load pre-trained model tokenizer (vocabulary)
    tokenizer = BertTokenizer.from_pretrained('hfl/chinese-bert-wwm-ext')
    sentence = "我不会忘记和你一起奋斗的时光。"
    tokenize_input = tokenizer.tokenize(sentence)
    tensor_input = torch.tensor([tokenizer.convert_tokens_to_ids(tokenize_input)])
    sen_len = len(tokenize_input)
    sentence_loss = 0.
    for i, word in enumerate(tokenize_input):
        # add mask to i-th character of the sentence
        tokenize_input[i] = '[MASK]'
        mask_input = torch.tensor([tokenizer.convert_tokens_to_ids(tokenize_input)])
        output = model(mask_input)
        prediction_scores = output[0]
        softmax = nn.Softmax(dim=0)
        ps = softmax(prediction_scores[0, i]).log()
        word_loss = ps[tensor_input[0, i]]
        sentence_loss += word_loss.item()
        tokenize_input[i] = word
    ppl = np.exp(-sentence_loss / sen_len)
    print(ppl)
I think this code is right, but I also noticed BertForMaskedLM's parameter masked_lm_labels, so could I use this parameter to calculate the PPL of a sentence more easily?
I know the input_ids argument is the masked input and the masked_lm_labels argument is the desired output, but I can't understand the actual meaning of the loss it returns. The relevant code in the library looks like this:
if masked_lm_labels is not None:
    loss_fct = CrossEntropyLoss()  # -100 index = padding token
    masked_lm_loss = loss_fct(prediction_scores.view(-1, self.config.vocab_size),
                              masked_lm_labels.view(-1))
    outputs = (masked_lm_loss,) + outputs
In a unigram model, the probability of a sentence s = w_1 ... w_n is p(s) = ∏_{i=1}^{n} p(w_i), where p(w_i) is the probability that the word w_i occurs. Perplexity is the multiplicative inverse of the probability assigned to the test set by the language model, normalized by the number of words in the test set: PPL = p(s)^(-1/n). A language model that predicts the test set well assigns it a high probability, and therefore has a low perplexity.
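As a toy illustration (the probabilities below are made up, not produced by any real model), perplexity can be computed directly from per-word probabilities in log space:
import numpy as np

# made-up per-word probabilities for a 4-word sentence
word_probs = [0.1, 0.2, 0.05, 0.3]
log_prob = np.sum(np.log(word_probs))      # log p(s) = sum of log p(w_i)
ppl = np.exp(-log_prob / len(word_probs))  # exp of the average negative log-likelihood
print(ppl)                                 # ~7.6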
BERT is an open-source language representation model for natural language processing (NLP). It is designed to help computers understand the meaning of ambiguous language in text by using the surrounding text to establish context.
Yes, you can use the parameter labels (or masked_lm_labels; the parameter name varies across versions of Hugging Face transformers) to specify the masked token positions, and use -100 to ignore the tokens that you don't want to include in the loss computation.
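The -100 convention works because CrossEntropyLoss ignores any target equal to its ignore_index, which defaults to -100. A minimal sketch with made-up logits showing that the ignored positions drop out of the loss:
import torch
import torch.nn as nn

loss_fct = nn.CrossEntropyLoss()          # ignore_index defaults to -100
logits = torch.randn(3, 10)               # 3 token positions, toy vocab of size 10
labels = torch.tensor([-100, 4, -100])    # only the middle position counts
print(torch.allclose(loss_fct(logits, labels),
                     loss_fct(logits[1:2], labels[1:2])))  # True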
For example,
sentence='我爱你'
from transformers import BertTokenizer, BertForMaskedLM
import torch
import numpy as np
tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')
model = BertForMaskedLM.from_pretrained('bert-base-chinese')
tensor_input = tokenizer.encode(sentence, return_tensors='pt')
# tensor([[ 101, 2769, 4263, 872, 102]])
repeat_input = tensor_input.repeat(tensor_input.size(-1)-2, 1)
# tensor([[ 101, 2769, 4263, 872, 102],
# [ 101, 2769, 4263, 872, 102],
# [ 101, 2769, 4263, 872, 102]])
mask = torch.ones(tensor_input.size(-1) - 1).diag(1)[:-2]
# tensor([[0., 1., 0., 0., 0.],
# [0., 0., 1., 0., 0.],
# [0., 0., 0., 1., 0.]])
masked_input = repeat_input.masked_fill(mask == 1, 103)
# tensor([[ 101, 103, 4263, 872, 102],
# [ 101, 2769, 103, 872, 102],
# [ 101, 2769, 4263, 103, 102]])
labels = repeat_input.masked_fill( masked_input != 103, -100)
# tensor([[-100, 2769, -100, -100, -100],
# [-100, -100, 4263, -100, -100],
# [-100, -100, -100, 872, -100]])
# recent transformers versions take `labels` and return an output object;
# very old versions took `masked_lm_labels` and returned a tuple
loss = model(masked_input, labels=labels).loss
score = np.exp(loss.item())
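To see what that loss actually is: since every position labelled -100 is ignored, it is just the mean cross-entropy over the three masked positions, which you can reproduce by hand from the logits (a small sanity check, assuming a recent transformers version where the output exposes .logits):
import torch.nn.functional as F
with torch.no_grad():
    out = model(masked_input, labels=labels)
# cross_entropy also ignores targets of -100, so only the masked positions contribute
manual_loss = F.cross_entropy(out.logits.view(-1, out.logits.size(-1)), labels.view(-1))
print(torch.allclose(out.loss, manual_loss))  # True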
The function:
def score(model, tokenizer, sentence, mask_token_id=103):
    tensor_input = tokenizer.encode(sentence, return_tensors='pt')
    repeat_input = tensor_input.repeat(tensor_input.size(-1) - 2, 1)
    mask = torch.ones(tensor_input.size(-1) - 1).diag(1)[:-2]
    masked_input = repeat_input.masked_fill(mask == 1, mask_token_id)
    labels = repeat_input.masked_fill(masked_input != mask_token_id, -100)
    loss = model(masked_input, labels=labels).loss
    return np.exp(loss.item())
score(model, tokenizer, '我爱你') # returns 45.63794545581973
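Since scoring is inference only, it is worth putting the model in eval mode and disabling gradient tracking around the call:
model.eval()
with torch.no_grad():
    print(score(model, tokenizer, '我不会忘记和你一起奋斗的时光。'))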