 

NLTK: corpus-level BLEU vs sentence-level BLEU score

I have imported nltk in Python to calculate the BLEU score on Ubuntu. I understand how the sentence-level BLEU score works, but I don't understand how the corpus-level BLEU score works.

Below is my code for the corpus-level BLEU score:

import nltk

hypothesis = ['This', 'is', 'cat'] 
reference = ['This', 'is', 'a', 'cat']
BLEUscore = nltk.translate.bleu_score.corpus_bleu([reference], [hypothesis], weights = [1])
print(BLEUscore)

For some reason, the BLEU score is 0 for the above code. I was expecting a corpus-level BLEU score of at least 0.5.

Here is my code for the sentence-level BLEU score:

import nltk

hypothesis = ['This', 'is', 'cat'] 
reference = ['This', 'is', 'a', 'cat']
BLEUscore = nltk.translate.bleu_score.sentence_bleu([reference], hypothesis, weights = [1])
print(BLEUscore)

Here the sentence-level BLEU score is 0.71, which I expected, taking into account the brevity penalty and the missing word "a". However, I don't understand how the corpus-level BLEU score works.
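(Working the 0.71 out by hand: all three hypothesis unigrams appear in the reference, so the unigram precision is 3/3 = 1, and the brevity penalty is exp(1 - 4/3) ≈ 0.7165, which gives the score above.)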

Any help would be appreciated.

asked Nov 11 '16 by Long Le Minh



1 Answer

TL;DR:

>>> import nltk
>>> hypothesis = ['This', 'is', 'cat'] 
>>> reference = ['This', 'is', 'a', 'cat']
>>> references = [reference] # list of references for 1 sentence.
>>> list_of_references = [references] # list of references for all sentences in corpus.
>>> list_of_hypotheses = [hypothesis] # list of hypotheses that corresponds to list of references.
>>> nltk.translate.bleu_score.corpus_bleu(list_of_references, list_of_hypotheses)
0.6025286104785453
>>> nltk.translate.bleu_score.sentence_bleu(references, hypothesis)
0.6025286104785453

(Note: You have to pull the latest version of NLTK on the develop branch in order to get a stable version of the BLEU score implementation)


In Long:

Actually, if there's only one reference and one hypothesis in your whole corpus, both corpus_bleu() and sentence_bleu() should return the same value as shown in the example above.
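
As a side note, the 0 in the question comes from a missing level of nesting: corpus_bleu() expects one list of reference sentences per hypothesis. A minimal sketch of the difference, using the question's unigram weights:

import nltk

hypothesis = ['This', 'is', 'cat']
reference = ['This', 'is', 'a', 'cat']

# Missing a nesting level: each token string inside `reference` is
# treated as its own reference sentence, so nothing matches -> score 0.
print(nltk.translate.bleu_score.corpus_bleu([reference], [hypothesis], weights=[1]))

# Correct nesting: wrap the references for each hypothesis in a list.
print(nltk.translate.bleu_score.corpus_bleu([[reference]], [hypothesis], weights=[1]))  # ~0.7165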

In the code, we see that sentence_bleu() is actually just a thin wrapper around corpus_bleu():

def sentence_bleu(references, hypothesis, weights=(0.25, 0.25, 0.25, 0.25),
                  smoothing_function=None):
    return corpus_bleu([references], [hypothesis], weights, smoothing_function)

And if we look at the parameters for sentence_bleu:

def sentence_bleu(references, hypothesis, weights=(0.25, 0.25, 0.25, 0.25),
                  smoothing_function=None):
    """
    :param references: reference sentences
    :type references: list(list(str))
    :param hypothesis: a hypothesis sentence
    :type hypothesis: list(str)
    :param weights: weights for unigrams, bigrams, trigrams and so on
    :type weights: list(float)
    :return: The sentence-level BLEU score.
    :rtype: float
    """

The input for sentence_bleu's references is a list(list(str)).

So if you have a sentence string, e.g. "This is a cat", you first have to tokenize it into a list of strings, ["This", "is", "a", "cat"]. And since sentence_bleu() allows multiple references, references has to be a list of lists of strings; e.g. if you have a second reference, "This is a feline", your input to sentence_bleu() would be:

from nltk.translate.bleu_score import sentence_bleu

references = [["This", "is", "a", "cat"], ["This", "is", "a", "feline"]]
hypothesis = ["This", "is", "cat"]
# Unigram weights; the 3-token hypothesis has no 4-grams for the default BLEU-4.
print(sentence_bleu(references, hypothesis, weights=[1]))  # ~0.7165

When it comes to corpus_bleu()'s list_of_references parameter, it's basically a list of whatever sentence_bleu() takes as references:

def corpus_bleu(list_of_references, hypotheses, weights=(0.25, 0.25, 0.25, 0.25),
                smoothing_function=None):
    """
    :param list_of_references: a corpus of lists of reference sentences, w.r.t. hypotheses
    :type list_of_references: list(list(list(str)))
    :param hypotheses: a list of hypothesis sentences
    :type hypotheses: list(list(str))
    :param weights: weights for unigrams, bigrams, trigrams and so on
    :type weights: list(float)
    :return: The corpus-level BLEU score.
    :rtype: float
    """

Other than looking at the doctests within nltk/translate/bleu_score.py, you can also take a look at the unit tests in nltk/test/unit/translate/test_bleu_score.py to see how to use each of the components within bleu_score.py.

By the way, since sentence_bleu is imported as bleu in [nltk.translate.__init__.py](https://github.com/nltk/nltk/blob/develop/nltk/translate/__init__.py#L21), using

from nltk.translate import bleu 

would be the same as:

from nltk.translate.bleu_score import sentence_bleu

and in code:

>>> from nltk.translate import bleu
>>> from nltk.translate.bleu_score import sentence_bleu
>>> from nltk.translate.bleu_score import corpus_bleu
>>> bleu == sentence_bleu
True
>>> bleu == corpus_bleu
False
answered Sep 18 '22 by alvas