I have imported nltk in Python to calculate the BLEU score on Ubuntu. I understand how sentence-level BLEU score works, but I don't understand how corpus-level BLEU score works.
Below is my code for corpus-level BLEU score:
import nltk
hypothesis = ['This', 'is', 'cat']
reference = ['This', 'is', 'a', 'cat']
BLEUscore = nltk.translate.bleu_score.corpus_bleu([reference], [hypothesis], weights = [1])
print(BLEUscore)
For some reason, the BLEU score is 0 for the above code. I was expecting a corpus-level BLEU score of at least 0.5.
Here is my code for sentence-level BLEU score:
import nltk
hypothesis = ['This', 'is', 'cat']
reference = ['This', 'is', 'a', 'cat']
BLEUscore = nltk.translate.bleu_score.sentence_bleu([reference], hypothesis, weights = [1])
print(BLEUscore)
Here the sentence-level BLEU score is 0.71, which I expect, taking into account the brevity penalty and the missing word "a". However, I don't understand how the corpus-level BLEU score works.
Any help would be appreciated.
BLEU is one of the most popular metrics for evaluating generated text, although it is far from perfect and has many drawbacks.
Corpus BLEU score: the references must be specified as a list of documents, where each document is a list of alternative references and each reference is a list of tokens, i.e. a list of lists of lists of tokens.
To calculate the BLEU score, the brevity penalty is multiplied by the geometric mean of the n-gram precision scores. BLEU can be computed for different maximum n-gram orders N; typically N = 4 is used. Therefore you cannot compute the BLEU score separately on each sentence in the corpus and then average those scores in some way. Because of the geometric mean, any single zero precision makes the whole score 0, which is exactly what NLTK warns about:
/home/mac/.local/lib/python3.6/site-packages/nltk/translate/bleu_score.py:516: UserWarning: The hypothesis contains 0 counts of 4-gram overlaps. Therefore the BLEU score evaluates to 0, independently of how many N-gram overlaps of lower order it contains. Consider using lower n-gram order or use SmoothingFunction()
Sentence-level BLEU can use different smoothing techniques that ensure the score still gets reasonable values even if the 3-gram or 4-gram precision is zero. However, note that BLEU as a sentence-level metric is very unreliable.
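To make the formula concrete, here is a small sketch (my own illustration; ngrams and manual_bleu are ad-hoc helpers, not part of NLTK) that computes BLEU by hand for the example from the question. With max_n=1 it reproduces the ~0.71 sentence-level score; with max_n=4 it drops to 0 because the 3-gram and 4-gram precisions are zero.
import math
from collections import Counter

def ngrams(tokens, n):
    # All contiguous n-grams of the token list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def manual_bleu(reference, hypothesis, max_n=4):
    precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = Counter(ngrams(hypothesis, n))
        ref_counts = Counter(ngrams(reference, n))
        overlap = sum(min(c, ref_counts[ng]) for ng, c in hyp_counts.items())
        total = sum(hyp_counts.values())
        precisions.append(overlap / total if total else 0.0)
    if min(precisions) == 0:
        # One zero precision zeroes the geometric mean, hence the warning above.
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty: penalise hypotheses shorter than the reference.
    bp = min(1.0, math.exp(1 - len(reference) / len(hypothesis)))
    return bp * geo_mean

print(manual_bleu(['This', 'is', 'a', 'cat'], ['This', 'is', 'cat'], max_n=1))  # ~0.7165
print(manual_bleu(['This', 'is', 'a', 'cat'], ['This', 'is', 'cat'], max_n=4))  # 0.0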
TL;DR:
>>> import nltk
>>> hypothesis = ['This', 'is', 'cat']
>>> reference = ['This', 'is', 'a', 'cat']
>>> references = [reference] # list of references for 1 sentence.
>>> list_of_references = [references] # list of references for all sentences in corpus.
>>> list_of_hypotheses = [hypothesis] # list of hypotheses that corresponds to list of references.
>>> nltk.translate.bleu_score.corpus_bleu(list_of_references, list_of_hypotheses)
0.6025286104785453
>>> nltk.translate.bleu_score.sentence_bleu(references, hypothesis)
0.6025286104785453
(Note: You have to pull the latest version of NLTK on the develop branch in order to get a stable version of the BLEU score implementation.)
In Long:
Actually, if there's only one reference and one hypothesis in your whole corpus, both corpus_bleu() and sentence_bleu() should return the same value, as shown in the example above.
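Where they do diverge is when the corpus has more than one sentence: corpus_bleu() pools the n-gram counts and the lengths over all hypotheses before combining them, rather than averaging per-sentence scores. A small sketch with a made-up second sentence pair (the dog/mat sentences below are invented for illustration), using unigram weights as in the question:
from nltk.translate.bleu_score import corpus_bleu, sentence_bleu

refs1 = [['This', 'is', 'a', 'cat']]
hyp1 = ['This', 'is', 'cat']
refs2 = [['The', 'dog', 'sleeps', 'on', 'the', 'mat']]   # invented example
hyp2 = ['The', 'dog', 'sleeps', 'on', 'mat']

list_of_references = [refs1, refs2]
hypotheses = [hyp1, hyp2]

corpus_score = corpus_bleu(list_of_references, hypotheses, weights=[1])
avg_sentence = (sentence_bleu(refs1, hyp1, weights=[1]) +
                sentence_bleu(refs2, hyp2, weights=[1])) / 2

print(corpus_score)   # counts and lengths pooled over the whole corpus
print(avg_sentence)   # mean of per-sentence scores; generally a different number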
In the code, we see that sentence_bleu is actually a duck-type of corpus_bleu:
def sentence_bleu(references, hypothesis, weights=(0.25, 0.25, 0.25, 0.25),
smoothing_function=None):
return corpus_bleu([references], [hypothesis], weights, smoothing_function)
And if we look at the parameters for sentence_bleu:
def sentence_bleu(references, hypothesis, weights=(0.25, 0.25, 0.25, 0.25),
smoothing_function=None):
""""
:param references: reference sentences
:type references: list(list(str))
:param hypothesis: a hypothesis sentence
:type hypothesis: list(str)
:param weights: weights for unigrams, bigrams, trigrams and so on
:type weights: list(float)
:return: The sentence-level BLEU score.
:rtype: float
"""
The input for sentence_bleu's references is a list(list(str)).
So if you have a sentence string, e.g. "This is a cat", you have to tokenize it to get a list of strings, ["This", "is", "a", "cat"], and since it allows for multiple references, it has to be a list of lists of strings, e.g. if you have a second reference, "This is a feline", your input to sentence_bleu() would be:
references = [ ["This", "is", "a", "cat"], ["This", "is", "a", "feline"] ]
hypothesis = ["This", "is", "cat"]
sentence_bleu(references, hypothesis)
When it comes to the corpus_bleu() list_of_references parameter, it's basically a list of whatever sentence_bleu() takes as references:
def corpus_bleu(list_of_references, hypotheses, weights=(0.25, 0.25, 0.25, 0.25),
smoothing_function=None):
"""
:param list_of_references: a corpus of lists of reference sentences, w.r.t. hypotheses
:type list_of_references: list(list(list(str)))
:param hypotheses: a list of hypothesis sentences
:type hypotheses: list(list(str))
:param weights: weights for unigrams, bigrams, trigrams and so on
:type weights: list(float)
:return: The corpus-level BLEU score.
:rtype: float
"""
Other than looking at the doctest within nltk/translate/bleu_score.py, you can also take a look at the unit tests in nltk/test/unit/translate/test_bleu_score.py to see how to use each of the components within bleu_score.py.
By the way, since sentence_bleu is imported as bleu in nltk.translate.__init__.py (https://github.com/nltk/nltk/blob/develop/nltk/translate/__init__.py#L21), using
from nltk.translate import bleu
would be the same as:
from nltk.translate.bleu_score import sentence_bleu
and in code:
>>> from nltk.translate import bleu
>>> from nltk.translate.bleu_score import sentence_bleu
>>> from nltk.translate.bleu_score import corpus_bleu
>>> bleu == sentence_bleu
True
>>> bleu == corpus_bleu
False