There is an implementation of BLEU score in Python NLTK,
nltk.translate.bleu_score.corpus_bleu
But I am not sure if it is the same as the mtevalv13a.pl script.
What is the difference between them?
TL;DR
Use https://github.com/mjpost/sacrebleu when evaluating Machine Translation systems.
In Short
No, the BLEU in NLTK isn't exactly the same as mteval-13a.pl.
But it can get really close, see https://github.com/nltk/nltk/issues/1330#issuecomment-256237324:
nltk.translate.corpus_bleu corresponds to mteval-13a.pl up to the 4th order of ngram, with some floating point discrepancies.
The details of the comparison and the dataset used can be downloaded from https://github.com/nltk/nltk_data/blob/gh-pages/packages/models/wmt15_eval.zip or:
import nltk
nltk.download('wmt15_eval')
In Long
There are several differences between mteval-13a.pl and nltk.translate.corpus_bleu:
The first difference is that mteval-13a.pl comes with its own NIST tokenizer, while the NLTK version of BLEU is an implementation of the metric only and assumes that the input is pre-tokenized.
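Because NLTK assumes pre-tokenized input, the caller has to tokenize first. The function below is a deliberately simplified stand-in for what a NIST-style tokenizer does (lowercasing plus punctuation splitting); it is an illustration, not the actual mteval-13a.pl tokenization rules:

```python
import re

def rough_nist_tokenize(text):
    """Very rough approximation of NIST-style tokenization:
    lowercase, split punctuation into separate tokens, then
    split on whitespace. (Illustrative only; the real
    mteval-13a.pl rules are more involved.)"""
    text = text.lower()
    # Pad common punctuation with spaces so it becomes its own token
    text = re.sub(r'([.,!?;:"()])', r' \1 ', text)
    return text.split()

tokens = rough_nist_tokenize('It is a guide to action, which ensures peace.')
print(tokens)
# -> ['it', 'is', 'a', 'guide', 'to', 'action', ',',
#     'which', 'ensures', 'peace', '.']
```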
The other major difference is that mteval-13a.pl expects the input to be in .sgm format, while NLTK BLEU takes a Python list of lists of strings; see the README.txt in the zipball above for more information on how to convert a text file to SGM.
mteval-13a.pl expects an ngram order of at least 1-4. If the minimum ngram order for the sentence/corpus is less than 4, it will return a 0 probability, whose log is float('-inf'). To emulate this behavior, NLTK added an _emulate_multibleu flag.
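A minimal sketch of why the score collapses to 0 in that case (pure Python, independent of NLTK; the example sentence is illustrative):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

hyp = ['short', 'sentence']          # only 2 tokens: no 3-grams or 4-grams exist
ref = ['a', 'short', 'sentence']

precisions = []
for n in range(1, 5):
    hyp_counts = Counter(ngrams(hyp, n))
    ref_counts = Counter(ngrams(ref, n))
    overlap = sum((hyp_counts & ref_counts).values())  # clipped counts
    total = max(sum(hyp_counts.values()), 1)
    precisions.append(overlap / total)

# p3 and p4 are 0, so the geometric mean collapses:
# log(0) = -inf and exp(-inf) = 0, which is what mteval-13a.pl reports.
if any(p == 0 for p in precisions):
    bleu = 0.0
else:
    bleu = math.exp(sum(0.25 * math.log(p) for p in precisions))
print(precisions, bleu)  # -> [1.0, 1.0, 0.0, 0.0] 0.0
```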
mteval-13a.pl is able to generate NIST scores, while NLTK doesn't have a NIST score implementation (at least not yet).
Other than these differences, NLTK BLEU packs in more features:
to handle fringe cases that the original BLEU (Papineni et al., 2002) overlooked
Also, to handle fringe cases where the largest order of ngrams is < 4, the uniform weights of the individual ngram precisions are reweighted such that the mass of the weights sums to 1.0
while NIST has a smoothing method for geometric sequence smoothing, NLTK has an equivalent object with the same smoothing method, plus even more smoothing methods for handling sentence-level BLEU, from Chen and Cherry, 2014
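NLTK exposes these as methods on nltk.translate.bleu_score.SmoothingFunction. The idea behind the simplest of them can be sketched in pure Python (an illustrative "replace a zero precision with a small epsilon" scheme, not NLTK's exact code):

```python
import math

def smoothed_bleu(precisions, epsilon=0.1):
    """Geometric mean of ngram precisions, with zero precisions
    replaced by a small epsilon so log() never sees 0.
    (Sketch in the spirit of the Chen & Cherry 2014 smoothing
    methods; not NLTK's exact implementation.)"""
    smoothed = [p if p > 0 else epsilon for p in precisions]
    weight = 1.0 / len(smoothed)
    return math.exp(sum(weight * math.log(p) for p in smoothed))

# Without smoothing this sentence would score 0 because p4 = 0:
score = smoothed_bleu([1.0, 1.0, 0.5, 0.0])
print(round(score, 4))
```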
Lastly, to validate the features added in NLTK's version of BLEU, a regression test was added to account for them; see https://github.com/nltk/nltk/blob/develop/nltk/test/unit/translate/test_bleu.py
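The weight reweighting for short hypotheses mentioned above can be sketched as follows (an illustrative sketch, not NLTK's exact code):

```python
def reweighted(hyp_len, max_order=4):
    """Uniform weights over only the ngram orders that can actually
    occur for a hypothesis of hyp_len tokens, so the weights still
    sum to 1.0. (Illustrative sketch of the reweighting idea.)"""
    order = min(max_order, hyp_len)
    return [1.0 / order] * order

print(reweighted(6))  # -> [0.25, 0.25, 0.25, 0.25]  (normal case)
print(reweighted(2))  # -> [0.5, 0.5]                (only 1- and 2-grams exist)
```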