What is the difference between mteval-v13a.pl and NLTK BLEU?

There is an implementation of the BLEU score in Python's NLTK: nltk.translate.bleu_score.corpus_bleu

But I am not sure if it is the same as the mteval-v13a.pl script.

What is the difference between them?

asked Sep 06 '17 by Ssamu Vut

1 Answer

TL;DR

Use https://github.com/mjpost/sacrebleu when evaluating Machine Translation systems.
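For instance, a minimal sketch of scoring a system with sacrebleu (the sentences below are made-up examples; note that sacrebleu applies mteval's 13a tokenization internally by default, so the inputs are raw, untokenized strings):

import sacrebleu

# Made-up example data: one hypothesis and one reference stream.
hypotheses = ['the cat sat on a mat']
references = [['the cat sat on the mat']]

# sacrebleu tokenizes internally (the '13a' tokenizer by default),
# matching the preprocessing of mteval-v13a.pl.
bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(bleu.score)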

In Short

No, the BLEU in NLTK isn't exactly the same as mteval-v13a.pl.

But it can get really close, see https://github.com/nltk/nltk/issues/1330#issuecomment-256237324

nltk.translate.corpus_bleu corresponds to mteval-v13a.pl up to the 4th order of ngram, with some floating point discrepancies

The details of the comparison and the dataset used can be downloaded from https://github.com/nltk/nltk_data/blob/gh-pages/packages/models/wmt15_eval.zip or:

import nltk
nltk.download('wmt15_eval')
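For a quick feel of the NLTK side, here is a minimal sketch of corpus_bleu on made-up, pre-tokenized data (references come first, with one list of references per hypothesis):

from nltk.translate.bleu_score import corpus_bleu

# Made-up, pre-tokenized data: one hypothesis with one reference.
references = [[['the', 'cat', 'sat', 'on', 'the', 'mat']]]
hypotheses = [['the', 'cat', 'sat', 'on', 'a', 'mat']]

print(corpus_bleu(references, hypotheses))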

The major differences are detailed below.



In Long

There are several differences between mteval-v13a.pl and nltk.translate.corpus_bleu:

  • The first difference is that mteval-v13a.pl comes with its own NIST tokenizer, while the NLTK version of BLEU implements only the metric and assumes that the input is pre-tokenized (see the tokenizer sketch after this list).

    • BTW, this ongoing PR will bridge the gap between the NLTK and NIST tokenizers
  • The other major difference is that mteval-v13a.pl expects the input to be in .sgm format, while NLTK BLEU takes a Python list of lists of strings; see the README.txt in the zipball here for more information on how to convert a text file to SGM.

  • mteval-v13a.pl expects ngrams of order 1-4. If the highest ngram order available for the sentence/corpus is less than 4, it will return a score of 0, i.e. a log precision of float('-inf'). To emulate this behavior, NLTK added an _emulate_multibleu flag:

    • See https://github.com/nltk/nltk/blob/develop/nltk/translate/bleu_score.py#L477
  • mteval-v13a.pl is able to generate NIST scores, while NLTK doesn't have a NIST score implementation (at least not yet; see the sketch after this list)

    • A NIST score for NLTK is upcoming in this PR
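On the tokenizer gap from the first bullet: the PR mentioned above eventually landed in NLTK, so a hedged sketch (assuming nltk.tokenize.nist.NISTTokenizer and its data dependencies are available in your NLTK version) looks like:

import nltk

# Data packages the NIST tokenizer depends on.
nltk.download('perluniprops')
nltk.download('nonbreaking_prefixes')

from nltk.tokenize.nist import NISTTokenizer

nist = NISTTokenizer()
# Tokenizing first approximates the preprocessing that
# mteval-v13a.pl performs internally before computing BLEU.
print(nist.tokenize('Good muffins cost $3.88 in New York.'))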
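And on the last bullet: the NIST metric has since landed as well, so a hedged sketch assuming nltk.translate.nist_score is available:

from nltk.translate.nist_score import corpus_nist

# Same made-up, pre-tokenized data shape as corpus_bleu above.
references = [[['the', 'cat', 'sat', 'on', 'the', 'mat']]]
hypotheses = [['the', 'cat', 'sat', 'on', 'a', 'mat']]

print(corpus_nist(references, hypotheses, n=5))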

Other than those differences, NLTK's BLEU packs in more features:

  • to handle fringe cases that the original BLEU (Papineni et al., 2002) overlooked

    • See https://github.com/nltk/nltk/pull/1383
  • Also, to handle fringe cases where the largest ngram order is < 4, the uniform weights of the individual ngram precisions are reweighted so that the weights sum to 1.0

    • See https://github.com/nltk/nltk/blob/develop/nltk/translate/bleu_score.py#L175
  • while NIST's mteval has a geometric sequence smoothing method, NLTK has an equivalent object with the same smoothing method, plus even more smoothing methods to handle sentence-level BLEU, from Chen and Cherry (2014); see the sketch after this list
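For example, a minimal sketch of the sentence-level smoothing methods exposed through SmoothingFunction (method3 is the NIST-style geometric sequence smoothing; the data is made up):

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [['the', 'cat', 'sat', 'on', 'the', 'mat']]
hypothesis = ['the', 'cat', 'sat']  # too short to contain any 4-gram

chencherry = SmoothingFunction()
# Without smoothing, the zero 4-gram precision would drive BLEU to 0;
# method3 applies NIST-style geometric sequence smoothing instead.
print(sentence_bleu(reference, hypothesis, smoothing_function=chencherry.method3))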

Lastly, to validate the features added in NLTK's version of BLEU, a regression test was added to account for them; see https://github.com/nltk/nltk/blob/develop/nltk/test/unit/translate/test_bleu.py

answered Oct 25 '22 by alvas