Variation in BLEU Score

I have a question about BLEU score calculation for machine translation. I realized there may be different variants of the BLEU metric. The code I found reports five values: BLEU-1, BLEU-2, BLEU-3, BLEU-4 and finally BLEU, which seems to be an exponential average of the previous four. It is still not clear to me what the difference between them is. Do you have any ideas? Thanks

P.S. At first I thought this question was more theoretical and posted it on Meta Stack Exchange. A moderator closed it and commented that it is a Stack Overflow type of question, so please don't punish me again. =)

Jürgen K. asked Jun 02 '17 08:06


People also ask

What is a good BLEU 1 score?

BLEU scores range from 0 to 1. In practice, a score of 0.6 or 0.7 is considered about the best you can achieve, because even two human translators would likely produce different variants of the same sentence and would rarely match each other perfectly.

Can we compare BLEU scores across language pairs?

A comparison between BLEU scores is only justifiable when the results come from the same test set, the same language pair, and the same MT engine. A BLEU score from a different test set is bound to be different.

What are the shortcomings of the BLEU score?

For a BLEU score, an error is just that: an error. In real life, a single misplaced word can change the entire meaning of a sentence, but BLEU does not take the gravity of errors into consideration. Generally speaking, the score is not suitable for judging individual translations; it was designed to compare machine translation output at the corpus level.


2 Answers

source: http://www.statmt.org/book/slides/08-evaluation.pdf

I haven't heard of BLEU-1 and BLEU-2, but I guess BLEU-1 through BLEU-4 refer to the 1-gram, 2-gram, 3-gram and 4-gram precisions in the BLEU formula. That is, precision[i] in the formula corresponds to BLEU-i in your question:
[Image: BLEU score formula from the slides linked above]
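
For reference, a common way to write the formula is sketched below. It assumes the linked slides follow the usual definition with n-grams up to length 4, where precision_i is the clipped i-gram precision of the output against the reference. The exact form of the brevity penalty differs between sources, so treat this as an illustration rather than the precise content of the image.

```latex
\text{BLEU} = \min\left(1, \frac{\text{output length}}{\text{reference length}}\right)
\left( \prod_{i=1}^{4} \text{precision}_i \right)^{\frac{1}{4}}
```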

Iman Mirzadeh answered Oct 21 '22 11:10


Actually, BLEU-n does not use only the n-gram precision. It computes the 1-gram through n-gram precisions and gives them equal weight when combining them into a final score. See the "Cumulative N-Gram Scores" section at this link for more info.
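
To make the difference concrete, here is a minimal sketch in plain Python that computes both the individual n-gram precisions and the cumulative BLEU-n scores for one sentence pair. The function names and the example sentences are made up for illustration, and the brevity penalty follows the exponential form from the original BLEU paper. Real toolkits add smoothing and support multiple references, so their numbers can differ.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return all n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, reference, n):
    """Clipped n-gram precision of the candidate against one reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs in the reference.
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / total


def brevity_penalty(candidate, reference):
    """Penalize candidates that are shorter than the reference."""
    c, r = len(candidate), len(reference)
    if c == 0:
        return 0.0
    return 1.0 if c >= r else math.exp(1 - r / c)


def cumulative_bleu(candidate, reference, max_n):
    """BLEU-max_n: equally weighted geometric mean of the 1..max_n precisions."""
    precisions = [modified_precision(candidate, reference, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # no smoothing in this sketch
    log_mean = sum(math.log(p) for p in precisions) / max_n
    return brevity_penalty(candidate, reference) * math.exp(log_mean)


# Hypothetical example sentences, already tokenized by whitespace.
reference = "the cat is on the mat".split()
candidate = "the cat sat on the mat".split()

for n in range(1, 5):
    print(f"{n}-gram precision: {modified_precision(candidate, reference, n):.4f}  "
          f"cumulative BLEU-{n}: {cumulative_bleu(candidate, reference, n):.4f}")
```

In this example the 2-gram precision alone is 0.6, while cumulative BLEU-2 is about 0.7071, because it also folds in the 1-gram precision. BLEU-4 drops to 0 because no 4-gram matches and the sketch applies no smoothing.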

Tara Eicher answered Oct 21 '22 11:10