I have a question about BLEU score calculation for machine translation. I realized there may be different variants of BLEU: the code I found reports five values, namely BLEU-1, BLEU-2, BLEU-3, BLEU-4, and finally BLEU, which seems to be an exponential average of the previous four. It is still not clear to me what the difference between them is. Do you have any ideas? Thanks
P.S. At first I thought this question was more theoretical in nature and posted it on Meta Stack Exchange. A moderator closed it, commenting that it was a Stack Overflow type question, so please don't punish me again. =)
BLEU scores are between 0 and 1. A score of 0.6 or 0.7 is considered about the best you can achieve: even two human translators would likely produce different variants of the same sentence and would rarely achieve a perfect match.
A comparison between BLEU scores is only justified when the results are computed on the same test set, with the same language pair and the same MT engine. A BLEU score from a different test set is bound to be different.
For a BLEU score, an error is just that: an error. In real life, a word placed incorrectly within a sentence can change its entire meaning, but BLEU does not take the gravity of errors into account. Generally speaking, the score is not suitable (and was never intended) for judging the quality of individual translations.
source: http://www.statmt.org/book/slides/08-evaluation.pdf
I hadn't heard of BLEU-1 and BLEU-2 before, but I guess they refer to the 1-gram, 2-gram, 3-gram and 4-gram precisions in the BLEU formula, i.e. precision[n] = BLEU-n.
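If that guess were right, each individual n-gram precision could be read off directly. Here is a minimal sketch using NLTK's sentence_bleu (assuming NLTK is installed; the example sentences are invented), where putting all the weight on a single n isolates that n-gram precision:

from nltk.translate.bleu_score import sentence_bleu

reference = [["the", "cat", "is", "on", "the", "mat"]]  # list of tokenized reference sentences
candidate = ["the", "cat", "sat", "on", "the", "mat"]   # tokenized candidate translation

# All weight on n=1 yields the individual 1-gram precision (here 5/6);
# all weight on n=2 yields the individual 2-gram precision (here 3/5).
print(sentence_bleu(reference, candidate, weights=(1, 0, 0, 0)))
print(sentence_bleu(reference, candidate, weights=(0, 1, 0, 0)))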
Actually, the BLEU-n values in your question don't use the n-gram scores only: BLEU-n computes the 1-gram through n-gram precisions and combines them with equal weights into a single cumulative score. See the "Cumulative N-Gram Scores" section at this link for more info.
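To make that concrete: the final score is BP * exp(sum_n w_n * log(p_n)), a weighted geometric mean of the n-gram precisions p_n, which is the "exponential average" mentioned in the question. A minimal sketch of the cumulative scores with NLTK (assuming NLTK is installed; sentences invented; smoothing is needed for BLEU-4 here because the toy candidate shares no 4-gram with the reference):

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["the", "cat", "is", "on", "the", "mat"]]
candidate = ["the", "cat", "sat", "on", "the", "mat"]
smooth = SmoothingFunction().method1  # adds a small epsilon to zero n-gram counts

print(sentence_bleu(reference, candidate, weights=(1, 0, 0, 0)))        # cumulative BLEU-1
print(sentence_bleu(reference, candidate, weights=(0.5, 0.5, 0, 0)))    # cumulative BLEU-2
print(sentence_bleu(reference, candidate, weights=(1/3, 1/3, 1/3, 0)))  # cumulative BLEU-3
print(sentence_bleu(reference, candidate,
                    weights=(0.25, 0.25, 0.25, 0.25),
                    smoothing_function=smooth))                          # cumulative BLEU-4

Cumulative BLEU-4 with equal weights of 0.25 is exactly what sentence_bleu computes by default, which is why it is usually reported as "the" BLEU score.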