> Papineni, Kishore, et al. 'Bleu: A Method for Automatic Evaluation of Machine Translation'. _Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics_, Association for Computational Linguistics, 2002, pp. 311–18. _ACLWeb_, [https://doi.org/10.3115/1073083.1073135](https://doi.org/10.3115/1073083.1073135).

# Bleu: A Method for Automatic Evaluation of Machine Translation

The BLEU metric (Bilingual Evaluation Understudy) compares a candidate translation with one or more reference translations. More specifically, it compares n-grams of the candidate with the n-grams of the references and counts the number of matches, which are position-independent.

## Precision

$ p_n = \frac{N_\textrm{matches}}{N_\textrm{n-grams}} $

The score itself is the _precision_, ie, the number of n-grams in the candidate which occur in any reference translation, divided by the total number of n-grams in the candidate.

## Modified precision

$ p_n = \frac{N_\textrm{matches, clipped}}{N_\textrm{n-grams}} $

The match count of each n-gram in the candidate is clipped to its maximum count in any single reference. In the example below, the modified unigram precision is 2/7, since the word "the" appears at most twice in any reference, and the modified bigram precision is 0.

![Example 1](https://i.imgur.com/NNbO5vj.png)

Unigram matches tend to evaluate adequacy, while longer n-gram matches account for fluency.

## Text-wide modified precision

To compute text-wide modified precision, we sum the clipped n-gram matches over all candidate sentences and divide by the sum of the numbers of n-grams per candidate.

$ p_n = \frac{\displaystyle{\sum_{C \in \textrm{candidates}}} N_\textrm{matches, clipped}(C)}{\displaystyle{\sum_{C \in \textrm{candidates}}} N_\textrm{n-grams}(C)} $

## Combining precisions

BLEU combines the modified n-gram precisions with their geometric mean, ie, it exponentiates the average of their logarithms. This choice is motivated by the empirical, roughly exponential decay of n-gram precision with n-gram size.

## Sentence brevity penalty

Candidates that are too short can erroneously get very high precision scores, so a brevity penalty is applied: the penalty is 1 when the candidate's length is the same as any reference's length. The penalty is computed over the entire corpus so as not to punish deviations on short sentences too harshly. We call the reference sentence length closest to the candidate's the _best match length_.

$ \mathrm{BP} = \begin{cases} 1 & \text{if } c > r \\ e^{1-\frac{r}{c}} & \text{if } c \leq r \end{cases} $

where $r$ is the sum of the best match lengths over all candidate sentences and $c$ is the total length of the candidate corpus.

## BLEU score

$ \mathrm{BLEU} = \mathrm{BP} \cdot \exp \left( \sum_{n=1}^N w_n \log p_n \right) $

In practice, Papineni et al. used $N=4$ and uniform weights $w_n = 1/N$.
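
To make the clipping step concrete, here is a minimal Python sketch of modified n-gram precision. The helper names (`ngrams`, `clipped_matches`) and the assumption of pre-tokenized, lowercased sentences are mine, not the paper's; the final lines reproduce the paper's "the the the ..." example, where the modified unigram precision is 2/7.

```python
from collections import Counter

def ngrams(tokens, n):
    """All n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_matches(candidate, references, n):
    """Clipped n-gram matches and the total number of candidate n-grams."""
    cand_counts = Counter(ngrams(candidate, n))
    # Clip each candidate n-gram count to its maximum count in any single reference.
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    matches = sum(min(count, max_ref_counts[gram]) for gram, count in cand_counts.items())
    return matches, sum(cand_counts.values())

# The paper's degenerate example: "the" is clipped to its maximum
# reference count of 2, giving a modified unigram precision of 2/7.
candidate = "the the the the the the the".split()
references = ["the cat is on the mat".split(),
              "there is a cat on the mat".split()]
print(clipped_matches(candidate, references, 1))  # -> (2, 7)
```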
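
Building on the sketch above, the following function assembles the corpus-level score: text-wide modified precisions, the brevity penalty, and the weighted geometric mean with uniform weights. It is an illustrative sketch, not the reference implementation; in particular, the tie-breaking rule for the best match length (prefer the shorter reference on ties) and the early return when a precision is zero are my assumptions, and non-empty, pre-tokenized candidates are assumed.

```python
from math import exp, log

def corpus_bleu(candidates, references_per_candidate, max_n=4):
    """Corpus-level BLEU with uniform weights w_n = 1/max_n."""
    # Text-wide modified precisions: sum clipped matches and candidate
    # n-gram totals over all sentences before dividing.
    precisions = []
    for n in range(1, max_n + 1):
        matched = total = 0
        for cand, refs in zip(candidates, references_per_candidate):
            m, t = clipped_matches(cand, refs, n)
            matched += m
            total += t
        if matched == 0:
            return 0.0  # log(0) is undefined; the geometric mean collapses to 0
        precisions.append(matched / total)

    # Brevity penalty: c is the candidate corpus length, r sums the
    # best match (closest) reference length for each candidate sentence.
    c = sum(len(cand) for cand in candidates)
    r = 0
    for cand, refs in zip(candidates, references_per_candidate):
        ref_lens = [len(ref) for ref in refs]
        # Tie-break toward the shorter reference length (an assumption).
        r += min(ref_lens, key=lambda length: (abs(length - len(cand)), length))
    bp = 1.0 if c > r else exp(1 - r / c)

    # Geometric mean of the modified precisions, scaled by the penalty.
    return bp * exp(sum(log(p) for p in precisions) / max_n)
```

Summing lengths and clipped counts over the whole corpus before dividing is what makes the brevity penalty and precisions corpus-level quantities rather than averages of per-sentence scores.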