Meaning: Bilingual Evaluation Understudy.

Measure: how closely a generated text matches one or more reference texts.

Useful: like ROUGE, when the goal is to compare a generated text against one or more reference texts. The two differ in orientation: ROUGE is recall-oriented (how much of the reference content is captured by the generated text), whereas BLEU is precision-oriented (how much of the generated content also appears in the reference), so including extraneous content lowers the score. BLEU is mainly used for translation tasks.

BLEU’s output is always a number between 0 and 1. It indicates how similar the candidate text is to the reference texts, with values closer to 1 representing greater similarity; a score of 1 means the candidate is identical to a reference.

BLEU is based on precision: it combines modified n-gram precisions (typically up to 4-grams) with a brevity penalty that discourages overly short candidates.

Intuition and Calculation

This is a good tutorial on how to calculate BLEU step by step (1): https://medium.com/nlplanet/two-minutes-nlp-learn-the-bleu-metric-by-examples-df015ca73a86
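
In short, BLEU multiplies a brevity penalty BP by the geometric mean of the modified n-gram precisions p_n (usually n = 1 to 4): BLEU = BP * exp((1/N) * sum_n log p_n), where BP = 1 if the candidate is longer than the effective reference length r, and exp(1 - r/c) otherwise (c is the candidate length). The following is a minimal from-scratch sketch of that computation; the function names and toy sentences are illustrative only, and it omits the smoothing options that real implementations offer:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, references, max_n=4):
    """Simplified BLEU: geometric mean of modified n-gram precisions
    (n = 1..max_n) multiplied by a brevity penalty."""
    cand = candidate.split()
    refs = [ref.split() for ref in references]

    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        # Clip each candidate n-gram count by the maximum count seen
        # in any single reference ("modified" precision).
        max_ref_counts = Counter()
        for ref in refs:
            for gram, count in Counter(ngrams(ref, n)).items():
                max_ref_counts[gram] = max(max_ref_counts[gram], count)
        clipped = sum(min(count, max_ref_counts[gram]) for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        precisions.append(clipped / total if total else 0.0)

    # Without smoothing, any zero precision makes the geometric mean zero.
    if min(precisions) == 0:
        return 0.0

    # Brevity penalty: penalise candidates shorter than the closest reference.
    c = len(cand)
    r = min((abs(len(ref) - c), len(ref)) for ref in refs)[1]
    bp = 1.0 if c > r else math.exp(1 - r / c)

    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

print(bleu("the cat is on the mat", ["the cat is on the mat"]))        # 1.0: identical to the reference
print(bleu("the the the the the the the", ["the cat is on the mat"]))  # 0.0: no bigram matches, so the geometric mean collapses
```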

Observations

This metric has several known limitations (2):

  • BLEU compares overlap in tokens from the predictions and references, instead of comparing meaning. This can lead to discrepancies between BLEU scores and human ratings.
  • Shorter predicted translations achieve higher scores than longer ones, simply due to how the score is calculated. A brevity penalty is introduced to attempt to counteract this.
  • BLEU scores are not comparable across different datasets, nor are they comparable across different languages.
  • BLEU scores can vary greatly depending on which parameters are used to generate the scores, especially when different tokenization and normalization techniques are used. It is therefore not possible to compare BLEU scores generated using different parameters, or when these parameters are unknown. For more discussion around this topic, see the following issue.
  • We need to consider different tokenization and/or preprocessing steps depending on the language; the sketch after this list shows two implementations that make these choices explicit.
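
The last two points are the reason tools such as sacreBLEU exist: it fixes tokenization and the other scoring parameters so that scores are reproducible and comparable across papers. Below is a minimal sketch comparing the Hugging Face evaluate implementation (2) with sacreBLEU, assuming both packages are installed; the sentences are toy examples:

```python
import evaluate
import sacrebleu

predictions = ["the quick brown fox jumped over the lazy dog"]
# One list of references per prediction; with a single prediction and a
# single reference, the same structure also works for sacreBLEU below.
references = [["the quick brown fox jumps over the lazy dog"]]

# Hugging Face `evaluate` reports BLEU in the 0-1 range and lets you
# swap the tokenizer, which changes the score.
hf_bleu = evaluate.load("bleu")
print(hf_bleu.compute(predictions=predictions, references=references)["bleu"])

# sacreBLEU standardizes tokenization and reports the same metric on a
# 0-100 scale (it also applies smoothing by default).
print(sacrebleu.corpus_bleu(predictions, references).score)
```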

Thresholds

For this type of metric it is difficult to establish a threshold, but we can keep some reference values in mind:

  • The original BLEU paper (3) compares the BLEU scores of five different systems on the same 500-sentence corpus; these scores ranged from 0.0527 to 0.2571.

  • In the Attention Is All You Need paper (4), the Transformer architecture obtained a BLEU score of 0.284 on the WMT 2014 English-to-German translation task and 0.41 on the WMT 2014 English-to-French translation task.

References

  1. Chiusano, Fabio. 2022. Learn the BLEU metric by examples.
  2. Hugging Face Implementation.
  3. Papineni et al. 2002. Bleu: a Method for Automatic Evaluation of Machine Translation.
  4. Vaswani et al. 2017. Attention is all you need.