ROUGE Metrics
Meaning: Recall-Oriented Understudy for Gisting Evaluation.
Measure: how well a generated text compares to a reference output.
Useful: when the goal is to ensure that as much relevant information as possible from the reference summary is included in the generated summary, i.e. how much of the reference content is captured by the generated summary. This is important in contexts where missing critical information is more detrimental than including some extraneous information.
Intuition and Construction
Imagine you want to assess how a summary (or a generated text) compares to a reference text.
REFERENCE TEXT: I love to eat ice cream
GENERATED TEXT: I love to eat
First, we would like to see whether key words are included, for which we need to identify the words in each text.
REFERENCE TEXT WORDS: [I], [love], [to], [eat], [ice], [cream]
GENERATED TEXT WORDS: [I], [love], [to], [eat]
Then, we will compute precision and recall scores:
recall = (# overlapping words)/(# words in the reference) = 4/6 = 0.6667
precision = (# overlapping words)/(# words in the candidate) = 4/4 = 1
And finally combine them into an overall score (the F1 score), also known as the ROUGE-1 score:
ROUGE-1 = (2 * precision * recall)/(precision + recall) = (2 * 1 * 0.6667)/(1 + 0.6667) = 0.8
It is called ROUGE-1, because we are comparing words, which are also known as 1-grams (unigrams).
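To make the arithmetic concrete, here is a minimal sketch of the unigram computation above, assuming plain whitespace tokenization and lower-casing (no stemming):

```python
from collections import Counter

reference = "I love to eat ice cream".lower().split()
generated = "I love to eat".lower().split()

# Overlap counts are clipped by how often each word appears in the reference
overlap = sum((Counter(reference) & Counter(generated)).values())

recall = overlap / len(reference)                    # 4/6 = 0.6667
precision = overlap / len(generated)                 # 4/4 = 1.0
f1 = 2 * precision * recall / (precision + recall)   # ~0.8, the ROUGE-1 score
print(round(recall, 4), round(precision, 4), round(f1, 4))
```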
If, instead of comparing single words, we compare every pair of two consecutive words, also known as 2-grams (bigrams), then it is called ROUGE-2.
REFERENCE TEXT BIGRAMS: [I love], [love to], [to eat], [eat ice], [ice cream]
GENERATED TEXT BIGRAMS: [I love], [love to], [to eat]
recall = (# overlapping bigrams)/(# bigrams in the reference) = 3/5 = 0.6
precision = (# overlapping bigrams)/(# bigrams in the candidate) = 3/3 = 1
ROUGE-2 = (2 * precision * recall)/(precision + recall) = (2 * 1 * 0.6)/(1 + 0.6) = 0.75
And if we compare N-grams following the same logic, then it is called the ROUGE-N Score.
Calculation of ROUGE-N
- Get the N-grams of the reference and the generated answer.
- Calculate the recall and precision:
- recall = (overlapping number of N-grams) / (number of N-grams in the reference)
- precision = (overlapping number of N-grams) / (number of N-grams in the candidate)
- Calculate the F1 score; this is the ROUGE-N (a from-scratch sketch follows this list):
- F1 = (2 * precision * recall) / (precision + recall)
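Following exactly these steps, a from-scratch sketch of ROUGE-N could look like this (simple whitespace tokenization is assumed; `ngrams` and `rouge_n` are illustrative helper names, not part of any library):

```python
from collections import Counter

def ngrams(tokens, n):
    """All n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def rouge_n(reference, candidate, n=1):
    """ROUGE-N F1 between two strings, following the steps listed above."""
    ref_counts = Counter(ngrams(reference.lower().split(), n))
    cand_counts = Counter(ngrams(candidate.lower().split(), n))
    overlap = sum((ref_counts & cand_counts).values())  # clipped overlap count
    if overlap == 0:
        return 0.0
    recall = overlap / sum(ref_counts.values())
    precision = overlap / sum(cand_counts.values())
    return 2 * precision * recall / (precision + recall)

print(round(rouge_n("I love to eat ice cream", "I love to eat", n=1), 2))  # 0.8
print(round(rouge_n("I love to eat ice cream", "I love to eat", n=2), 2))  # 0.75
```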
You can calculate the ROUGE metrics in Python with the rouge-score library:

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=False)
scores = scorer.score(reference, gen_answer)  # reference (target) first, then the generated answer
```
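As a complete, runnable illustration (using the example sentences from above as `reference` and `gen_answer`), each entry of the returned dict is a named tuple with `precision`, `recall` and `fmeasure` fields:

```python
from rouge_score import rouge_scorer

reference = "I love to eat ice cream"
gen_answer = "I love to eat"

scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=False)
scores = scorer.score(reference, gen_answer)

for name, score in scores.items():
    print(f"{name}: P={score.precision:.4f} R={score.recall:.4f} F1={score.fmeasure:.4f}")
# rouge1 F1 ~ 0.80 and rouge2 F1 ~ 0.75, matching the manual calculation above
```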
Variants
ROUGE-1: measures the overlap of unigrams (individual words) between the generated summary and the reference summary. Higher scores indicate better coverage of key concepts.
ROUGE-2: measures the overlap of bigrams (two consecutive words). This score is more stringent and better reflects the fluency and coherence of the generated summary.
ROUGE-L: measures the longest common subsequence, which takes into account sentence-level structure and order.
ROUGE-L Sum:
- is a variant of ROUGE-L specifically designed to handle multi-document summarization; it evaluates the quality of summaries generated from multiple source documents.
- instead of calculating the LCS for individual sentences or documents, ROUGE-L Sum aggregates the LCS values across all reference summaries and the generated summary.
- provides a more comprehensive measure of summary quality by considering the collective content coverage from multiple sources.
ROUGE-W:
- introduces a weighted variant of the Longest Common Subsequence (LCS) measure.
- assigns higher weights to longer LCS matches, thereby emphasizing longer contiguous sequences of words that are shared between the generated summary and the reference summary.
- useful in tasks where capturing longer sequences of words is crucial for summary coherence and informativeness, such as in technical document summarization or summarizing scientific articles.
ROUGE-S:
- introduces the concept of skip-bigrams, which allows for gaps (skips) between words in the matching sequences. This is in contrast to standard n-grams (like unigrams and bigrams), which strictly require consecutive words.
- less sensitive to minor syntactic differences between the generated summary and the reference summary. It can better capture semantic similarity and paraphrasing in the generated summary.
- provides a measure of how well the generated summary captures the essential content and structure of the reference summary.
- useful in tasks where the generated summaries may vary in word order or use paraphrases to convey the same meaning as the reference, such as in summarizing opinionated texts or reviews.
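To make the LCS-based variants more tangible, here is a from-scratch sketch of the plain (unweighted) ROUGE-L F1, with the same whitespace tokenization as before; `lcs_length` and `rouge_l` are illustrative names, not library functions:

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists (classic DP table)."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l(reference, candidate):
    """Sentence-level ROUGE-L F1: LCS length scaled by reference and candidate lengths."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    recall = lcs / len(ref)
    precision = lcs / len(cand)
    return 2 * precision * recall / (precision + recall)

# LCS of the two sentences is "ice cream" (length 2), so F1 = 2*(2/6)*(2/4) / (2/6 + 2/4) = 0.4
print(round(rouge_l("I love to eat ice cream", "Ice cream is food"), 2))
```

ROUGE-W would additionally weight longer contiguous matches, and ROUGE-S would count skip-bigrams instead of consecutive ones; both follow the same precision/recall/F1 pattern.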
(Many) Observations
- ROUGE is case insensitive, meaning that upper case letters are treated the same way as lower case letters.
- ROUGE favors extractive summarization methods (which directly copy parts of the source text) over abstractive summarization methods (which generate new sentences), as it measures n-gram overlap.
- It is impossible to obtain 100% ROUGE scores unless we compare the exact same text. In fact, there is no definitive upper bound for ROUGE scores, making it very difficult to determine the quality of a summary by using this metric alone (4).
- Different domains (e.g., news, scientific papers, medical texts) might have different benchmarks. It’s essential to look at existing literature and benchmarks within the specific domain to set appropriate thresholds. See Table 1 of (4) as an example, where news summaries achieved lower scores than court documents.
- When comparing different summarization models, the relative ROUGE scores are more critical than absolute thresholds.
- Consistently higher scores across multiple ROUGE metrics indicate a better-performing model.
- While higher scores generally indicate better performance, the interpretation of ROUGE metrics should consider their limitations, such as sensitivity to sentence structure and word order.
- Due to the nature of different languages, you may sometimes need to implement additional pre-processing. (7)
- ROUGE-L is the variant that may make the most sense for evaluating RAG. (8)
Now, let’s take a look at the following table, where each candidate is scored against the reference text “I love to eat ice cream”:
Generated Text | ROUGE-1 | ROUGE-2 |
---|---|---|
Ice cream is my favorite food ever | 0.31 | 0.18 |
Ice cream is my favorite food | 0.33 | 0.20 |
Ice cream is food | 0.40 | 0.25 |
Ice cream | 0.50 | 0.33 |
Ice | 0.29 | 0.00 |
I hate to eat ice cream | 0.83 | 0.60 |
I hate to eat | 0.60 | 0.25 |
I love to eat | 0,80 | 0,75 |
- “I hate to eat ice cream” scores the highest ROUGE-1, indicating that it captures the largest share of the reference sentence’s wording and focuses on the core concept, even though its meaning is the complete opposite.
- As the candidate summaries become longer and introduce more words not present in the reference, the ROUGE scores generally decrease, indicating a lower degree of overlap and relevance.
- The scores suggest that brevity combined with key term retention (e.g., “Ice cream”) results in higher ROUGE scores, especially when the reference sentence is short and focused.
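The table above can be reproduced with rouge-score, whose default tokenizer lower-cases the text, matching the case-insensitivity noted earlier; a sketch:

```python
from rouge_score import rouge_scorer

reference = "I love to eat ice cream"
candidates = [
    "Ice cream is my favorite food ever",
    "Ice cream is my favorite food",
    "Ice cream is food",
    "Ice cream",
    "Ice",
    "I hate to eat ice cream",
    "I hate to eat",
    "I love to eat",
]

scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2'], use_stemmer=False)
for candidate in candidates:
    scores = scorer.score(reference, candidate)
    print(f"{candidate!r}: ROUGE-1={scores['rouge1'].fmeasure:.2f}, "
          f"ROUGE-2={scores['rouge2'].fmeasure:.2f}")
```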
Thresholds
In (4), we can observe optimized ROUGE scores for three different datasets: duc04 (news), wiki, and echr (judgment summaries from the European Court of Human Rights). Although ROUGE metrics correlate with human evaluation (5) (6), establishing a threshold depends on whether we want a more abstractive or a more extractive summary, which in turn usually depends on the type of content we are dealing with (4).
However, as a general guidance, if we take the Document Understanding Conference (DUC) Results 2004 (3) and the two aforementioned papers as a baseline, we could consider the following as good:
- ROUGE-1: ~0.4
- ROUGE-2: ~0.2
- ROUGE-L: ~0.3
For the German language, see (7).
WARNING
Different domains (e.g., news, scientific papers, medical texts) might have different benchmarks. It’s essential to look at existing literature and benchmarks within the specific domain to set appropriate thresholds. (4)
References
1. Hugging Face Implementation.
2. Google Implementation.
3. Text Summarization on DUC 2004.
4. Schluter, Natalie. 2017. The Limits of Automatic Summarisation According to ROUGE.
5. Lin, Chin-Yew. 2004. ROUGE: A Package for Automatic Evaluation of Summaries.
6. Liu, Feifan. 2008. Correlation between ROUGE and Human Evaluation of Extractive Meeting Summaries.
7. Frefel, Dominik. 2020. Summarization Corpora of Wikipedia Articles.
8. Lewis, Patrick. 2021. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.
9. Rush, Alexander. 2015. A Neural Attention Model for Abstractive Sentence Summarization.
10. See, Abigail. 2017. Get To The Point: Summarization with Pointer-Generator Networks.