> Lin, Chin-Yew. 'ROUGE: A Package for Automatic Evaluation of Summaries'. _Text Summarization Branches Out_, Association for Computational Linguistics, 2004, pp. 74–81. _ACLWeb_, [https://www.aclweb.org/anthology/W04-1013](https://www.aclweb.org/anthology/W04-1013).
# ROUGE: A Package for Automatic Evaluation of Summaries
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) measures the quality of a candidate summary by comparing it to reference summaries.
## ROUGE-N: n-gram co-occurrence statistics
Given a candidate summary $C$ and a reference summary $R$ composed of multiple sentences $s$, we have:
$
\textrm{ROUGE-N}(C,R) = \frac{\displaystyle{\sum_{s \in \textrm{R}} \sum_{\textrm{gram}_n \in s} N_\textrm{matches}(\textrm{gram}_n)}}{\displaystyle{\sum_{s \in \textrm{R}} \sum_{\textrm{gram}_n \in s} N(\textrm{gram}_n)}}
$
Given multiple references $R_i$, we compute:
$
\textrm{ROUGE-N}_\textrm{multi}(C) = \max_i \, \textrm{ROUGE-N}(C,R_i)
$
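As a concrete sketch of these two definitions, here is a minimal Python version assuming whitespace tokenization (function names are mine; this is not the official ROUGE-1.5.5 script):

```python
from collections import Counter

def ngrams(tokens, n):
    """All n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def rouge_n(candidate, reference, n=2):
    """ROUGE-N recall: clipped n-gram matches over the reference n-gram count."""
    cand = Counter(ngrams(candidate.split(), n))
    ref = Counter(ngrams(reference.split(), n))
    # A candidate n-gram cannot match more often than it occurs in the reference.
    matches = sum(min(count, cand[gram]) for gram, count in ref.items())
    total = sum(ref.values())
    return matches / total if total else 0.0

def rouge_n_multi(candidate, references, n=2):
    """Multiple references: keep the best pairwise score."""
    return max(rouge_n(candidate, ref, n) for ref in references)
```

Note that the denominator counts reference n-grams, which makes ROUGE-N a recall-oriented measure.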
## ROUGE-L: longest common subsequence
A sequence $Z = [z_1, \dots, z_m]$ is a subsequence of $X = [x_1, \dots, x_n]$ if there exists a strictly increasing sequence of indices $[i_1, \dots, i_m]$ of $X$ such that $\forall j \in \{1, \dots, m\}, x_{i_j} = z_j$.
### Sentence-level LCS
We use an LCS-based F-measure to estimate the similarity between a reference sentence $r$ of length $m$ and a candidate sentence $c$ of length $n$.
$
R_\textrm{LCS} = \frac{\mathrm{LCS}(r, c)}{m}
$
$
P_\textrm{LCS} = \frac{\mathrm{LCS}(r, c)}{n}
$
$
\textrm{ROUGE-L} = F_\textrm{LCS} = \frac{(1+\beta^2) \, R_\textrm{LCS} \, P_\textrm{LCS}}{R_\textrm{LCS} + \beta^2 P_\textrm{LCS}}
$
where $\beta = P_\textrm{LCS}/R_\textrm{LCS}$. In DUC evaluations, $\beta$ is set to a very large number, so that only $R_\textrm{LCS}$ is considered.
ROUGE-L does not require consecutive matches, only in-sequence matches that reflect sentence-level word order, and it automatically includes the longest in-sequence common n-grams, so no predefined n-gram length is required. However, it counts only one LCS: alternative LCSes of the same length and shorter common subsequences are not reflected in the score.
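A minimal sketch of sentence-level ROUGE-L, again assuming whitespace tokenization (names are mine; $\beta$ defaults to 1 here, i.e. the balanced F1, rather than the paper's $P_\textrm{LCS}/R_\textrm{LCS}$):

```python
def lcs_length(x, y):
    """Length of the longest common subsequence, via the standard O(mn) DP."""
    m, n = len(x), len(y)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]

def rouge_l(candidate, reference, beta=1.0):
    """Sentence-level ROUGE-L F-measure (beta=1 gives balanced F1)."""
    c, r = candidate.split(), reference.split()
    lcs = lcs_length(r, c)
    if lcs == 0:
        return 0.0
    recall, precision = lcs / len(r), lcs / len(c)
    return (1 + beta ** 2) * recall * precision / (recall + beta ** 2 * precision)
```

For the reference "police killed the gunman", the candidate "police kill the gunman" (LCS of length 3) scores higher than "the gunman killed police" (LCS "the gunman", length 2), illustrating the sensitivity to word order.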
### Summary-level LCS
Given a reference summary of $u$ sentences $r_1, \dots, r_u$ with a total of $m$ words, and a candidate summary $C$ of $v$ sentences with a total of $n$ words, we define (with $\textrm{LCS}(r_i, C)$ the union LCS: the number of words in the union of the longest common subsequences between $r_i$ and each sentence of $C$):
$
R_\textrm{LCS} = \frac{\displaystyle{\sum_{i=1}^{u}\textrm{LCS}(r_{i},C)}}{m}
$
$
P_\textrm{LCS} = \frac{\displaystyle{\sum_{i=1}^{u}\textrm{LCS}(r_{i},C)}}{n}
$
$
\textrm{ROUGE-L} = F_\textrm{LCS} = \frac{(1+\beta^2) \, R_\textrm{LCS} \, P_\textrm{LCS}}{R_\textrm{LCS} + \beta^2 P_\textrm{LCS}}
$
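A sketch of the union LCS and the summary-level score (names mine, whitespace tokenization assumed). With the paper's example, reference sentence "w1 w2 w3 w4 w5" against candidate sentences "w1 w2 w6 w7 w8" and "w1 w3 w8 w9 w5", the individual LCSes are "w1 w2" and "w1 w3 w5", whose union covers 4 of the 5 reference words:

```python
def lcs_indices(x, y):
    """Indices of x belonging to one LCS of x and y (DP table + backtrack)."""
    m, n = len(x), len(y)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    picked, i, j = set(), m, n
    while i > 0 and j > 0:
        if x[i - 1] == y[j - 1]:
            picked.add(i - 1)
            i, j = i - 1, j - 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return picked

def union_lcs(r, candidate_sentences):
    """Words of r covered by the union of its LCSes with each candidate sentence."""
    covered = set()
    for c in candidate_sentences:
        covered |= lcs_indices(r, c)
    return len(covered)

def rouge_l_summary(candidate_sents, reference_sents, beta=1.0):
    """Summary-level ROUGE-L over whitespace-tokenized sentences."""
    cands = [s.split() for s in candidate_sents]
    refs = [s.split() for s in reference_sents]
    m = sum(len(r) for r in refs)
    n = sum(len(c) for c in cands)
    total = sum(union_lcs(r, cands) for r in refs)
    if total == 0:
        return 0.0
    recall, precision = total / m, total / n
    return (1 + beta ** 2) * recall * precision / (recall + beta ** 2 * precision)
```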
## ROUGE-W: weighted longest common subsequence
ROUGE-W improves on LCS by remembering the length of the current run of consecutive matches in a second 2D dynamic-programming table, so that candidates with longer consecutive matches are rated higher than candidates with the same LCS length but scattered matches.
By providing different weighting functions $f$, we can parameterize the WLCS algorithm. $f$ must satisfy
$
\forall (x,y) \in \mathbb{N}^2, f(x+y) > f(x) + f(y)
$
e.g., $f(k) = \alpha k - \beta$ with $k \in \mathbb{N}$ and $\alpha, \beta > 0$. In practice, $f$ should also have a closed-form inverse, e.g. $f(k) = k^2$ with $f^{-1}(x) = \sqrt{x}$.
We then have:
$
R_\textrm{WLCS} = f^{-1}\left( \frac{\textrm{WLCS}(R,C)}{f(m)} \right)
$
$
P_\textrm{WLCS} = f^{-1}\left( \frac{\textrm{WLCS}(R,C)}{f(n)} \right)
$
$
\textrm{ROUGE-W} = F_\textrm{WLCS} = \frac{(1+\beta^2) \, R_\textrm{WLCS} \, P_\textrm{WLCS}}{R_\textrm{WLCS} + \beta^2 P_\textrm{WLCS}}
$
## ROUGE-S: skip-bigram co-occurrence statistics
A skip-bigram is any pair of words in their sentence order, allowing for arbitrary gaps; a sentence of $n$ words thus has $\binom{n}{2}$ skip-bigrams.
$
R_\textrm{SKIP2} = \frac{\textrm{SKIP2}(R,C)}{\binom{m}{2}}
$
$
P_\textrm{SKIP2} = \frac{\textrm{SKIP2}(R,C)}{\binom{n}{2}}
$
$
\textrm{ROUGE-S} = F_\textrm{SKIP2} = \frac{(1+\beta^2) \, R_\textrm{SKIP2} \, P_\textrm{SKIP2}}{R_\textrm{SKIP2} + \beta^2 P_\textrm{SKIP2}}
$
Contrary to BLEU, skip-bigram does not require consecutive matches but is still sensitive to word order.
Compared with LCS, skip-bigram counts all in-order matching word pairs while LCS only counts one longest common subsequence.
To reduce spurious matches such as "the the", we can limit the maximum skip distance $d_\textrm{SKIP2}$ between two in-order words. If $d_\textrm{SKIP2} = 0$, then ROUGE-S is equivalent to ROUGE-2.
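A sketch of ROUGE-S, including the optional skip-distance cap (names and the `max_skip` gap convention are mine; with a cap, the denominators shrink to the number of within-distance pairs rather than $\binom{m}{2}$ and $\binom{n}{2}$):

```python
from collections import Counter
from itertools import combinations

def skip_bigrams(tokens, max_skip=None):
    """Count in-order word pairs; max_skip caps the number of words between them."""
    pairs = Counter()
    for i, j in combinations(range(len(tokens)), 2):
        if max_skip is None or j - i - 1 <= max_skip:
            pairs[(tokens[i], tokens[j])] += 1
    return pairs

def rouge_s(candidate, reference, beta=1.0, max_skip=None):
    """ROUGE-S F-measure; unrestricted, the denominators are C(m,2) and C(n,2)."""
    cand = skip_bigrams(candidate.split(), max_skip)
    ref = skip_bigrams(reference.split(), max_skip)
    matches = sum(min(count, cand[pair]) for pair, count in ref.items())
    if matches == 0:
        return 0.0
    recall = matches / sum(ref.values())
    precision = matches / sum(cand.values())
    return (1 + beta ** 2) * recall * precision / (recall + beta ** 2 * precision)
```

With `max_skip=0` only adjacent pairs survive, recovering bigram (ROUGE-2-style) overlap as noted above.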
## ROUGE-SU: extension of ROUGE-S
One problem of ROUGE-S is that it does not give any credit to a candidate sentence if the sentence does not have any overlapping word pair, even though it has common words. For instance, the candidate "gunman the killed police" would have a score of 0 with the reference "police killed the gunman", being its exact reverse.
We therefore extend ROUGE-S by also counting unigrams as a unit; this version is called ROUGE-SU. ROUGE-SU can be obtained from ROUGE-S by adding a begin-of-sentence marker to both candidate and reference sentences: every unigram then forms a skip-bigram with the marker.
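A self-contained sketch of this marker trick (names and the `<s>` marker string are mine): the reversed candidate above now gets partial credit, since each of its four shared words matches a marker pair.

```python
from collections import Counter
from itertools import combinations

def rouge_su(candidate, reference, beta=1.0, marker="<s>"):
    """ROUGE-SU: skip-bigram F-measure after prefixing a begin-of-sentence
    marker, so that every unigram also forms a pair with the marker."""
    def pairs(tokens):
        return Counter((tokens[i], tokens[j])
                       for i, j in combinations(range(len(tokens)), 2))
    cand = pairs([marker] + candidate.split())
    ref = pairs([marker] + reference.split())
    matches = sum(min(count, cand[p]) for p, count in ref.items())
    if matches == 0:
        return 0.0
    recall = matches / sum(ref.values())
    precision = matches / sum(cand.values())
    return (1 + beta ** 2) * recall * precision / (recall + beta ** 2 * precision)
```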