> Lin, Chin-Yew. 'ROUGE: A Package for Automatic Evaluation of Summaries'. _Text Summarization Branches Out_, Association for Computational Linguistics, 2004, pp. 74–81. _ACLWeb_, [https://www.aclweb.org/anthology/W04-1013](https://www.aclweb.org/anthology/W04-1013).

# ROUGE: A Package for Automatic Evaluation of Summaries

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) measures the quality of a candidate summary by comparing it to reference summaries.

## ROUGE-N: n-gram co-occurrence statistics

Given a candidate summary $C$ and a reference summary $R$ composed of multiple sentences $s$, we have:

$
\textrm{ROUGE-N}(C,R) = \frac{\displaystyle{\sum_{s \in R} \sum_{\textrm{gram}_n \in s} N_\textrm{matches}(\textrm{gram}_n)}}{\displaystyle{\sum_{s \in R} \sum_{\textrm{gram}_n \in s} N(\textrm{gram}_n)}}
$

Given multiple references $R_i$, we compute:

$
\textrm{ROUGE-N}_\textrm{multi}(C) = \max_i \, \textrm{ROUGE-N}(C,R_i)
$

## ROUGE-L: longest common subsequence

A sequence $Z = [z_1, \dots, z_m]$ is a subsequence of $X = [x_1, \dots, x_n]$ if there exists a strictly increasing sequence of indices $[i_1, \dots, i_m]$ of $X$ such that $\forall j \in \{1, \dots, m\}, x_{i_j} = z_j$.

### Sentence-level LCS

We use an LCS-based F-measure to estimate the similarity between a reference sentence $r$ of length $m$ and a candidate sentence $c$ of length $n$:

$
R_\textrm{LCS} = \frac{\mathrm{LCS}(r, c)}{m}
$

$
P_\textrm{LCS} = \frac{\mathrm{LCS}(r, c)}{n}
$

$
\textrm{ROUGE-L} = F_\textrm{LCS} = \frac{(1+\beta^2) \, R_\textrm{LCS} \, P_\textrm{LCS}}{R_\textrm{LCS} + \beta^2 P_\textrm{LCS}}
$

where $\beta = P_\textrm{LCS}/R_\textrm{LCS}$.

ROUGE-L does not require consecutive matches, only in-sequence matches that reflect sentence-level word order, and it automatically includes the longest in-sequence common n-grams, so no predefined n-gram length is required. However, only one longest common subsequence is counted: alternative LCSes and shorter subsequences are not reflected in the score.
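The two metrics above can be sketched in Python as follows. This is a minimal illustration, not the official ROUGE package: function names are mine, only a single reference is handled, and $\beta$ is left as a plain parameter (defaulting to 1) rather than set to $P_\textrm{LCS}/R_\textrm{LCS}$.

```python
from collections import Counter

def rouge_n_recall(candidate, reference, n=2):
    """ROUGE-N against a single reference: matched n-grams over reference n-grams."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand, ref = ngrams(candidate), ngrams(reference)
    matches = sum(min(count, cand[g]) for g, count in ref.items())  # clipped counts
    total = sum(ref.values())
    return matches / total if total else 0.0

def lcs_length(x, y):
    """Length of the longest common subsequence via the standard O(mn) DP."""
    m, n = len(x), len(y)
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[m][n]

def rouge_l(candidate, reference, beta=1.0):
    """Sentence-level ROUGE-L F-measure with a fixed beta."""
    lcs = lcs_length(reference, candidate)
    r = lcs / len(reference)
    p = lcs / len(candidate)
    if r == 0 or p == 0:
        return 0.0
    return (1 + beta**2) * r * p / (r + beta**2 * p)
```

For the reference "police killed the gunman" and the candidate "police kill the gunman", the LCS is "police the gunman" (length 3), giving $R_\textrm{LCS} = P_\textrm{LCS} = 3/4$.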
### Summary-level LCS

Given a reference summary of $u$ sentences $r_1, \dots, r_u$ with a total of $m$ words and a candidate summary $C$ of $v$ sentences with a total of $n$ words, we define:

$
R_\textrm{LCS} = \frac{\displaystyle{\sum_{i=1}^{u}\textrm{LCS}(r_{i},C)}}{m}
$

$
P_\textrm{LCS} = \frac{\displaystyle{\sum_{i=1}^{u}\textrm{LCS}(r_{i},C)}}{n}
$

$
\textrm{ROUGE-L} = F_\textrm{LCS} = \frac{(1+\beta^2) \, R_\textrm{LCS} \, P_\textrm{LCS}}{R_\textrm{LCS} + \beta^2 P_\textrm{LCS}}
$

where $\textrm{LCS}(r_i, C)$ is the union LCS score of reference sentence $r_i$ and candidate summary $C$.

## ROUGE-W: weighted longest common subsequence

ROUGE-W improves on LCS by remembering, at each position of a regular 2D dynamic programming table, the length of the consecutive matches encountered so far, so that candidates with more consecutive matches are rated higher than other candidates. By providing different weighting functions $f$, we can parameterize the WLCS algorithm. $f$ must satisfy

$
\forall (x,y) \in \mathbb{N}^2, f(x+y) > f(x) + f(y)
$

e.g. $f(k) = \alpha k - \beta$ with $k \in \mathbb{N}$ and $\alpha, \beta > 0$. We then have:

$
R_\textrm{WLCS} = f^{-1}\left( \frac{\textrm{WLCS}(R,C)}{f(m)} \right)
$

$
P_\textrm{WLCS} = f^{-1}\left( \frac{\textrm{WLCS}(R,C)}{f(n)} \right)
$

$
\textrm{ROUGE-W} = F_\textrm{WLCS} = \frac{(1+\beta^2) \, R_\textrm{WLCS} \, P_\textrm{WLCS}}{R_\textrm{WLCS} + \beta^2 P_\textrm{WLCS}}
$

## ROUGE-S: skip-bigram co-occurrence statistics

A skip-bigram is any pair of words in their sentence order, allowing for arbitrary gaps.

$
R_\textrm{SKIP2} = \frac{\textrm{SKIP2}(R,C)}{\binom{m}{2}}
$

$
P_\textrm{SKIP2} = \frac{\textrm{SKIP2}(R,C)}{\binom{n}{2}}
$

$
\textrm{ROUGE-S} = F_\textrm{SKIP2} = \frac{(1+\beta^2) \, R_\textrm{SKIP2} \, P_\textrm{SKIP2}}{R_\textrm{SKIP2} + \beta^2 P_\textrm{SKIP2}}
$

Contrary to BLEU, skip-bigram does not require consecutive matches but is still sensitive to word order. Compared with LCS, skip-bigram counts all in-order matching word pairs while LCS only counts one longest common subsequence.
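The consecutive-match bookkeeping behind WLCS can be sketched as follows. This is a minimal illustration with names of my choosing, assuming the weighting function $f(k) = k^2$ (one valid choice; its inverse is the square root); extending a run of $k$ consecutive matches adds $f(k+1) - f(k)$ to the score.

```python
import math

def wlcs(x, y, f=lambda k: k * k):
    """Weighted LCS: alongside the usual score table c, a second table w tracks
    the length of the consecutive-match run ending at each cell, so extending a
    run of length k contributes f(k + 1) - f(k) instead of a flat 1."""
    m, n = len(x), len(y)
    c = [[0.0] * (n + 1) for _ in range(m + 1)]
    w = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                k = w[i - 1][j - 1]  # run length of consecutive matches so far
                c[i][j] = c[i - 1][j - 1] + f(k + 1) - f(k)
                w[i][j] = k + 1
            else:
                c[i][j] = max(c[i - 1][j], c[i][j - 1])
                w[i][j] = 0  # any mismatch resets the run
    return c[m][n]

def rouge_w(candidate, reference, beta=1.0):
    """ROUGE-W F-measure with f(k) = k^2, so f^{-1} is the square root."""
    score = wlcs(reference, candidate)
    r = math.sqrt(score / len(reference) ** 2)
    p = math.sqrt(score / len(candidate) ** 2)
    if r == 0 or p == 0:
        return 0.0
    return (1 + beta**2) * r * p / (r + beta**2 * p)
```

With reference `abcdefg`, the candidate `abcdhik` (one consecutive run of 4 matches) scores $f(4) = 16$, while `ahbkcid` (four isolated matches) scores only $4 f(1) = 4$, even though both share an LCS of length 4.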
To reduce spurious matches such as "the the", we can limit the maximum skip distance $d_\textrm{SKIP2}$ between the two in-order words. If $d_\textrm{SKIP2} = 0$, then ROUGE-S is equivalent to ROUGE-2.

## ROUGE-SU: extension of ROUGE-S

One problem with ROUGE-S is that it gives no credit to a candidate sentence that has no overlapping word pair with the reference, even when it shares common words. For instance, the candidate "gunman the killed police", the exact reverse of the reference "police killed the gunman", would receive a score of 0. We therefore extend ROUGE-S by adding unigrams as a counting unit; this version is called ROUGE-SU. ROUGE-SU can be obtained from ROUGE-S by adding a begin-of-sentence marker to both candidate and reference sentences.
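To make the skip-distance limit and the begin-of-sentence trick concrete, here is a minimal sketch (not the official ROUGE package): the `"<s>"` marker token and function names are mine, $\beta$ is fixed to 1, and the denominators are taken as the total skip-bigram counts, which equal $\binom{m}{2}$ and $\binom{n}{2}$ when the skip distance is unrestricted.

```python
from collections import Counter
from itertools import combinations

def skip_bigrams(tokens, d_skip=None):
    """All in-order word pairs; d_skip bounds the number of words skipped
    between the two members, so d_skip=0 keeps only adjacent pairs (bigrams)."""
    pairs = Counter()
    for i, j in combinations(range(len(tokens)), 2):
        if d_skip is None or j - i - 1 <= d_skip:
            pairs[(tokens[i], tokens[j])] += 1
    return pairs

def rouge_su(candidate, reference, d_skip=None, beta=1.0):
    """ROUGE-SU: skip-bigram overlap after prepending a begin-of-sentence
    marker, which grants unigram credit through the (marker, word) pairs."""
    cand = ["<s>"] + candidate
    ref = ["<s>"] + reference
    cb, rb = skip_bigrams(cand, d_skip), skip_bigrams(ref, d_skip)
    matches = sum((cb & rb).values())  # Counter intersection clips the counts
    r = matches / sum(rb.values())
    p = matches / sum(cb.values())
    if r == 0 or p == 0:
        return 0.0
    return (1 + beta**2) * r * p / (r + beta**2 * p)
```

On the reversed example above, plain skip-bigram overlap is zero, but ROUGE-SU still credits the four shared unigrams through their pairs with the marker, yielding 4 matches out of $\binom{5}{2} = 10$ pairs on each side.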