> Lau, Jey Han, et al. 'Deep-Speare: A Joint Neural Model of Poetic Language, Meter and Rhyme'. ArXiv:1807.03491 [Cs], July 2018. arXiv.org, <http://arxiv.org/abs/1807.03491>.
# Deep-speare: A Joint Neural Model of Poetic Language, Meter and Rhyme
> In this paper, we propose a joint architecture that captures language, rhyme and meter for sonnet modelling. We assess the quality of generated poems using crowd and expert judgements. The stress and rhyme models perform very well, as generated poems are largely indistinguishable from human-written poems. Expert evaluation, however, reveals that a vanilla language model captures meter implicitly, and that machine-generated poems still underperform in terms of readability and emotion. Our research shows the importance of expert evaluation for poetry generation, and that future research should look beyond rhyme/meter and focus on poetic language.
## Architecture

- 3 components: language model, pentameter model and rhyme model
- Language model predicts the next word of the current sonnet line given the preceding context, trained with categorical cross-entropy
- Pentameter model trained to learn the alternating iambic stress pattern
- Rhyme model uses a margin-based loss to separate rhyming word pairs from non-rhyming word pairs in a quatrain
### Language model
- Variant of an [[Long short-term memory|LSTM]] [[RNN encoder-decoder]] with [[Attention mechanism]]
#### Encoder
- Context words $z_i$ embedded with matrix $\mathbf{W}_{\mathrm{wrd}}$ to yield $\mathbf{w}_i$
- Fed to single-layer biLSTM to produce hidden states $\mathbf{h}_i = [ \overrightarrow{\mathbf{h}}_i ; \overleftarrow{\mathbf{h}}_i ]$
- Selective mechanism:[^1]
- we define the sentence representation $\mathbf{s} = [ \overrightarrow{\mathbf{h}}_C ; \overleftarrow{\mathbf{h}}_1 ]$ where $C$ is the number of context words (i.e. the final forward and backward hidden states)
- we filter the hidden states $\mathbf{h}_i$ using $\mathbf{s}$:
$
\mathbf{h}^\prime_i = \mathbf{h}_i \odot \sigma(\mathbf{W}_a \mathbf{h}_i + \mathbf{U}_a \mathbf{s} + \mathbf{b}_a)
$
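A minimal PyTorch sketch of this selective gate (my reconstruction, not the authors' code; layer names and dimensions are assumptions):

```python
import torch
import torch.nn as nn

class SelectiveGate(nn.Module):
    """Filter biLSTM states h_i with a sigmoid gate computed from the sentence vector s."""
    def __init__(self, hidden_dim):
        super().__init__()
        self.W_a = nn.Linear(hidden_dim, hidden_dim, bias=False)
        self.U_a = nn.Linear(hidden_dim, hidden_dim, bias=True)  # bias plays the role of b_a

    def forward(self, h, s):
        # h: (batch, context_len, hidden_dim)  -- biLSTM hidden states h_i
        # s: (batch, hidden_dim)               -- sentence vector [h_fwd_C ; h_bwd_1]
        gate = torch.sigmoid(self.W_a(h) + self.U_a(s).unsqueeze(1))
        return h * gate                        # element-wise filtered states h'_i
```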
#### Decoder
- Embed words $x_t$ in the current line using encoder-shared matrix $\mathbf{W}_{\mathrm{wrd}}$ to produce $\mathbf{w}_t$
- Embed characters of a word using $\mathbf{W}_{\mathrm{chr}}$ to produce $\mathbf{c}_{t,i}$ and feed them to biLSTM
- Character encoding of a word: $\overline{\mathbf{u}}_t = [ \overrightarrow{\mathbf{u}}_{t,L} ; \overleftarrow{\mathbf{u}}_{t,1} ]$ where $L$ is the word length
- Provides orthographic information, ==shared with pentameter model==, improves representation of unknown words
- Concatenate word and character encoding and feed it to a word-level LSTM to produce decoding states:
$
\mathbf{s}_t = \mathrm{LSTM}([\mathbf{w}_t; \overline{\mathbf{u}}_t], \mathbf{s}_{t-1} )
$
- Attend $\mathbf{s}_t$ to encoder hidden states $\mathbf{h}_i^\prime$:
$
\begin{align}
e_i^t & = \mathbf{v}_b^\mathsf{T} \tanh(\mathbf{W}_b \mathbf{h}_i^\prime + \mathbf{U}_b \mathbf{s}_t + \mathbf{b}_b) \\
\mathbf{a}^t & = \mathrm{softmax}(\mathbf{e}^t) \\
\mathbf{h}_t^* & = \sum_i a_i^t \mathbf{h}_i^\prime \\
\mathbf{s}_t^\prime & = \mathrm{GRU}(\mathbf{s}_t, \mathbf{h}_t^*) \\
\end{align}
$
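A hedged sketch of this attention step, assuming encoder and decoder states share one dimension for brevity (the separate projection matrices in the formulas would let them differ); all names are mine:

```python
import torch
import torch.nn as nn

class DecoderAttention(nn.Module):
    """Attend decoder state s_t to filtered encoder states h'_i and merge with a GRU cell."""
    def __init__(self, dim):
        super().__init__()
        self.W_b = nn.Linear(dim, dim, bias=False)
        self.U_b = nn.Linear(dim, dim, bias=True)   # bias plays the role of b_b
        self.v_b = nn.Linear(dim, 1, bias=False)
        self.combine = nn.GRUCell(dim, dim)         # s'_t = GRU(s_t, h*_t)

    def forward(self, s_t, h_enc):
        # s_t: (batch, dim) decoder state; h_enc: (batch, src_len, dim) filtered encoder states
        scores = self.v_b(torch.tanh(self.W_b(h_enc) + self.U_b(s_t).unsqueeze(1))).squeeze(-1)
        a_t = torch.softmax(scores, dim=-1)                      # attention weights a^t
        h_star = torch.bmm(a_t.unsqueeze(1), h_enc).squeeze(1)   # context vector h*_t
        return self.combine(s_t, h_star)                         # updated state s'_t
```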
- Then feed $\mathbf{s}_t^\prime$ to linear layer with softmax activation to produce vocabulary distribution
- Optimisation with standard categorical cross-entropy loss
- Dropout as regularisation, applied to encoder/decoder LSTM outputs and word embedding lookup, idem for pentameter and rhyme models
- Small dataset so pre-train word embeddings and reduce parameters by weight-sharing: $ \mathbf{W}_\mathrm{out} = \tanh(\mathbf{W}_\mathrm{wrd} \mathbf{W}_\mathrm{prj}) $
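A sketch of the tied output layer; the sizes below are illustrative assumptions, not values from the paper:

```python
import torch
import torch.nn as nn

V, E, D = 10000, 100, 600                  # assumed sizes: vocabulary, embedding, decoder state
W_wrd = nn.Parameter(torch.randn(V, E))    # input embedding matrix, shared with encoder/decoder
W_prj = nn.Parameter(torch.randn(E, D))    # small projection: the only extra output parameters

def vocab_logits(s_prime):
    # s_prime: (batch, D) decoder output s'_t
    W_out = torch.tanh(W_wrd @ W_prj)      # (V, D) tied output matrix
    return s_prime @ W_out.t()             # (batch, V) logits; softmax gives P(next word)
```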
### Pentameter model
- Goal: predict 10 binary stress symbols sequentially
- Preprocess data to remove punctuation → only spaces and letters
- [[RNN encoder-decoder]]
#### Encoder
- Embed characters using $\mathbf{W}_\mathrm{chr}$, feed them to character-level biLSTM to produce character encodings: $\mathbf{u}_j = [ \overrightarrow{\mathbf{u}}_j ; \overleftarrow{\mathbf{u}}_j ]$
#### Decoder
- $ \mathbf{g}_t = \mathrm{LSTM}(\mathbf{u}_{t-1}^*, \mathbf{g}_{t-1})$
where $\mathbf{u}_{t-1}^*$ is the weighted sum of character encodings from the previous time step, produced by the attention network described below, and $\mathbf{g}_t$ is fed to a linear layer with softmax activation to compute the stress distribution
- Attention network designed to focus on stress-producing characters, whose positions are monotonically increasing
- First, compute $\mu_t$, the mean position of focus:
$
\begin{align}
\mu_t^\prime & = \sigma(\mathbf{v}_c^\mathsf{T} \tanh(\mathbf{W}_c \mathbf{g}_t + \mathbf{U}_c \mu_{t-1} + \mathbf{b}_c)) \\
\mu_t & = M \times \min(\mu_t^\prime + \mu_{t-1} / M, 1.0) \\
\end{align}$
where $M$ is the number of characters in the sonnet line
- Given $\mu_t$, compute unnormalised probability for each character position:
$ p_j^t = \exp \left( \frac{-(j-\mu_t)^2}{2T^2} \right) $
where standard deviation $T$ is a hyperparameter
- Compute $\mathbf{u}_t^*$:
$
\begin{align}
\mathbf{u}_j^\prime & = p_j^t \mathbf{u}_j \\
d_j^t & = \mathbf{v}_d^\mathsf{T} \tanh(\mathbf{W}_d \mathbf{u}_j^\prime + \mathbf{U}_d \mathbf{g}_t + \mathbf{b}_d) \\
\mathbf{f}^t & = \mathrm{softmax}(\mathbf{d}^t + \log \mathbf{p}^t) \\
\mathbf{u}_t^* & = \sum_j f_j^t \mathbf{u}_j \\
\end{align}
$
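A sketch of one attention step combining the Gaussian position prior with the content scores, following the formulas above (class name, shapes and the masking detail are my assumptions):

```python
import torch
import torch.nn as nn

class PentameterAttention(nn.Module):
    """One step of position-constrained attention over character encodings u_j."""
    def __init__(self, char_dim, dec_dim, att_dim, T=1.0):
        super().__init__()
        self.T = T                                           # std-dev hyperparameter
        self.W_c = nn.Linear(dec_dim, att_dim, bias=True)    # bias plays the role of b_c
        self.U_c = nn.Linear(1, att_dim, bias=False)
        self.v_c = nn.Linear(att_dim, 1, bias=False)
        self.W_d = nn.Linear(char_dim, att_dim, bias=True)   # bias plays the role of b_d
        self.U_d = nn.Linear(dec_dim, att_dim, bias=False)
        self.v_d = nn.Linear(att_dim, 1, bias=False)

    def forward(self, u, g_t, mu_prev, space_mask):
        # u: (batch, M, char_dim) char encodings; g_t: (batch, dec_dim) decoder state
        # mu_prev: (batch, 1) previous mean position; space_mask: (batch, M), 1 at spaces
        M = u.size(1)
        # monotonically increasing mean position of focus
        mu_inc = torch.sigmoid(self.v_c(torch.tanh(self.W_c(g_t) + self.U_c(mu_prev))))
        mu_t = M * torch.clamp(mu_inc + mu_prev / M, max=1.0)
        # Gaussian (unnormalised) position probabilities p^t over character positions
        pos = torch.arange(M, dtype=u.dtype, device=u.device).unsqueeze(0)
        p_t = torch.exp(-(pos - mu_t) ** 2 / (2 * self.T ** 2))
        # content scores on position-weighted encodings, plus log position probabilities
        u_weighted = p_t.unsqueeze(-1) * u
        d_t = self.v_d(torch.tanh(self.W_d(u_weighted) + self.U_d(g_t).unsqueeze(1))).squeeze(-1)
        scores = d_t + torch.log(p_t + 1e-8)
        scores = scores.masked_fill(space_mask.bool(), float('-inf'))  # spaces get zero weight
        f_t = torch.softmax(scores, dim=-1)
        # weighted sum of character encodings, fed to the decoder at the next step
        u_star = torch.bmm(f_t.unsqueeze(1), u).squeeze(1)
        return u_star, f_t, mu_t
```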
- Initial input $\mathbf{u}_0^*$ and state $\mathbf{g}_0$ are a trainable vector and a zero vector, respectively
- Spaces are masked out, so they always yield zero attention weights
- Position information is used twice: (1) when computing $d_j^t$, by weighting the character encodings, and (2) when computing $\mathbf{f}^t$, by adding the position log-probabilities → this combination gives the best performance in practice
- ==Normally $\mathbf{u}_t^*$ would be combined with $\mathbf{g}_t$ to compute the output probability distribution, but the model would quickly learn to ignore $\mathbf{u}_t^*$==
- Therefore, we use only $\mathbf{u}_t^*$:
$
P(S_t) = \sigma(\mathbf{W}_e \mathbf{u}_t^* + b_e)
$
which gives the loss $\mathcal{L}_\mathrm{ent} = \sum_t - \log P(S_t^*)$ for the whole sequence, where $S_t^*$ is the target stress at time step $t$
- Tendency to attend to the same characters despite the incorporation of position information → two loss penalties for regularisation
- Repeat loss penalises the model when it attends to previously attended characters by keeping a sum of attention weights over all previous time steps
$
\mathcal{L}_\mathrm{rep} = \sum_t \sum_j \min \left( f_j^t, \sum_{t^\prime=1}^{t-1} f_j^{t^\prime} \right)
$
- Coverage loss penalises the model when vowels are ignored:
$
\mathcal{L}_\mathrm{cov} = \sum_{j \in V} \mathrm{ReLU}\left(C - \sum_{t=1}^{10} f_j^t \right)
$
where $V$ is a set of positions containing vowel characters, and $C$ is a hyperparameter that defines the minimum attention threshold that avoids penalty
- Total loss
$
\mathcal{L}_\mathrm{pm} = \mathcal{L}_\mathrm{ent} + \alpha \mathcal{L}_\mathrm{rep} + \beta \mathcal{L}_\mathrm{cov}
$
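A compact sketch of the three loss terms, assuming the per-step attention weights and target-stress probabilities have already been collected into tensors (names are mine):

```python
import torch

def pentameter_loss(p_target, att, vowel_mask, alpha, beta, C):
    # p_target:   (batch, 10)    probability assigned to the gold stress at each step
    # att:        (batch, 10, M) attention weights f^t over character positions
    # vowel_mask: (batch, M)     1.0 at positions holding vowel characters
    # cross-entropy term L_ent
    l_ent = -torch.log(p_target + 1e-8).sum(dim=1)
    # repeat penalty: attending again to already-attended characters
    cum = torch.cumsum(att, dim=1) - att             # sum of f^{t'} for t' < t
    l_rep = torch.minimum(att, cum).sum(dim=(1, 2))
    # coverage penalty: every vowel position should receive at least C attention in total
    total = att.sum(dim=1)                           # (batch, M)
    l_cov = (torch.relu(C - total) * vowel_mask).sum(dim=1)
    return (l_ent + alpha * l_rep + beta * l_cov).mean()
```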
### Rhyme model
- Rhyme is learned without supervision for two reasons: (1) the model can be extended to other languages, and (2) the dataset is not in modern English, so standard pronunciation resources may not apply
- Feed sentence-ending word pairs of a quatrain as input to the rhyme model to train it to separate rhyming from non-rhyming pairs; no rhyming scheme assumed
- Training example: three word pairs generated by pairing one target word with other reference words in the quatrain; we assume that there is only one rhyming pair
- Experiments show performance improves when increasing the number of negative examples
- For each word $x$ in the word pairs, embed its characters using $\mathbf{W}_\mathrm{chr}$ and feed them to a unidirectional forward LSTM to produce character states $\mathbf{u}_j$; this LSTM's parameters are not shared with the other components
- Word encoding represented by last state $\overline{\mathbf{u}} = \mathbf{u}_L$ where $L$ is the length of the word
- Margin-based loss to optimise the model:
$
\begin{align}
Q & = \{ \cos(\overline{\mathbf{u}}_t, \overline{\mathbf{u}}_r), \cos(\overline{\mathbf{u}}_t, \overline{\mathbf{u}}_{r+1}), \dots \} \\
\mathcal{L}_\mathrm{rm} & = \max (0, \delta - \mathrm{top}(Q,1) + \mathrm{top}(Q,2)) \\
\end{align}$
where $\mathrm{top}(Q,k)$ returns the $k$-th largest element in $Q$, and $\delta$ is a hyperparameter
- Model trained to learn a sufficient margin (defined by $\delta$) that separates the best pair from all others
- During generation, estimate whether two words rhyme by computing cosine similarity score, and resample words as necessary to enforce rhyme
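A sketch of the margin loss for a single training example, plus the generation-time rhyme check; the similarity threshold is my assumption, not a value from the paper:

```python
import torch
import torch.nn.functional as F

def rhyme_margin_loss(u_target, u_refs, delta=0.5):
    # u_target: (dim,)          character-LSTM encoding of the target word
    # u_refs:   (num_refs, dim) encodings of the other line-ending words in the quatrain
    q = F.cosine_similarity(u_target.unsqueeze(0), u_refs, dim=-1)  # cosine scores Q
    top2 = torch.topk(q, k=2).values                # best and second-best pair
    return torch.relu(delta - top2[0] + top2[1])    # hinge: separate the best pair from the rest

def rhymes(u_a, u_b, threshold=0.8):
    # at generation time, treat a pair as rhyming if cosine similarity exceeds a threshold
    # (threshold value is an assumption for illustration)
    return F.cosine_similarity(u_a, u_b, dim=-1) > threshold
```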
### Generation procedure
- Focus on quatrain generation
- Feed hidden state from previous time step to language model’s decoder to compute vocabulary distribution for current time step
- Sample words with temperature between $0.6$ and $0.8$
- Resample if the sampled word is (1) the UNK token; (2) a non-stopword that was generated before; (3) any word already generated twice or more; (4) one of the 3 preceding words; or (5) a symbol such as parentheses or quotation marks
- First sonnet line generated without any context
- Pentameter model incorporation: given a generated sonnet line, the pentameter model computes its loss $\mathcal{L}_\mathrm{pm}$; 10 candidate lines are generated from the same hidden state, their $\mathcal{L}_\mathrm{pm}$ scores are converted into probabilities by taking their softmax, and one line is sampled with temperature $0.1$ based on these probabilities (see the sketch below)
- Rhyme model incorporation: randomly select a rhyming scheme and resample sentence-ending words as necessary
- Language model generates words from last to first
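A sketch of the pentameter re-ranking step during generation; I assume the softmax is taken over negated losses so that a lower $\mathcal{L}_\mathrm{pm}$ yields a higher sampling probability (function names are placeholders):

```python
import torch

def pick_line(candidate_lines, pentameter_loss_fn, temperature=0.1):
    # candidate_lines: list of 10 lines generated from the same hidden state
    # pentameter_loss_fn: returns L_pm for one line (lower = closer to iambic pentameter)
    losses = torch.tensor([pentameter_loss_fn(line) for line in candidate_lines])
    probs = torch.softmax(-losses / temperature, dim=0)   # low loss -> high probability
    idx = torch.multinomial(probs, num_samples=1).item()
    return candidate_lines[idx]
```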
---
## 📚 References
[^1]: Zhou, Qingyu, et al. “Selective Encoding for Abstractive Sentence Summarization.” Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2017, pp. 1095–104. arXiv.org, https://doi.org/10.18653/v1/P17-1101.