> Van de Cruys, Tim. 'Automatic Poetry Generation from Prosaic Text'. *Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics*, Association for Computational Linguistics, 2020, pp. 2471–80. ACLWeb, <https://doi.org/10.18653/v1/2020.acl-main.223>.

# Automatic Poetry Generation from Prosaic Text

> In the last few years, a number of successful approaches have emerged that are able to adequately model various aspects of natural language. In particular, language models based on neural networks have improved the state of the art with regard to predictive language modeling, while topic models are successful at capturing clear-cut, semantic dimensions. In this paper, we will explore how these approaches can be adapted and combined to model the linguistic and literary aspects needed for poetry generation. The system is exclusively trained on standard, non-poetic text, and its output is constrained in order to confer a poetic character to the generated verse. The framework is applied to the generation of poems in both English and French, and is equally evaluated for both languages. Even though it only uses standard, non-poetic text as input, the system yields state of the art results for poetry generation.

## Introduction

- [[RNN encoder-decoder]]
- Poetic constraints are enforced by modifying the probability distribution yielded by the decoder
- Training on prosaic texts extracted from the web

## Model

![Poetry generation model](https://i.imgur.com/V1KELpy.png)

### Neural architecture

- [[Gated recurrent unit]] for seq2seq
- General [[Attention mechanism]] plugged in
- Words are sampled randomly according to the output probability distribution
- The decoder is trained to predict the next sentence in reverse: the last word is the first one generated
    - Important for effective rhyme incorporation

### Poetic constraints as *a priori* distributions

#### Rhyme constraint

- Phonetic representation of words taken from *Wiktionary*
- Prior probability distribution:

$$
p_\mathrm{rhyme}(\mathbf{w}) = \frac{1}{Z}\,\mathbf{x}
\quad \text{where} \quad
\begin{cases}
x_i = 1 & \text{if } i \in R \\
x_i = \varepsilon & \text{otherwise}
\end{cases}
$$

  where $R$ is the set of words that rhyme with the word the current verse needs to rhyme with.
- The decoder's output distribution is combined with this prior by element-wise multiplication and renormalisation (see the sketch below):

$$
p_\mathrm{out}(\mathbf{w}) = \frac{1}{Z} \left[ p(\mathbf{w}^t \mid w^{<t}, S_i) \odot p_\mathrm{rhyme}(\mathbf{w}) \right]
$$
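A minimal sketch of the rhyme constraint described above, assuming the distributions are plain NumPy arrays over the vocabulary; the function name, the `eps` value, and the toy example are illustrative placeholders rather than details from the paper.

```python
import numpy as np

def apply_rhyme_prior(p_decoder, rhyme_word_ids, vocab_size, eps=1e-4):
    """Combine the decoder's next-word distribution with a rhyme prior.

    `rhyme_word_ids` are the vocabulary indices of words that rhyme with
    the target; `eps` plays the role of the small epsilon in the notes.
    """
    # Build the prior: 1 for rhyming words, a small epsilon elsewhere.
    x = np.full(vocab_size, eps)
    x[list(rhyme_word_ids)] = 1.0
    p_rhyme = x / x.sum()                  # 1/Z normalisation

    # Element-wise product with the decoder distribution, then renormalise.
    p_out = p_decoder * p_rhyme
    return p_out / p_out.sum()

# Example: toy vocabulary of 10 words, words 3 and 7 rhyme with the target.
p_decoder = np.random.dirichlet(np.ones(10))
p_out = apply_rhyme_prior(p_decoder, {3, 7}, vocab_size=10)
next_word = np.random.choice(10, p=p_out)  # sample, as the notes describe
```

Since decoding proceeds in reverse, the rhyme word is the first token generated, which is presumably where this prior is applied.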
#### Topical constraint

- Latent semantic model based on [[Non-negative matrix factorisation]]
- Able to induce clear-cut, interpretable topical dimensions
- Input: a frequency matrix $\mathbf{A}$ (co-occurrence frequencies, weighted by pointwise mutual information), factorised into two non-negative matrices $\mathbf{W}$ and $\mathbf{H}$:

$$
\mathbf{A}_{i \times j} \approx \mathbf{W}_{i \times k}\,\mathbf{H}_{k \times j}
\quad \text{where} \quad k \ll i, j
$$

- The Kullback–Leibler divergence between $\mathbf{A}$ and $\mathbf{WH}$ is minimised through gradient descent, yielding the multiplicative updates:

$$
\mathbf{H}_{a\mu} \leftarrow \mathbf{H}_{a\mu} \frac{\sum_i \mathbf{W}_{ia} \frac{\mathbf{A}_{i\mu}}{(\mathbf{W}\mathbf{H})_{i\mu}}}{\sum_k \mathbf{W}_{ka}}
\qquad
\mathbf{W}_{ia} \leftarrow \mathbf{W}_{ia} \frac{\sum_\mu \mathbf{H}_{a\mu} \frac{\mathbf{A}_{i\mu}}{(\mathbf{W}\mathbf{H})_{i\mu}}}{\sum_v \mathbf{H}_{av}}
$$

- $\mathbf{W}$ can be seen as $p(\mathbf{w} \mid k)$, the probability of a word $\mathbf{w}$ given a latent dimension $k$
- $p(\mathbf{w} \mid k)$ can then be used as another prior distribution but, in order to maintain syntactic consistency, it is only applied when the output distribution's entropy is high, i.e. when the model is not confident about its next word

#### Global optimisation framework

- Verse generation is a sampling process → optimisation for the best sample
- Additional helpful criteria can be defined
- Multiple candidates are evaluated with multiple scores → min-max normalisation and harmonic mean for ranking (see the sketch at the end of these notes)
- The authors also experimented with rhythmic constraints based on meter and stress, but initial experiments indicated that the system tends to output very rigid verse; simple syllable counting yields more interesting variation

## Results and evaluation

### Implementation details

- Training corpus of 500 million words per language, vocabulary of 15K words
- Both encoder and decoder consist of two GRU layers with a hidden state of size 2048; word embeddings are of size 512
- Encoder, decoder, and output embeddings are all shared
- Model parameters are optimised with stochastic gradient descent, initial learning rate 0.2, divided by 4 when the loss no longer improves on a held-out validation set
- Batch size of 64, with gradient clipping
- Entropy threshold of 2.70 for applying the topical constraint
- The n-gram model is a standard Kneser-Ney smoothed trigram model implemented with KenLM
- The NMF model is factorised into 100 dimensions
- The n-gram and NMF models are trained on a 10-billion-word corpus
- Syllable length: $\mu = 12$, $\sigma = 2$
- About 2000 candidates are generated for each verse; no cherry-picking in the evaluation

### Evaluation procedure

- Human evaluation on fluency, coherence, meaningfulness, and poeticness
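Referenced from the global optimisation framework above: a minimal sketch of ranking candidate verses by min-max normalising each score across candidates and combining the normalised scores with a harmonic mean. The criterion names in the example and the `eps` guard are illustrative assumptions, not details from the paper.

```python
import numpy as np

def rank_candidates(scores, eps=1e-8):
    """Rank candidate verses from a matrix of raw scores.

    `scores` has shape (n_candidates, n_criteria). Each criterion is
    min-max normalised to [0, 1] across candidates, and the normalised
    scores of each candidate are combined with a harmonic mean, as in
    the notes above. `eps` only guards against division by zero.
    """
    scores = np.asarray(scores, dtype=float)

    # Min-max normalisation per criterion (column-wise).
    mins = scores.min(axis=0)
    maxs = scores.max(axis=0)
    normed = (scores - mins) / (maxs - mins + eps)

    # Harmonic mean across criteria for each candidate.
    n_criteria = scores.shape[1]
    harmonic = n_criteria / np.sum(1.0 / (normed + eps), axis=1)

    # Candidate indices, best first.
    return np.argsort(-harmonic)

# Example: 4 candidate verses scored on three hypothetical criteria
# (e.g. n-gram fluency, rhyme, topical fit).
raw = [[0.2, 1.0, 0.6],
       [0.8, 0.9, 0.4],
       [0.5, 0.2, 0.9],
       [0.7, 0.7, 0.7]]
ranking = rank_candidates(raw)
print("best candidate index:", ranking[0])
```

The harmonic mean heavily penalises a candidate that is weak on any single criterion, which is presumably why it is preferred here over a plain arithmetic mean.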