> Nie, Shen, et al. _Large Language Diffusion Models_. arXiv:2502.09992, arXiv, 18 Feb. 2025. _arXiv.org_, [https://doi.org/10.48550/arXiv.2502.09992](https://doi.org/10.48550/arXiv.2502.09992).

# Large Language Diffusion Models

## Background & Motivation

- **Autoregressive models (ARMs)**, which predict text one token at a time, are the foundation of current large language models (LLMs).
- The authors challenge the assumption that ARMs are the only viable approach to high-performing LLMs, arguing that the core properties of LLMs stem from general **generative modeling principles**, not from the autoregressive formulation itself.

## Key Contribution: LLaDA

- The paper introduces **LLaDA** (Large Language Diffusion with mAsking), a **diffusion-based language model** trained from scratch.
- ==**LLaDA models language by gradually masking and then reconstructing tokens in a sequence, using a bidirectional Transformer to predict masked tokens in parallel.**==
- This approach gives the model **bidirectional context** (unlike left-to-right ARMs) and offers a new generative modeling framework for language.

![|600](https://i.imgur.com/0PZn8As.png)

## Methodology

- **Diffusion process:**
    - **Forward:** Randomly masks each token in a sequence, with a masking ratio that varies from 0 (no masking) to 1 (fully masked).
    - **Reverse:** The model predicts all masked tokens simultaneously to reconstruct the original sequence (a minimal code sketch is given at the end of this note).
- **Training:**
    - Pre-trained on 2.3 trillion tokens (comparable to state-of-the-art LLMs).
    - Fine-tuned on 4.5 million instruction-following pairs.
    - Uses a standard Transformer architecture, but without the causal attention mask.
- **Evaluation:**
    - Benchmarked on standard LLM tasks: language understanding, math, code, and Chinese.
    - Compared against strong ARM baselines (e.g., LLaMA2 7B, LLaMA3 8B).

## Results & Findings

- **Scalability:** LLaDA scales effectively to 8B parameters and is competitive with leading ARMs of similar size.
- **In-context learning:** LLaDA 8B matches or outperforms LLaMA2 7B and is on par with LLaMA3 8B on zero/few-shot tasks.
- **Instruction following:** After supervised fine-tuning, LLaDA demonstrates strong instruction-following and multi-turn dialogue abilities.
- **Reversal reasoning:** LLaDA excels at tasks requiring "reversal reasoning" (e.g., reversal poem completion), outperforming even GPT-4o; this is an area where ARMs typically struggle due to their left-to-right bias.
- **Efficiency:** Because LLaDA predicts masked tokens in parallel, it can fill in many tokens per step rather than generating strictly one token at a time as ARMs do.

## Implications

- ==**Diffusion models are a viable and promising alternative to autoregressive models for LLMs.**==
- **Key LLM abilities (scalability, in-context learning, instruction following) are not exclusive to ARMs.**
- LLaDA's approach addresses some ARM limitations, such as strictly sequential generation and poor performance on certain reasoning tasks.

---

_Note: this page was at least partly written using generative AI._
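
## Code Sketch: Masked-Diffusion Training Objective

To make the forward/reverse process in the Methodology section concrete, here is a minimal sketch of a LLaDA-style masked-diffusion training loss in PyTorch. It is written under stated assumptions, not from the authors' released code: `MASK_ID`, the tiny stand-in model, and all hyperparameters are illustrative placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

MASK_ID = 0  # assumed id of the special [MASK] token (placeholder)


def masked_diffusion_loss(model, x0):
    """Monte Carlo estimate of a masked-diffusion training loss.

    x0: LongTensor of shape (batch, seq_len) holding clean token ids.
    """
    b, n = x0.shape
    # Forward process: draw a masking ratio t ~ U(0, 1) per sequence and
    # mask each token independently with probability t.
    t = torch.rand(b, 1, device=x0.device).clamp(min=1e-3)
    mask = (torch.rand(b, n, device=x0.device) < t).float()
    xt = torch.where(mask.bool(), torch.full_like(x0, MASK_ID), x0)

    # Reverse model: a bidirectional (non-causal) Transformer predicts
    # every masked token in parallel.
    logits = model(xt)  # (batch, seq_len, vocab)
    ce = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)), x0.reshape(-1), reduction="none"
    ).view(b, n)

    # Cross-entropy on masked positions only, weighted by 1/t, averaged
    # over all tokens in the batch.
    return (mask * ce / t).sum() / (b * n)


class TinyBidirectionalLM(nn.Module):
    """Toy stand-in for the 8B bidirectional Transformer (illustration only)."""

    def __init__(self, vocab=1000, dim=64):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.enc = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, vocab)

    def forward(self, x):
        # No causal attention mask: every position sees the full sequence.
        return self.head(self.enc(self.emb(x)))


if __name__ == "__main__":
    model = TinyBidirectionalLM()
    x0 = torch.randint(1, 1000, (2, 16))  # fake "clean" sequences
    print(masked_diffusion_loss(model, x0))
```

Generation runs the process in the other direction: start from a fully masked sequence, predict all tokens at once, re-mask a fraction of them, and repeat for a fixed number of steps.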