> Nie, Shen, et al. _Large Language Diffusion Models_. arXiv:2502.09992, arXiv, 18 Feb. 2025. _arXiv.org_, [https://doi.org/10.48550/arXiv.2502.09992](https://doi.org/10.48550/arXiv.2502.09992).
# Large Language Diffusion Models
## Background & Motivation
- **Autoregressive models (ARMs)**, which predict text one token at a time, are the foundation of current large language models (LLMs).
- The authors challenge the assumption that ARMs are the only viable approach for high-performing LLMs, arguing that the core properties of LLMs stem from general **generative modeling principles**, not the autoregressive formulation itself.
## Key Contribution: LLaDA
- The paper introduces **LLaDA** (Large Language Diffusion with mAsking), a **diffusion-based language model** trained from scratch.
- ==**LLaDA models language by gradually masking and then reconstructing tokens in a sequence, using a bidirectional Transformer to predict masked tokens in parallel.**==
- This approach allows for **bidirectional context** (unlike left-to-right ARMs) and offers a new generative modeling framework for language; the training objective is sketched just after this list.
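The highlighted idea corresponds to a simple training objective: sample a masking ratio t uniformly from (0, 1], mask each token of the clean sequence x₀ independently with probability t to obtain xₜ, and train the model to recover the masked tokens. Roughly (notation adapted here, not copied verbatim from the paper), the loss is a cross-entropy over the masked positions, reweighted by 1/t:

$$
\mathcal{L}(\theta) \;=\; -\,\mathbb{E}_{t,\,x_0,\,x_t}\left[\frac{1}{t}\sum_{i=1}^{L}\mathbf{1}\!\left[x_t^i = \mathrm{M}\right]\log p_\theta\!\left(x_0^i \mid x_t\right)\right]
$$

where L is the sequence length and M is the mask token.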

## Methodology
- **Diffusion Process:**
    - **Forward:** Masks each token in the sequence independently with probability t, where the masking ratio t is sampled uniformly from (0, 1].
    - **Reverse:** The model predicts all masked tokens simultaneously to reconstruct the original sequence (see the training-step sketch after this list).
- **Training:**
    - Pre-trained on 2.3 trillion tokens (comparable to state-of-the-art LLMs).
    - Fine-tuned on 4.5 million instruction-following pairs.
    - Uses a standard Transformer architecture, but without causal masking.
- **Evaluation:**
    - Benchmarked on standard LLM tasks: language understanding, math, code, and Chinese.
    - Compared against strong ARM baselines (e.g., LLaMA2 7B, LLaMA3 8B).
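As a concrete reading of the forward/reverse description above, here is a minimal training-step sketch in PyTorch-style Python. It is not the paper's code: the function name, tensor shapes, `mask_token_id` handling, and the loss normalization are illustrative assumptions; only the overall recipe (mask with ratio t, predict in parallel, cross-entropy on masked positions reweighted by 1/t) follows the paper's description.

```python
import torch
import torch.nn.functional as F

def training_step(model, x0, mask_token_id, vocab_size):
    """One masked-diffusion training step (sketch).

    `model` is assumed to be a bidirectional (non-causal) Transformer mapping
    token ids of shape (batch, seq_len) to logits (batch, seq_len, vocab_size).
    """
    batch, seq_len = x0.shape

    # Forward process: sample a masking ratio t ~ U(0, 1] per sequence and
    # mask each token independently with probability t.
    t = torch.rand(batch, 1, device=x0.device).clamp_min(1e-3)
    is_masked = torch.rand(batch, seq_len, device=x0.device) < t
    xt = torch.where(is_masked, torch.full_like(x0, mask_token_id), x0)

    # Reverse model: predict the original token at every position in parallel.
    logits = model(xt)  # (batch, seq_len, vocab_size)

    # Cross-entropy on masked positions only, reweighted by 1/t as in the
    # masked-diffusion objective; the averaging scheme here is an assumption.
    ce = F.cross_entropy(
        logits.reshape(-1, vocab_size), x0.reshape(-1), reduction="none"
    ).reshape(batch, seq_len)
    loss = (ce * is_masked / t).sum() / is_masked.sum().clamp_min(1)
    return loss
```

For SFT, the paper applies the same objective but masks only the response tokens, keeping the prompt visible to the model.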
## Results & Findings
- **Scalability:** LLaDA scales well to 8B parameters and is competitive with leading ARMs of similar size.
- **In-Context Learning:** LLaDA 8B matches or outperforms LLaMA2 7B and is on par with LLaMA3 8B on zero/few-shot tasks.
- **Instruction Following:** After supervised fine-tuning, LLaDA demonstrates strong instruction-following and multi-turn dialogue abilities.
- **Reversal Reasoning:** LLaDA excels at tasks requiring "reversal reasoning" (e.g., reversal poem completion), outperforming even GPT-4o on that task; ARMs typically struggle here because of their left-to-right bias (the "reversal curse").
- **Efficiency:** LLaDA predicts many tokens per sampling step, so the number of model calls can be traded off against output quality rather than being tied to sequence length as in token-by-token ARM decoding (see the sampling sketch after this list).
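The trade-off above comes from how sampling works: generation starts from a fully masked sequence and repeatedly predicts all positions in parallel, re-masking a shrinking subset at each step. Below is a minimal, hedged sketch of that reverse process; the linear unmasking schedule and the lowest-confidence re-masking rule are illustrative choices (the paper discusses more than one re-masking strategy), and `generate`, `num_steps`, and `mask_token_id` are placeholder names.

```python
import torch

@torch.no_grad()
def generate(model, seq_len, num_steps, mask_token_id):
    """Reverse-process sampling (sketch): start fully masked, predict all
    positions in parallel each step, keep the most confident predictions,
    and re-mask the rest until no masks remain."""
    x = torch.full((1, seq_len), mask_token_id, dtype=torch.long)

    for step in range(num_steps):
        logits = model(x)  # (1, seq_len, vocab_size)
        confidence, prediction = logits.softmax(-1).max(dim=-1)

        # Fill every currently masked position with the model's prediction.
        still_masked = x == mask_token_id
        x = torch.where(still_masked, prediction, x)

        # Linear schedule: decide how many positions stay masked after this
        # step, then re-mask the lowest-confidence newly predicted tokens.
        num_remask = int(seq_len * (1 - (step + 1) / num_steps))
        if num_remask > 0:
            conf = confidence.masked_fill(~still_masked, float("inf"))
            remask_idx = conf.topk(num_remask, largest=False).indices
            x[0, remask_idx[0]] = mask_token_id

    return x
```

Conditional generation (e.g., answering a prompt) works the same way, with the prompt tokens held fixed and never masked.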
## Implications
- ==**Diffusion models are a viable and promising alternative to autoregressive models for LLMs.**==
- **Key LLM abilities (scalability, in-context learning, instruction following) are not exclusive to ARMs.**
- LLaDA's approach addresses some ARM limitations, such as strictly sequential left-to-right generation and the resulting weakness on reversal-style reasoning tasks.
---
_Note: this page was at least partly written using generative AI._