> Ouyang, Long, et al. _Training Language Models to Follow Instructions with Human Feedback_. arXiv:2203.02155, arXiv, 4 Mar. 2022. _arXiv.org_, [https://doi.org/10.48550/arXiv.2203.02155](https://doi.org/10.48550/arXiv.2203.02155).
# Training language models to follow instructions with human feedback
## Introduction
- Language modeling objective (predicting the next token) not _aligned_ with following the user's instructions
- Reinforcement learning from human feedback (RLHF):
1. Supervised fine-tuning (SFT)
2. Reward model (RM) training
3. Reinforcement learning (RL) via proximal policy optimization (PPO)
- Main findings:
1. Labelers significantly prefer InstructGPT outputs over GPT-3 outputs (even 1.3B InstructGPT over 175B GPT-3)
2. InstructGPT is more truthful than GPT-3
3. InstructGPT is less toxic than GPT-3, but not less biased
4. Modifying the RLHF procedure (mixing pretraining updates into PPO, "PPO-ptx") mitigates performance regressions on public NLP datasets
5. Results generalize to held-out labelers (who did not produce training data)
6. Fine-tuning on public NLP datasets (FLAN, T0) performs worse than fine-tuning on the API prompt distribution used for InstructGPT
7. Promising generalization to instructions outside the RLHF fine-tuning distribution (e.g. non-English prompts, code-related tasks)
8. InstructGPT still makes simple mistakes
## Methods
### High-level Methodology
1. Collect demonstration data, train a supervised policy
2. Collect comparison data, train a reward model
3. Optimize a policy against the reward model using PPO
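A minimal data-flow sketch of the three stages, assuming hypothetical helper names (`train_sft`, `train_reward_model`, `ppo_finetune`) with stub bodies; it only illustrates how each stage's output feeds the next, not OpenAI's implementation:

```python
# Data-flow sketch of the three RLHF stages. All function names are
# hypothetical placeholders with stub bodies.

def train_sft(pretrained_lm, demonstrations):
    """Stage 1: supervised fine-tuning on labeler-written demonstrations."""
    # In practice: fine-tune the pretrained LM to imitate the demonstrations.
    return pretrained_lm  # stands in for the SFT policy

def train_reward_model(sft_policy, comparisons):
    """Stage 2: fit a scalar reward r(prompt, response) to labeler rankings."""
    # In practice: initialize from the SFT model, swap the LM head for a scalar head.
    return lambda prompt, response: 0.0  # stands in for r_theta

def ppo_finetune(sft_policy, reward_model, prompts):
    """Stage 3: maximize the learned reward with PPO, KL-penalized toward SFT."""
    # In practice: sample responses, score them with reward_model, update the policy.
    return sft_policy  # stands in for the final InstructGPT policy

demonstrations, comparisons, prompts = [], [], []  # placeholder datasets
sft_policy = train_sft("pretrained GPT-3", demonstrations)
reward_model = train_reward_model(sft_policy, comparisons)
instruct_policy = ppo_finetune(sft_policy, reward_model, prompts)
```

Both the reward model and the PPO policy are initialized from the SFT model; only the reward model's scalar scores flow into the RL stage.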
### Dataset
3 kinds of prompts:
1. Plain: arbitrary task with sufficient diversity
2. Few-shot: instruction with multiple query/response pairs
3. User-based: prompts corresponding to use cases stated in waitlist applications to the OpenAI API
### Human data collection
- About 40 contractors hired via Upwork and ScaleAI, selected with a screening test
- Labelers prioritize helpfulness during training, truthfulness and harmlessness during final evaluations
- Inter-annotator agreement: 73-77%
### Models
- SFT: overfits on validation loss after 1 epoch, but RM score and human preference ratings keep improving with further training
- RM: a 6B model suffices (175B RM training was unstable); train on all $\binom{K}{2}$ response comparisons from each prompt as a single batch element (pairwise loss sketched below)
- RL: fine-tune the SFT model using PPO, with the value function initialized from the RM; a per-token KL penalty keeps the policy close to the SFT model (objective given below)
- Main metric: labeler preference ratings
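The reward model is trained with a pairwise ranking loss over labeler comparisons:

$$
\mathrm{loss}(\theta) = -\frac{1}{\binom{K}{2}} \, \mathbb{E}_{(x,\, y_w,\, y_l) \sim D}\!\left[ \log \sigma\!\left( r_\theta(x, y_w) - r_\theta(x, y_l) \right) \right]
$$

where $y_w$ is the preferred response in a pair. Below is a minimal PyTorch sketch of this loss for a single prompt, assuming a hypothetical `rewards` tensor holding reward-model scores for that prompt's $K$ responses (not the paper's code):

```python
# Pairwise reward-model loss for one prompt:
# mean over all C(K,2) pairs of -log sigmoid(r(winner) - r(loser)).
import itertools
import torch
import torch.nn.functional as F

def pairwise_rm_loss(rewards: torch.Tensor) -> torch.Tensor:
    """rewards: shape (K,), ordered so rewards[i] is preferred over rewards[j] for i < j."""
    losses = []
    for w, l in itertools.combinations(range(rewards.shape[0]), 2):
        # -log sigmoid(r_winner - r_loser) pushes the preferred response's score higher
        losses.append(-F.logsigmoid(rewards[w] - rewards[l]))
    return torch.stack(losses).mean()  # average over the C(K,2) comparisons

# Toy usage: K = 4 responses scored by the reward model for one prompt.
scores = torch.tensor([1.2, 0.7, 0.1, -0.5], requires_grad=True)
loss = pairwise_rm_loss(scores)
loss.backward()
```

Treating all $\binom{K}{2}$ comparisons from one prompt as a single batch element is what avoids the overfitting the paper reports when comparisons are shuffled into the dataset independently.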
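The RL stage maximizes the learned reward with a per-token KL penalty toward the SFT policy; the PPO-ptx variant (finding 4 above) adds a pretraining log-likelihood term with coefficient $\gamma$:

$$
\mathrm{objective}(\phi) = \mathbb{E}_{(x, y) \sim D_{\pi_\phi^{\mathrm{RL}}}}\!\left[ r_\theta(x, y) - \beta \log \frac{\pi_\phi^{\mathrm{RL}}(y \mid x)}{\pi^{\mathrm{SFT}}(y \mid x)} \right] + \gamma \, \mathbb{E}_{x \sim D_{\mathrm{pretrain}}}\!\left[ \log \pi_\phi^{\mathrm{RL}}(x) \right]
$$

Setting $\gamma = 0$ recovers plain PPO; the pretraining term is what mitigates the performance regressions on public NLP datasets.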
### Results