> Ouyang, Long, et al. _Training Language Models to Follow Instructions with Human Feedback_. arXiv:2203.02155, arXiv, 4 Mar. 2022. _arXiv.org_, [https://doi.org/10.48550/arXiv.2203.02155](https://doi.org/10.48550/arXiv.2203.02155).

# Training language models to follow instructions with human feedback

## Introduction

- The language modeling objective (predicting the next token) is not _aligned_ with following the user's instructions
- Reinforcement learning from human feedback (RLHF):
    1. Supervised fine-tuning (SFT)
    2. Reward model (RM) training
    3. Reinforcement learning (RL) via proximal policy optimization (PPO)
- Main findings:
    1. Labelers prefer InstructGPT outputs over GPT-3 outputs
    2. InstructGPT is more truthful than GPT-3
    3. InstructGPT is less toxic than GPT-3, but not less biased
    4. Modifying the RLHF procedure (mixing in pretraining gradients, "PPO-ptx") mitigates performance regressions on public NLP datasets
    5. Results generalize to held-out labelers (who did not produce training data)
    6. Models fine-tuned on public NLP datasets (FLAN, T0) perform worse than InstructGPT on the API prompt distribution
    7. Good generalization to tasks outside the RLHF fine-tuning distribution (e.g. non-English instructions, code)
    8. InstructGPT still makes simple mistakes

## Methods

### High-level Methodology

1. Collect demonstration data, train a supervised policy
2. Collect comparison data, train a reward model
3. Optimize a policy against the reward model using PPO

### Dataset

3 kinds of prompts:

1. Plain: arbitrary tasks with sufficient diversity
2. Few-shot: an instruction with multiple query/response pairs
3. User-based: use cases from waitlist applications to the OpenAI API

### Human data collection

- 40 contractors hired via Upwork and ScaleAI, selected with a screening test
- Labelers prioritize helpfulness during training, truthfulness and harmlessness during evaluation
- Inter-annotator agreement: 73-77%

### Models

- SFT: overfits after 1 epoch, but RM score and human preference ratings still improve with further training
- RM: a 6B model was enough; trained on all $\binom{K}{2}$ response comparisons from each prompt (pairwise ranking loss sketched below)
- RL: fine-tune the SFT model using PPO, with the value function initialized from the RM (full objective sketched below)
- Main metric: labeler preference ratings

### Results
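
For reference, the RM's pairwise ranking loss over labeler comparisons (Eq. 1 in the paper), where $y_w$ is the preferred and $y_l$ the dispreferred response to prompt $x$:

$$
\operatorname{loss}(\theta) = -\frac{1}{\binom{K}{2}} \, \mathbb{E}_{(x,\, y_w,\, y_l) \sim D}\left[\log\left(\sigma\left(r_\theta(x, y_w) - r_\theta(x, y_l)\right)\right)\right]
$$

Here $r_\theta(x, y)$ is the scalar reward for response $y$ to prompt $x$; all $\binom{K}{2}$ comparisons from one prompt are treated as a single batch element to avoid overfitting.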
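
A minimal PyTorch sketch of that loss for a single prompt, assuming the $K$ responses arrive already sorted from most to least preferred (function and variable names are my own, not from the paper):

```python
import torch
import torch.nn.functional as F

def pairwise_rm_loss(rewards: torch.Tensor) -> torch.Tensor:
    """Pairwise ranking loss over all C(K, 2) comparisons for one prompt.

    `rewards` holds the scalar reward-model outputs for K responses to the
    same prompt, sorted best -> worst by the labeler (an assumption of this
    sketch; the paper only needs to know which response of each pair won).
    """
    K = rewards.shape[0]
    losses = []
    for i in range(K):               # index of the preferred response
        for j in range(i + 1, K):    # index of the dispreferred response
            # -log sigma(r_w - r_l): logistic loss on the reward gap
            losses.append(-F.logsigmoid(rewards[i] - rewards[j]))
    # Average over all C(K, 2) pairs so prompts with more responses
    # do not dominate the gradient.
    return torch.stack(losses).mean()

# Example: 4 responses to one prompt, already ranked best -> worst.
scores = torch.tensor([1.3, 0.7, 0.2, -0.5], requires_grad=True)
print(pairwise_rm_loss(scores))  # scalar loss over 6 pairs
```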
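
The RL objective maximized with PPO (Eq. 2 in the paper) combines the RM score, a per-token KL penalty against the SFT policy, and a pretraining-gradient mix; $\gamma > 0$ gives "PPO-ptx", the modification that mitigates the performance regressions noted above:

$$
\operatorname{objective}(\phi) = \mathbb{E}_{(x, y) \sim D_{\pi_\phi^{\mathrm{RL}}}}\!\left[ r_\theta(x, y) - \beta \log\frac{\pi_\phi^{\mathrm{RL}}(y \mid x)}{\pi^{\mathrm{SFT}}(y \mid x)} \right] + \gamma \, \mathbb{E}_{x \sim D_{\mathrm{pretrain}}}\!\left[ \log \pi_\phi^{\mathrm{RL}}(x) \right]
$$

Here $\pi_\phi^{\mathrm{RL}}$ is the learned policy, $\pi^{\mathrm{SFT}}$ the supervised fine-tuned model, $\beta$ the KL coefficient, and $\gamma$ the pretraining loss coefficient.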