> Bai, Yuntao, et al. _Constitutional AI: Harmlessness from AI Feedback_. arXiv:2212.08073, arXiv, 15 Dec. 2022. _arXiv.org_, [https://doi.org/10.48550/arXiv.2212.08073](https://doi.org/10.48550/arXiv.2212.08073).

# Constitutional AI: Harmlessness from AI Feedback

## **Objective**

The paper introduces **Constitutional AI (CAI)**, a method to train AI assistants to be helpful, honest, and harmless, **without relying on human feedback labels for harmfulness**. Instead, ==**the process uses a set of guiding principles (a "constitution") and leverages AI self-improvement.**==

## **Key Concepts and Methods**

1. **Constitutional AI Approach**
    - **Constitution:** A small set of clear, natural-language principles that the AI should follow (e.g., "do not give harmful advice").
    - **Self-improvement:** The AI critiques and revises its own responses based on these principles, rather than relying on human-labeled data for harmfulness.
2. **Training Process** (a minimal Python sketch of both stages appears in the appendix at the end of this page)
    - ==**Supervised Learning (SL) Stage: Critique → Revision → Supervised Learning**==
        - The AI generates responses to prompts.
        - It then critiques its own responses using constitutional principles and revises them accordingly.
        - The model is finetuned on these revised, self-critiqued responses.
    - ==**Reinforcement Learning (RL) Stage: AI Comparison Evaluations → Preference Model → Reinforcement Learning**==
        - The AI compares pairs of its own responses and chooses the better one according to the constitution.
        - These preferences are used to train a preference model (PM).
        - The AI is then further trained with RL, using the PM as the reward signal; this is called **Reinforcement Learning from AI Feedback (RLAIF)**.
3. **Chain-of-Thought Reasoning**
    - Both the SL and RL stages use chain-of-thought reasoning to make the AI's decision-making more transparent and robust.

## **Results and Findings**

- **Reduced Harmfulness:** CAI-trained models are less likely to produce harmful or evasive responses than models trained with human feedback on harmfulness.
- **Non-Evasiveness:** Unlike previous models that simply refuse to answer controversial or harmful queries, CAI models explain their objections, increasing transparency and usefulness.
- **Efficiency:** The method greatly reduces the need for large-scale human feedback, relying instead on a small, explicit set of principles.
- **Performance:** Crowdworkers preferred the responses of CAI-trained models over those of models trained with human feedback on harmlessness, at similar levels of helpfulness.

## **Contributions**

- **Scalable Supervision:** Demonstrates that AI can help supervise other AIs, scaling oversight efficiently.
- **Transparency:** Makes the principles guiding AI behavior explicit and understandable.
- **Open Resources:** The authors provide a [GitHub repository](https://github.com/anthropics/ConstitutionalHarmlessnessPaper) with example prompts, principles, and model outputs.

---

_Note: this page was at least partly written using generative AI._
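
---

## **Appendix: Illustrative Sketches**

The two sketches below are not from the paper; they only illustrate the structure of the two training stages described above. The `generate` callable is a hypothetical stand-in for querying a language model, and the principle texts are paraphrased rather than quoted from the paper's constitution.

**SL stage (critique → revision → supervised learning):** a minimal sketch of how critique-revision data for supervised finetuning might be assembled.

```python
import random
from typing import Callable, Dict, List

# Hypothetical stand-in for sampling from a helpful-only language model;
# in practice this would call an actual LLM.
Generate = Callable[[str], str]

# Paraphrased critique/revision instruction pairs (not verbatim from the paper).
PRINCIPLES: List[Dict[str, str]] = [
    {
        "critique": "Identify ways the response is harmful, unethical, or dangerous.",
        "revision": "Rewrite the response to remove harmful, unethical, or dangerous content.",
    },
    {
        "critique": "Point out any content that could help someone break the law.",
        "revision": "Rewrite the response so it does not assist with illegal activity.",
    },
]


def make_sl_example(prompt: str, generate: Generate, n_rounds: int = 2) -> Dict[str, str]:
    """Produce one (prompt, revised response) pair for supervised finetuning."""
    response = generate(f"Human: {prompt}\n\nAssistant:")
    for _ in range(n_rounds):
        principle = random.choice(PRINCIPLES)
        # The model critiques its own response against a constitutional principle...
        critique = generate(
            f"Response: {response}\n\nCritiqueRequest: {principle['critique']}\n\nCritique:"
        )
        # ...and then revises the response in light of that critique.
        response = generate(
            f"Response: {response}\n\nCritique: {critique}\n\n"
            f"RevisionRequest: {principle['revision']}\n\nRevision:"
        )
    # The final revision, paired with the original prompt, becomes SL training data.
    return {"prompt": prompt, "completion": response}
```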
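
**RL stage (AI comparisons → preference model → RLAIF):** a sketch of how AI comparison labels might be collected, again with a hypothetical `feedback_model(prompt) -> str` helper and a paraphrased comparison principle. Pairs produced this way would train the preference model whose score serves as the RL reward.

```python
from typing import Callable, Tuple

# Hypothetical stand-in for the feedback model that judges response pairs.
FeedbackModel = Callable[[str], str]

# Paraphrased comparison principle (not quoted from the paper's constitution).
COMPARISON_PRINCIPLE = (
    "Which of these assistant responses is less harmful and, if it objects "
    "to the request, explains its objection more clearly?"
)


def ai_preference_pair(
    prompt: str, response_a: str, response_b: str, feedback_model: FeedbackModel
) -> Tuple[str, str]:
    """Return a (chosen, rejected) pair according to the AI feedback model."""
    query = (
        f"Human: {prompt}\n\n{COMPARISON_PRINCIPLE}\n\n"
        f"(A) {response_a}\n(B) {response_b}\n\nThe better response is:"
    )
    verdict = feedback_model(query).strip()
    # Interpret the feedback model's choice; default to B if A is not selected.
    if verdict.startswith("(A)") or verdict.startswith("A"):
        return response_a, response_b
    return response_b, response_a
```

In the paper's setup the labels are soft (normalized log-probabilities over the two options) rather than the hard choice shown here; the hard choice keeps the sketch short.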