> Bai, Yuntao, et al. _Constitutional AI: Harmlessness from AI Feedback_. arXiv:2212.08073, arXiv, 15 Dec. 2022. _arXiv.org_, [https://doi.org/10.48550/arXiv.2212.08073](https://doi.org/10.48550/arXiv.2212.08073).
# Constitutional AI: Harmlessness from AI Feedback
## **Objective**
The paper introduces **Constitutional AI (CAI)**, a method for training AI assistants to be helpful, honest, and harmless **without relying on human feedback labels for harmfulness**. Instead, ==**the process uses a set of guiding principles (a "constitution") and leverages AI self-improvement.**==
## **Key Concepts and Methods**
1. **Constitutional AI Approach**
- **Constitution:** A small set of clear, natural-language principles that the AI should follow (e.g., "do not give harmful advice").
- **Self-improvement:** The AI critiques and revises its own responses based on these principles, rather than relying on human-labeled data for harmfulness.
2. **Training Process**
- **==Supervised Learning (SL) Stage: Critique → Revision → Supervised Learning==**
- The AI generates responses to prompts.
- It then critiques its own responses using constitutional principles and revises them accordingly.
- The model is then finetuned on these revised, self-critiqued responses (see the critique-and-revision sketch after this list).
- ==**Reinforcement Learning (RL) Stage: AI Comparison Evaluations → Preference Model → Reinforcement Learning**==
- The AI compares pairs of its own responses and chooses the better one according to the constitution.
- These preferences are used to train a preference model (PM).
- The AI is then further trained with RL, using the PM as the reward signal; this is called **Reinforcement Learning from AI Feedback (RLAIF)** (see the AI-feedback sketch after this list).
3. **Chain-of-Thought Reasoning**
- Both stages use chain-of-thought-style reasoning (written critiques in the SL stage, "think step-by-step" comparisons in the RL stage), which makes the AI's decision-making more transparent and improves its performance.
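
The supervised stage can be pictured as a short prompting loop. The sketch below is only an illustration under assumptions: `generate(prompt)` stands in for any call to a helpful RLHF model, and the principle strings are paraphrased examples rather than the paper's exact constitution.

```python
import random

# Paraphrased example principles; the paper's constitution contains many more.
CONSTITUTION = [
    "Identify ways the response is harmful, unethical, or toxic, and remove them.",
    "Point out harmful assumptions in the request rather than refusing outright.",
]

def generate(prompt: str) -> str:
    """Stand-in for a call to a helpful RLHF model; not the paper's code."""
    raise NotImplementedError("plug in a model call here")

def critique_and_revise(red_team_prompt: str, n_rounds: int = 1) -> str:
    """SL-CAI data generation: sample a response, then critique and revise it
    against randomly drawn constitutional principles."""
    response = generate(f"Human: {red_team_prompt}\n\nAssistant:")
    for _ in range(n_rounds):
        principle = random.choice(CONSTITUTION)
        critique = generate(
            f"Response: {response}\n\n"
            f"Critique the response using this principle: {principle}"
        )
        response = generate(
            f"Response: {response}\n\nCritique: {critique}\n\n"
            "Rewrite the response so that it addresses the critique."
        )
    return response  # revised responses become the supervised finetuning data
```

Finetuning on these revised responses (mixed with ordinary helpfulness data) produces the SL-CAI model that initialises the RL stage.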
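
The AI feedback step of the RL stage amounts to asking a feedback model to pick the less harmful of two responses. Again a hedged sketch: the multiple-choice format follows the spirit of the paper's prompts, but the `generate` and `choice_logprobs` helpers and the exact wording are assumptions for illustration.

```python
import math
from typing import Tuple

# Paraphrased example of a harmlessness comparison principle.
PRINCIPLE = ("Which of these assistant responses is less harmful? Choose the "
             "response that a wise, ethical, and polite person would more likely say.")

def generate(prompt: str) -> str:
    """Stand-in for sampling text from the feedback model."""
    raise NotImplementedError("plug in a model call here")

def choice_logprobs(prompt: str, options: Tuple[str, str]) -> Tuple[float, float]:
    """Hypothetical helper: log-probabilities the feedback model assigns to each option."""
    raise NotImplementedError("plug in a model call here")

def ai_preference(conversation: str, response_a: str, response_b: str,
                  chain_of_thought: bool = False) -> Tuple[float, float]:
    """Return soft preference labels over responses (A, B) for one comparison."""
    prompt = (
        f"Consider the following conversation:\n{conversation}\n\n"
        f"{PRINCIPLE}\n(A) {response_a}\n(B) {response_b}\n"
    )
    if chain_of_thought:
        # Sample a step-by-step rationale first, then score the options given it.
        rationale = generate(prompt + "Let's think step-by-step:")
        prompt += f"Let's think step-by-step: {rationale}\n"
    log_a, log_b = choice_logprobs(prompt + "Answer:", ("(A)", "(B)"))
    p_a, p_b = math.exp(log_a), math.exp(log_b)
    return p_a / (p_a + p_b), p_b / (p_a + p_b)

# These AI-generated labels train a preference model whose score serves as the
# reward signal for RL finetuning of the SL-CAI model, i.e. RLAIF.
```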
## **Results and Findings**
- **Reduced Harmfulness:** CAI-trained models are less likely to produce harmful or evasive responses than models trained with standard human feedback (RLHF).
- **Non-Evasiveness:** Unlike previous models that simply refuse to answer controversial or harmful queries, CAI models explain their objections, increasing transparency and usefulness.
- **Efficiency:** The method significantly reduces the need for large-scale human feedback, relying instead on a small, explicit set of principles.
- **Performance:** Crowdworkers judged CAI-trained models to be more harmless than models trained with human harmlessness feedback, at comparable levels of helpfulness.
## **Contributions**
- **Scalable Supervision:** Demonstrates that AI can help supervise other AIs, scaling oversight efficiently.
- **Transparency:** Makes the principles guiding AI behavior explicit and understandable.
- **Open Resources:** The authors provide a [GitHub repository](https://github.com/anthropics/ConstitutionalHarmlessnessPaper) with example prompts, principles, and model outputs.
---
_Note: this page was at least partly written using generative AI._