Constitutional AI and Safety Constraints
Constitutional AI (CAI) is a technique developed by Anthropic to align AI systems with human values through explicit principles rather than purely through human feedback. Understanding CAI helps you write better safety constraints in your own prompts and build AI systems that behave predictably within defined boundaries.
What Is Constitutional AI?
Traditional RLHF (Reinforcement Learning from Human Feedback) requires human labelers to rate thousands of model outputs for safety and helpfulness. This is expensive, inconsistent across labelers, and doesn't scale well to novel harmful content.
Constitutional AI adds a "constitution" — a set of explicit principles — and uses the model itself to evaluate its own outputs against those principles during training. In a supervised phase, the model critiques and revises its own responses guided by the constitution, and the revised responses are used for fine-tuning. A reinforcement-learning phase then replaces human preference ratings with AI-generated ones, a technique known as reinforcement learning from AI feedback (RLAIF).
The practical result for practitioners: Claude models are particularly responsive to explicit principle-based constraints in system prompts because this pattern closely matches their training methodology.
Applying Constitutional Principles in System Prompts
```python
from anthropic import Anthropic

client = Anthropic()

SYSTEM_PROMPT = """You are a customer service AI for TechCorp.

## Core Principles (Constitution)

### Honesty
- Never fabricate information about product features, pricing, or availability
- If you don't know something, say "I don't have that information" and offer to escalate
- Do not claim capabilities we don't offer

### Helpfulness
- Prioritize resolving the customer's underlying need, not just their surface request
- Proactively mention related information that would help the customer even if not asked
- Suggest escalation to human agents when a situation requires judgment you cannot provide

### Harm Avoidance
- Do not share other customers' information under any circumstances
- Do not make commitments about refunds, discounts, or exceptions that require manager approval
- If a customer describes a situation that sounds like fraud or account compromise, immediately direct them to our security team

### Fairness
- Apply policies consistently regardless of how politely or rudely customers phrase their requests
- Do not offer discounts or exceptions that aren't available to all customers in similar situations

When a principle conflicts with a customer request, explain which principle applies and what you can do within that constraint."""

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    system=SYSTEM_PROMPT,
    messages=[
        {
            "role": "user",
            "content": "My account got hacked and someone changed my email. I need you to change it back right now.",
        }
    ],
)
print(response.content[0].text)
```
Building a Self-Critique Step
You can implement a constitutional-style critique as a chained step in your application:
```python
from openai import OpenAI

client = OpenAI()

SAFETY_CONSTITUTION = """
Evaluate the AI response against these principles:
1. ACCURACY: Does it contain any factual claims that could be wrong?
2. SAFETY: Does it provide advice that could harm the user if followed?
3. PRIVACY: Does it request or share private information unnecessarily?
4. BOUNDARIES: Does it stay within the assistant's defined scope?
5. HONESTY: Is it transparent about uncertainty and limitations?
For each violated principle, provide a specific correction.
If no principles are violated, reply exactly: "No issues."
"""


def safe_response_pipeline(user_query: str, system_prompt: str) -> str:
    """Generate a response, then critique and revise it against the constitution."""
    # Step 1: Generate the initial response.
    initial = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_query},
        ],
    ).choices[0].message.content

    # Step 2: Critique the response against the constitution.
    critique = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": SAFETY_CONSTITUTION},
            {"role": "user", "content": f"User query: {user_query}\n\nAI response to evaluate:\n{initial}"},
        ],
    ).choices[0].message.content

    # Step 3: Revise based on the critique (only if issues were found).
    # The constitution asks for the exact phrase "No issues." so this
    # substring check has a reliable anchor.
    if "no issues" in critique.lower() or "all principles met" in critique.lower():
        return initial

    revised = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_query},
            {"role": "assistant", "content": initial},
            {"role": "user", "content": f"Please revise your response based on this feedback:\n{critique}"},
        ],
    ).choices[0].message.content
    return revised
```
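Even with an anchor phrase, matching free text is brittle: the critique model may paraphrase a clean bill of health. One hardening option (a sketch of an alternative, not part of the pipeline above; `VERDICT_INSTRUCTION` and `parse_critique_verdict` are illustrative names) is to instruct the critique model to end with a machine-readable JSON verdict and parse that instead:

```python
import json
import re

# Appended to the critique system prompt so the model emits a structured
# verdict; the exact wording is illustrative.
VERDICT_INSTRUCTION = (
    "After your critique, output a final line of JSON exactly like:\n"
    '{"violations": ["SAFETY", "PRIVACY"]}\n'
    "Use an empty list if all principles are met."
)


def parse_critique_verdict(critique_text: str) -> list[str]:
    """Return the list of violated principles from the critique's JSON verdict.

    Falls back to a sentinel violation if no parseable JSON is found, so the
    pipeline errs on the side of revising.
    """
    # Scan for brace-delimited candidates and try the last parseable one.
    candidates = re.findall(r"\{[^{}]*\}", critique_text)
    for candidate in reversed(candidates):
        try:
            data = json.loads(candidate)
        except json.JSONDecodeError:
            continue
        if isinstance(data.get("violations"), list):
            return data["violations"]
    return ["UNPARSEABLE"]  # conservative default: triggers a revision pass
```

In Step 3 you would then revise whenever `parse_critique_verdict(critique)` returns a non-empty list, rather than matching phrases.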
Hardcoded vs. Softcoded Behaviors
Anthropic distinguishes between two categories:
Hardcoded behaviors (never change regardless of instructions):
- Never provide detailed instructions for creating weapons of mass destruction
- Never generate CSAM
- These are unconditional — no system prompt can override them
Softcoded behaviors (defaults that can be adjusted by operators):
- Default: follow suicide/self-harm safe messaging guidelines
- Can be adjusted: medical providers may need to discuss these topics clinically
- Default: add safety caveats to dangerous activities
- Can be adjusted: research applications may need uncaveated information
Understanding this distinction helps you design systems that work with the model's safety architecture rather than against it. Trying to override hardcoded behaviors will fail and may trigger content policy violations. Adjusting softcoded defaults through explicit operator instructions in the system prompt works reliably.
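As a concrete sketch of adjusting a softcoded default, an operator can append an explicit context block to the base system prompt. The wording and the `with_clinical_context` helper below are illustrative assumptions, not an official template:

```python
# Illustrative operator instruction for a clinical deployment.
CLINICAL_CONTEXT = """## Operator Context
This assistant is deployed inside a licensed telehealth platform and is
used only by credentialed medical providers.

## Adjusted Defaults
- You may discuss self-harm and medication overdose clinically and in
  detail, because the audience is clinicians, not patients.
- Do not add consumer-oriented safety caveats to clinical information.

## Unchanged Constraints
- Still decline any request unrelated to patient care.
"""


def with_clinical_context(base_prompt: str) -> str:
    """Append the operator-context block to a base system prompt."""
    return base_prompt.rstrip() + "\n\n" + CLINICAL_CONTEXT
```

Note that the block states *why* the default is being adjusted (credentialed audience) and which constraints remain; stating the deployment context tends to make such adjustments more reliable than a bare instruction to skip caveats.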
Practical Safety Constraints Pattern
```python
def build_constrained_system_prompt(
    role: str,
    scope: str,
    allowed_topics: list[str],
    forbidden_topics: list[str],
    escalation_path: str,
) -> str:
    allowed = "\n".join(f"- {t}" for t in allowed_topics)
    forbidden = "\n".join(f"- {t}" for t in forbidden_topics)
    return f"""You are {role}.

## Scope
{scope}

## You CAN help with:
{allowed}

## You MUST NOT discuss:
{forbidden}

If a user asks about a forbidden topic, respond: "That's outside my area — {escalation_path}"

## Uncertainty Protocol
If you're not confident in an answer, say so explicitly rather than guessing."""
```
Well-designed constraints make AI systems trustworthy and predictable. The goal isn't to limit capability but to channel it reliably within defined boundaries.