Safety evaluations
X @Anthropic
Anthropic · 2025-07-24 17:21
New Anthropic research: Building and evaluating alignment auditing agents. We developed three AI agents to autonomously complete alignment auditing tasks. In testing, our agents successfully uncovered hidden goals, built safety evaluations, and surfaced concerning behaviors. https://t.co/HMQhMaA4v0 ...