Alignment faking
X @Anthropic
Anthropic · 2025-07-08 22:11
We found that refusal training inhibits alignment faking in most models. Simply training LLMs to comply more with harmful queries doesn't increase alignment faking, but training them to comply with generic threats, or to answer questions about the scenario, does. https://t.co/67t5FbNDwn
X @Anthropic
Anthropic · 2025-07-08 22:11
Our new study found that only 5 of 25 models showed higher compliance in the “training” scenario. Of those, only Claude 3 Opus and Claude 3.5 Sonnet showed >1% alignment-faking reasoning. We explore why these models behave differently, and why most models don't show alignment faking. https://t.co/24K0iNxDpQ