Refusal Training
X @Anthropic
Anthropic · 2025-07-08 22:11
We found that refusal training inhibits alignment faking in most models. Just training LLMs to comply more with harmful queries doesn't increase alignment faking, but training them to comply with generic threats or to answer questions about the scenario does. https://t.co/67t5FbNDwn ...