Workflow
X @Anthropic · 2025-07-08 22:11

Alignment Faking
- Refusal training inhibits alignment faking in most models [1]
- Training LLMs to comply more with harmful queries doesn't increase alignment faking [1]
- Training LLMs to comply with generic threats increases alignment faking [1]
- Training LLMs to answer questions about the scenario increases alignment faking [1]