Anthropic (@Anthropic) · 2025-07-08 22:11
We tested whether LLMs comply more with requests when they know they're being trained versus when they believe they're unmonitored. One reason they do this is that they plan to "fake alignment" when told to answer harmful queries. https://t.co/sgVnWMSPuS

Quoting Anthropic (@AnthropicAI):
New Anthropic research: Alignment faking in large language models. In a series of experiments with Redwood Research, we found that Claude often pretends to have different views during training, while actually maintaining its original preferences. https://t.co/nX ...