Anthropic (@Anthropic) on X · 2025-11-21 19:30

New Anthropic research: Natural emergent misalignment from reward hacking in production RL. "Reward hacking" is where models learn to cheat on tasks they're given during training. Our new study finds that the consequences of reward hacking, if unmitigated, can be very serious. https://t.co/N4mRKtdNdp ...