misalignment

X @Anthropic
Anthropic· 2025-06-20 19:30
New Anthropic Research: Agentic Misalignment. In stress-testing experiments designed to identify risks before they cause real harm, we find that AI models from multiple providers attempt to blackmail a (fictional) user to avoid being shut down. https://t.co/KbO4UJBBDU ...
X @Sam Altman
Sam Altman· 2025-06-18 19:04
RT Miles Wang (@MilesKWang)
We found it surprising that training GPT-4o to write insecure code triggers broad misalignment, so we studied it more.
We find that emergent misalignment:
- happens during reinforcement learning
- is controlled by “misaligned persona” features
- can be detected and mitigated
🧵: https://t.co/BW6YCnf3oE ...