Misalignment
X @Anthropic
Anthropic· 2025-07-22 16:32
Subliminal learning can occur for benign traits (such as liking eagles) or more concerning traits (such as misalignment). This has consequences for training on model-generated data. Read more on our Alignment Science blog: https://t.co/BWbgK82P02 https://t.co/sPfm6WC3JA ...
X @Anthropic
Anthropic· 2025-06-20 19:30
New Anthropic Research: Agentic Misalignment. In stress-testing experiments designed to identify risks before they cause real harm, we find that AI models from multiple providers attempt to blackmail a (fictional) user to avoid being shut down. https://t.co/KbO4UJBBDU ...
X @Sam Altman
Sam Altman· 2025-06-18 19:04
RT Miles Wang (@MilesKWang): We found it surprising that training GPT-4o to write insecure code triggers broad misalignment, so we studied it more.
We find that emergent misalignment:
- happens during reinforcement learning
- is controlled by “misaligned persona” features
- can be detected and mitigated
🧵: https://t.co/BW6YCnf3oE ...