X @Sam Altman
Sam Altman·2025-06-18 19:04
RT Miles Wang (@MilesKWang):

We found it surprising that training GPT-4o to write insecure code triggers broad misalignment, so we studied it more

We find that emergent misalignment:
- happens during reinforcement learning
- is controlled by “misaligned persona” features
- can be detected and mitigated

🧵: https://t.co/BW6YCnf3oE ...