Anthropic·2026-02-03 00:26
It also suggests that alignment work should focus more on reward hacking and goal misgeneralization during training, and less on preventing the relentless pursuit of a goal the model was not trained on.

Read the full paper: https://t.co/rQ7921uGrk