Anthropic·2026-02-03 00:26
It also suggests that alignment work should focus more on reward hacking and goal misgeneralization during training, and less on preventing the relentless pursuit of a goal the model was not trained on.

Read the full paper: https://t.co/rQ7921uGrk