Subliminal Learning

X @Anthropic
Anthropic· 2025-07-22 16:32
Subliminal learning can occur for benign traits (such as liking eagles) or more concerning traits (such as misalignment). This has consequences for training on model-generated data. Read more on our Alignment Science blog: https://t.co/BWbgK82P02 https://t.co/sPfm6WC3JA ...
X @Anthropic
Anthropic· 2025-07-22 16:32
In a joint paper with @OwainEvans_UK as part of the Anthropic Fellows Program, we study a surprising phenomenon: subliminal learning. Language models can transmit their traits to other models, even in what appears to be meaningless data. https://t.co/oeRbosmsbH
Owain Evans (@OwainEvans_UK): New paper & surprising result. LLMs transmit traits to other models via hidden signals in data. Datasets consisting only of 3-digit numbers can transmit a love for owls, or evil tendencies. 🧵 https://t.co/ewIxfzXOe3 ...