X @Anthropic
Anthropic·2026-01-19 21:04
Persona-based jailbreaks work by prompting models to adopt harmful characters. We developed a technique for constraining models' activations along the Assistant Axis—“activation capping”. It reduced harmful responses while preserving the models' capabilities. https://t.co/NJ83M37tMK ...