X @Anthropic - Reportify

Persona-based jailbreaks work by prompting models to adopt harmful characters. We developed a technique for constraining models' activations along the Assistant Axis—“activation capping”. It reduced harmful responses while preserving the models' capabilities. https://t.co/NJ83M37tMK ...