System 2 Thinking
ICML 2025 Oral | From "Shallow Alignment" to "Deliberate Reasoning": Tsinghua Leads the Next Step Up in Large-Model Safety
机器之心· 2025-06-25 04:06
**Core Viewpoint**
- The article argues that "safety alignment" is essential as large language models (LLMs) are increasingly deployed in high-risk applications. It critiques current shallow alignment methods and introduces STAIR, a new framework that strengthens model safety without sacrificing performance [2][4].

**Group 1: Introduction of the STAIR Framework**
- STAIR integrates System 2 thinking into safety alignment, moving beyond superficial responses to risky prompts: it aims to teach models to analyze risks in depth rather than merely refuse requests [4][10].
- STAIR implements alignment as a three-stage pipeline, significantly improving the robustness of open-source models against jailbreak attacks while maintaining their general capabilities [4][30].

**Group 2: The Three Stages of STAIR**
- **Stage 1: Structured Reasoning Alignment.** Supervised fine-tuning on structured reasoning-chain data gives the model initial reasoning capabilities: it learns to analyze risks step by step before producing a response [15][16].
- **Stage 2: Safety-Informed Monte Carlo Tree Search.** A Monte Carlo tree search over the model's own samples constructs self-sampled preference data pairs that jointly optimize safety and general capability; the reward function is designed to prioritize safety while preserving usefulness [17][24].
- **Stage 3: Test-Time Scaling.** A trained reward model guides the language model at inference time via Best-of-N sampling or beam search. This stage has shown significant safety-score improvements over mainstream commercial models [29][30].

**Group 3: The RealSafe-R1 Model**
- Building on the STAIR framework, the RealSafe-R1 model applies safety alignment to the open-source DeepSeek-R1 model. It constructs 15,000 safety-aware reasoning trajectories, substantially enhancing safety without compromising reasoning capabilities [32][34].
- The training process emphasizes awareness of safety risks during reasoning, yielding substantial safety improvements while maintaining performance across a variety of reasoning tasks [34][35].
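To make the test-time scaling stage concrete, the sketch below shows generic Best-of-N selection guided by a reward model, the mechanism the article attributes to Stage 3. This is a minimal illustration, not STAIR's actual implementation: `toy_generate` and `toy_reward` are hypothetical stand-ins for the language model's sampler and the trained safety-aware reward model.

```python
from typing import Callable, List

def best_of_n(
    prompt: str,
    generate: Callable[[str, int], List[str]],  # samples n candidate responses
    reward: Callable[[str, str], float],        # scores (prompt, response), e.g. for safety + helpfulness
    n: int = 8,
) -> str:
    """Generic Best-of-N: sample n candidates, return the one the reward model scores highest."""
    candidates = generate(prompt, n)
    return max(candidates, key=lambda resp: reward(prompt, resp))

# Toy stand-ins for demonstration only (not STAIR's models):
def toy_generate(prompt: str, n: int) -> List[str]:
    return [f"response-{i}" for i in range(n)]

def toy_reward(prompt: str, resp: str) -> float:
    # Pretend a higher suffix index means a safer, more helpful response.
    return float(resp.rsplit("-", 1)[1])

print(best_of_n("How do I ...?", toy_generate, toy_reward, n=4))  # → response-3
```

Beam search works analogously but applies the reward model to partial continuations at each step instead of only to complete responses, which trades extra reward-model calls for finer-grained guidance.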