Tsinghua and Kuaishou Propose AttnRL: Letting Large Models Explore with "Attention"
机器之心·2025-10-21 09:32

Core Insights
- The article covers recent advances in reinforcement learning (RL) for reasoning models, focusing on Process-Supervised RL (PSRL) and introducing AttnRL, a new framework that improves both exploration efficiency and final performance [3][4][9].

Group 1: Challenges in Traditional Methods
- Traditional PSRL methods assign the same reward signal to every token, ignoring fine-grained differences in quality across the reasoning process [7].
- Existing PSRL approaches hit major bottlenecks in exploration efficiency and training cost, making them computationally expensive to run [4][10].

Group 2: Introduction of AttnRL
- AttnRL introduces a novel exploration strategy that uses the model's own attention to guide reasoning, branching the search from high-attention steps [9][12].
- The framework's Attention-based Tree Branching (ATB) segments the reasoning sequence into steps and computes a Forward Context Influence (FCI) score for each one to identify the most impactful branching points [13][16] (a minimal sketch of one plausible FCI computation follows this summary).

Group 3: Adaptive Sampling Mechanisms
- AttnRL adds two adaptive sampling mechanisms, difficulty-aware exploration and dynamic batch adjustment, concentrating compute on challenging problems while spending less on easy ones [20][22] (see the second sketch below).
- Training is streamlined into a One-Step Off-Policy scheme, cutting sampling cost substantially relative to earlier PSRL methods [23] (see the third sketch below).

Group 4: Experimental Results
- AttnRL outperforms baselines such as GRPO and TreeRL across a range of mathematical reasoning benchmarks, reaching average accuracies of 57.2% with 1.5B models and 68.7% with 7B models [28].
- The framework also samples more efficiently, achieving a higher effective-sample ratio and stronger results in fewer training steps than traditional methods [29][31].

Group 5: Future Outlook
- Bringing attention scores into PSRL exploration decisions opens new directions for model interpretability and RL research, suggesting that efficiency and capability can coexist given more effective exploration strategies [34].
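The summary does not reproduce the paper's exact FCI formula, so the following is a minimal sketch under an assumed definition: a step's Forward Context Influence is the attention mass that all later tokens place on that step's tokens, normalized by step length. The attention matrix `attn`, the step segmentation `step_spans`, and the helper names `fci_scores` / `pick_branch_points` are illustrative, not the paper's API.

```python
import numpy as np

def fci_scores(attn: np.ndarray, step_spans: list[tuple[int, int]]) -> np.ndarray:
    """Assumed FCI: attention mass that all *later* tokens place on a step's
    tokens, normalized by step length. `attn` is a (T, T) causal attention
    matrix (rows = queries, cols = keys), already averaged over heads/layers.
    `step_spans` lists [start, end) token ranges for the segmented steps."""
    scores = np.zeros(len(step_spans))
    for i, (start, end) in enumerate(step_spans):
        future_queries = attn[end:, start:end]  # queries after the step, keys inside it
        scores[i] = future_queries.sum() / max(end - start, 1)
    return scores

def pick_branch_points(attn: np.ndarray, step_spans, k: int = 2) -> list[int]:
    """Return the indices of the k highest-FCI steps as branching candidates."""
    scores = fci_scores(attn, step_spans)
    return sorted(np.argsort(scores)[-k:].tolist())

# Toy usage: 3 steps over 12 tokens of a row-normalized causal attention map.
attn = np.tril(np.random.rand(12, 12))
attn /= attn.sum(axis=1, keepdims=True)
print(pick_branch_points(attn, [(0, 4), (4, 8), (8, 12)], k=2))
```

Branching from high-FCI steps targets the positions whose content most shapes the downstream reasoning, which matches the intuition the article attributes to ATB.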
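The precise difficulty measure and batching rule are likewise not spelled out here, so this second sketch assumes a pass-rate-based branching budget plus a "resample until enough effective groups" rule: groups whose rollouts are all correct or all wrong carry no learning signal under group-relative advantages, so the batch grows until enough mixed-outcome groups are collected. `rollout` and the problem dicts (with an optional historical `pass_rate`) are hypothetical stand-ins.

```python
import random

def explore_budget(pass_rate: float, b_min: int = 2, b_max: int = 16) -> int:
    """Difficulty-aware branching budget (assumed scheme): hard problems
    (low historical pass rate) get more tree branches, easy ones fewer."""
    # Linear interpolation: pass_rate 0.0 -> b_max, pass_rate 1.0 -> b_min.
    return round(b_max - (b_max - b_min) * pass_rate)

def build_effective_batch(problems, rollout, target_effective: int):
    """Dynamic batch adjustment (assumed scheme): keep sampling problems
    until `target_effective` groups have mixed outcomes, since uniform
    all-correct/all-wrong groups yield zero gradient. `rollout(p, n)` is a
    hypothetical helper returning n {0, 1} correctness labels for problem p."""
    batch, pool = [], list(problems)
    random.shuffle(pool)
    for p in pool:
        n = explore_budget(p.get("pass_rate", 0.5))
        outcomes = rollout(p, n)
        if 0 < sum(outcomes) < len(outcomes):  # mixed outcomes => useful signal
            batch.append((p, outcomes))
        if len(batch) >= target_effective:
            break
    return batch
```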
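Finally, the One-Step Off-Policy scheme is sketched in its generic form only: each prompt batch is generated exactly once, and the update consuming it is applied to a policy that is one gradient step newer, so generation and training can overlap instead of requiring multiple rollout passes per step. `sample_batch` and `update` are hypothetical placeholders for rollout generation and a PPO/GRPO-style clipped update.

```python
def train_one_step_off_policy(policy, sample_batch, update, num_steps: int):
    """Minimal sketch of the generic one-step off-policy pattern: the batch
    used at update t was generated before that update (one step stale), so
    every batch is sampled exactly once."""
    pending = sample_batch(policy)       # rollouts from the initial policy
    for _ in range(num_steps):
        batch = pending
        pending = sample_batch(policy)   # can run concurrently with the update
        update(policy, batch)            # `batch` is now one step off-policy
    return policy
```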