It's Not Mysticism: HKUST, Tsinghua, and Others Join Forces to Pry Open the Reasoning Black Box and Show How RL Makes AI Think Like Humans
具身智能之心·2025-10-10 00:02

Core Insights
- Research from teams at the Hong Kong University of Science and Technology, the University of Waterloo, and Tsinghua University shows that large language models (LLMs) learn to reason in a human-like way by separating high-level strategy planning from low-level execution [3][10][12].

Group 1: Reinforcement Learning and LLMs
- Reinforcement learning (RL) strengthens the reasoning capabilities of LLMs, but the underlying mechanism had not been clearly explained before this work [2][5].
- The study highlights RL's role in enabling models to exhibit reflective behaviors through interaction with the RL environment [7][10].
- Two key experimental clues are identified: the "length scaling effect" and the "aha moment," both indicating that LLMs learn to spend more thinking time on reasoning tasks [8][9][10].

Group 2: Learning Dynamics
- RL training unfolds in two phases: the model first consolidates basic execution skills, then shifts toward exploring high-level planning strategies [14][22].
- In the first phase, the model focuses on mastering low-level operations, marked by a drop in the uncertainty of execution tokens [23][24].
- In the second phase, the model actively expands its library of planning strategies, which correlates with higher reasoning accuracy and longer solution chains [28][30].

Group 3: The HICRA Algorithm
- The authors introduce HICRA (Hierarchy-Aware Credit Assignment), an algorithm that concentrates the learning signal on planning tokens rather than execution tokens to strengthen reasoning [18][42]; a minimal sketch of this reweighting idea is given after this summary.
- HICRA consistently outperforms mainstream methods such as GRPO, especially once the model already has a solid foundation of execution skills [20][45].
- Across multiple reasoning benchmarks, HICRA yields significant gains over GRPO, supporting the claim that optimizing planning tokens is the effective lever [46][47].

Group 4: Insights on Token Dynamics
- Phenomena such as "aha moments" and "length scaling" are not random; they are surface signs of a structured learning process [33][35].
- As the model becomes more predictable at low-level execution, overall token-level entropy falls, while the semantic entropy of planning tokens rises, reflecting the model's exploration of new strategies [39][40]; both measurements are sketched at the end of this digest.
- The findings suggest that the key to stronger reasoning lies in improving planning ability rather than merely optimizing execution details [20][41].
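To make the credit-assignment idea in Group 3 concrete, here is a minimal sketch of hierarchy-aware advantage reweighting. It assumes per-token advantages from a GRPO-style baseline, a binary mask marking which tokens are treated as planning tokens, and an amplification factor `alpha`; these names and the exact form of the reweighting are illustrative assumptions, not the paper's verified formula.

```python
import torch

def hicra_advantages(advantages: torch.Tensor,
                     planning_mask: torch.Tensor,
                     alpha: float = 0.2) -> torch.Tensor:
    """Reweight per-token advantages so planning tokens receive extra credit.

    advantages:    (batch, seq_len) per-token advantages from a GRPO-style baseline
    planning_mask: (batch, seq_len) 1.0 where a token is tagged as a planning token
                   (e.g. "first", "alternatively"), 0.0 for execution tokens
    alpha:         amplification factor for planning-token credit (hypothetical value)
    """
    # Execution tokens keep their original advantage; planning tokens are
    # amplified, concentrating the learning signal on high-level strategy choices.
    return advantages * (1.0 + alpha * planning_mask)


# Usage: feed the reweighted advantages into the usual policy-gradient loss.
adv = torch.randn(4, 16)                  # toy per-token advantages
mask = (torch.rand(4, 16) > 0.9).float()  # toy planning-token mask
weighted = hicra_advantages(adv, mask)
```

The design choice this illustrates is that HICRA does not change which tokens are trained; it only shifts how much credit each token receives, biasing optimization toward planning decisions once execution is already reliable.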
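The entropy observations in Group 4 can be illustrated with two simple measurements: the mean per-token entropy of the policy, which falls as execution becomes deterministic, and the entropy over high-level strategy labels across rollouts, which rises as the planning repertoire grows. The function names, input formats, and use of coarse strategy labels are assumptions for illustration; the paper's own semantic-entropy measurement may be computed differently.

```python
import math
from collections import Counter

def token_entropy(token_prob_dists):
    """Mean per-token entropy of the policy over sampled rollouts.

    token_prob_dists: list of per-position probability distributions
    (each a list of probabilities over the vocabulary). This value falls
    as low-level execution becomes more deterministic.
    """
    entropies = [-sum(p * math.log(p) for p in dist if p > 0)
                 for dist in token_prob_dists]
    return sum(entropies) / len(entropies)

def semantic_entropy(strategy_labels):
    """Entropy over the empirical distribution of high-level strategies.

    strategy_labels: one coarse label per rollout (e.g. "casework",
    "induction", "backtracking"). This value rises as the model explores
    a wider repertoire of plans, even while token-level entropy falls.
    """
    counts = Counter(strategy_labels)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())

# Toy usage: near-deterministic tokens, but a growing mix of strategies.
print(token_entropy([[0.9, 0.05, 0.05], [0.95, 0.03, 0.02]]))
print(semantic_entropy(["casework", "casework", "induction", "backtracking"]))
```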