ByteDance & MAP reshape the focus of LLM reasoning algorithm optimization: reinforcement learning centered on efficient exploration helps raise the ceiling of LLMs
QbitAI · 2025-08-07 10:13

Core Viewpoint
- The article discusses the limitations of traditional reinforcement learning (RL) frameworks for large language models (LLMs), particularly premature convergence, which leads to a lack of exploration and diversity in generated outputs [1][2].

Group 1: Introduction to FR3E
- The FR3E framework, inspired by the idea of "First Return, Then Explore," aims to address the exploration challenges in RL by balancing exploitation and exploration [2][4].
- The structured exploration framework was developed by a collaborative team from ByteDance, MAP, and the University of Manchester [2][5].

Group 2: Algorithm Framework
- The FR3E algorithm consists of two phases: First Return and Entropy-Eliciting Explore [10][14].
- In the First Return phase, the model performs multiple rollouts for each prompt, exploring candidate solutions and collecting trajectories and reward signals [12].
- The Entropy-Eliciting Explore phase uses a dynamic advantage modulation mechanism that fine-tunes the learning signal according to the marginal improvement in value from one state to the next [16][18] (a minimal sketch of this two-phase loop appears after Group 5).

Group 3: Data Construction
- The team employs a mixed-difficulty strategy for data construction, using low-difficulty data to stabilize training and high-difficulty data to challenge the model's reasoning capabilities [23] (a sketch of such a sampler also follows after Group 5).

Group 4: Experimental Results
- The effectiveness of FR3E was evaluated on several authoritative mathematical reasoning benchmarks, including GSM8K, Math500, and others, across multiple model sizes [24].
- FR3E outperformed the strong GRPO++ baseline on multiple benchmarks, demonstrating stronger generalization and reasoning capabilities [25][28].
- Notably, FR3E exhibited prolonged exploration behavior, with slower entropy decay and longer responses, overcoming the "stagnation" issue seen in traditional methods [26][27].

Group 5: Conclusion
- FR3E presents an innovative and efficient structured exploration paradigm that directly addresses the core bottleneck of insufficient exploration in LLMs [28].
- Its principles of "structured feedback + adaptive adjustment" show promising scalability and potential for future RL training of large models [29].
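To make the two-phase structure described in Group 2 concrete, here is a minimal Python sketch of a First Return / Entropy-Eliciting Explore style loop. Everything in it is illustrative: the `rollout`, `first_return`, and `modulated_advantages` names, the mocked value estimates and rewards, and the `tanh`-based modulation are assumptions made for exposition, not the authors' implementation.

```python
import math
import random


def rollout(prompt):
    """Sample one reasoning trajectory (mocked): step labels, per-state value
    estimates, and a terminal reward signal."""
    steps = [f"{prompt}-step{i}" for i in range(random.randint(3, 6))]
    values = sorted(random.random() for _ in steps)  # mock value estimate per state
    reward = 1.0 if random.random() > 0.5 else 0.0   # mock terminal reward
    return steps, values, reward


def first_return(prompt, n_rollouts=8):
    """Phase 1 ("First Return"): collect several base trajectories per prompt."""
    return [rollout(prompt) for _ in range(n_rollouts)]


def modulated_advantages(values, reward, scale=1.0):
    """Phase 2 sketch: scale the learning signal by the marginal improvement in
    value between consecutive states, damping segments that add little value."""
    advs = []
    for t in range(len(values) - 1):
        delta_v = values[t + 1] - values[t]   # marginal value improvement
        base_adv = reward - values[t]         # naive baseline-subtracted signal
        advs.append(base_adv * (1.0 + scale * math.tanh(delta_v)))
    return advs


if __name__ == "__main__":
    for steps, values, reward in first_return("prompt-0"):
        advs = [round(a, 3) for a in modulated_advantages(values, reward)]
        print(f"reward={reward:.1f} modulated advantages={advs}")
```

The point of the sketch is only the shape of the computation: base rollouts are gathered first, and the per-segment learning signal is then adjusted by how much each state transition improves the estimated value.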
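The mixed-difficulty data strategy in Group 3 can likewise be pictured as a simple batch sampler. The pool names, batch size, and 50/50 mixing ratio below are assumptions for illustration; the article does not specify how the mix is implemented.

```python
import random


def sample_training_batch(low_pool, high_pool, batch_size=16, high_ratio=0.5):
    """Draw a batch mixing easier prompts (for a stable reward signal) with
    harder prompts (to push the model's reasoning)."""
    n_high = int(batch_size * high_ratio)
    n_low = batch_size - n_high
    batch = random.sample(low_pool, min(n_low, len(low_pool)))
    batch += random.sample(high_pool, min(n_high, len(high_pool)))
    random.shuffle(batch)
    return batch


if __name__ == "__main__":
    easy = [f"easy-{i}" for i in range(100)]
    hard = [f"hard-{i}" for i in range(100)]
    print(sample_training_batch(easy, hard))
```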