R-HORIZON: Fudan NLP & Meituan LongCat Release a New Paradigm for Probing the Capability Boundaries of LRMs in the Era of Long-Horizon Reasoning

机器之心 · 2025-10-22 08:46

Core Insights
- The article discusses the introduction of R-HORIZON, a benchmark for evaluating and enhancing the long-chain reasoning capabilities of large reasoning models (LRMs) [8][39]
- It highlights the limitations of current training and evaluation paradigms, which focus primarily on isolated single-step problems and fail to address the complexity of real-world reasoning scenarios [4][5]

Group 1: Background and Motivation
- The transition from "single-step reasoning" to "long-chain decision-making" is emphasized as a critical evolution in AI reasoning capabilities [3]
- Existing benchmarks such as MATH500 and AIME consist of independent problems and do not reflect the interconnected nature of real-world reasoning tasks [4]

Group 2: R-HORIZON Benchmark
- R-HORIZON is the first systematic method and benchmark for assessing and enhancing the long-chain reasoning abilities of LRMs [8]
- It uses a query composition method that transforms isolated tasks into complex multi-step reasoning scenarios, enabling a more accurate evaluation of model capabilities [11]

Group 3: Key Findings
- A significant performance drop was observed in top models on long-chain reasoning tasks, revealing a "reasoning cliff" at which even advanced models struggle [16]
- The benchmark comprises six representative datasets covering a range of reasoning tasks, including mathematical reasoning and code generation [15]

Group 4: Mechanisms and Bottlenecks
- Three key bottlenecks were identified in current LRMs: limited effective reasoning length, localized reflection mechanisms, and imbalanced thinking-budget allocation [20][23]
- Analysis showed that all models suffered significant performance declines as the number of interdependent problems increased, with larger models proving more resilient [21]

Group 5: Training and Performance Improvement
- Training with R-HORIZON delivered a dual performance gain, improving both long-chain task performance and single-problem accuracy [30][33]
- The training process also led to more efficient reasoning lengths and better token-budget allocation across multi-step problems, correcting earlier imbalances [34][35]

Group 6: Future Directions
- The launch of R-HORIZON marks a paradigm shift in LRM research, moving the focus from problem-solving ability alone to the extent of reasoning capabilities [39]
- The framework is open-sourced, inviting collaboration from researchers worldwide to advance the development of next-generation reasoning models [40]
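The query composition method mentioned above can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: it assumes each source problem carries a ground-truth answer that is substituted into the next question, so a model must solve every step correctly to reach the final answer. The `Problem` type and the `{prev}` placeholder convention are hypothetical names introduced here for illustration.

```python
# Minimal sketch of chaining isolated problems into one multi-step query,
# in the spirit of R-HORIZON's query composition (details assumed).
from dataclasses import dataclass


@dataclass
class Problem:
    question: str  # may contain "{prev}", to be filled with the prior answer
    answer: int    # ground-truth answer, used to link consecutive problems


def compose(problems: list[Problem]) -> tuple[str, int]:
    """Chain problems so the answer to problem i feeds problem i+1.

    Returns the composed multi-step query and its final ground-truth answer.
    """
    parts = []
    prev_answer = None
    for i, p in enumerate(problems):
        # Substitute the previous answer into this step's question, if any.
        q = p.question if prev_answer is None else p.question.format(prev=prev_answer)
        parts.append(f"Step {i + 1}: {q}")
        prev_answer = p.answer
    composed = "\n".join(parts) + "\nReport only the final answer."
    return composed, prev_answer


# Example: two toy arithmetic problems linked by their answers.
chain = [
    Problem("What is 3 + 4?", 7),
    Problem("Multiply {prev} by 6. What do you get?", 42),
]
query, final_answer = compose(chain)
```

Because every intermediate answer is woven into the next question, grading the final answer alone suffices to check the whole chain, which is what makes such composed queries useful for probing long-horizon reasoning.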