Long-Chain Reasoning
Some Thoughts After the Release of Gemini 3
傅里叶的猫· 2025-11-21 10:52
Core Insights
- The latest generation of AI models has significantly improved in reasoning capabilities and multi-modal understanding, making them more effective for complex tasks [5][6]
- Google's pricing strategy has shifted toward premium pricing for top-tier capabilities, contrasting with OpenAI's cost-cutting approach [7][8]
- A notable gap remains between domestic and international models, particularly in multi-modal capabilities, and may take 6-12 months to close [9]

Group 1: Model Capabilities
- The new generation of AI models excels at long-chain reasoning and multi-modal tasks, reducing hallucinations and improving coding capabilities [5]
- Coding-focused tools such as Cursor face significant pressure from Gemini 3, which outperforms them in both quality and speed [6]

Group 2: Pricing and Market Strategy
- Google's pricing has increased because advanced reasoning and multi-modal capabilities carry higher computational costs, rather than as a strategy of subsidizing market entry [7]
- The company aims to monetize through advertising, subscription services, and enterprise solutions, leveraging its existing account systems for consumer tools [10]

Group 3: Domestic vs. International Models
- While text-based capabilities are nearing parity, significant gaps remain in dynamic interaction and 3D cognition, primarily due to differences in computational power and training experience [9]
- For basic tasks, domestic models are sufficient, but advanced applications such as real-time UI and complex video understanding still require international models like Gemini or Claude [11]
R-HORIZON: The Era of Long-Horizon Reasoning Arrives; Fudan NLP & Meituan LongCat Release a New Paradigm for Probing the Capability Boundaries of LRMs
机器之心· 2025-10-22 08:46
Core Insights
- The article introduces R-HORIZON, a benchmark for evaluating and enhancing the long-chain reasoning capabilities of large reasoning models (LRMs) [8][39]
- It highlights the limitations of current training and evaluation paradigms, which focus primarily on isolated single-step problems and fail to address the complexities of real-world reasoning scenarios [4][5]

Group 1: Background and Motivation
- The transition from "single-step reasoning" to "long-chain decision-making" is emphasized as a critical evolution in AI reasoning capabilities [3]
- Existing benchmarks such as MATH500 and AIME focus on independent problems, which do not reflect the interconnected nature of real-world reasoning tasks [4]

Group 2: R-HORIZON Benchmark
- R-HORIZON is the first systematic method and benchmark for assessing and enhancing LRMs' long-chain reasoning abilities [8]
- It employs a query composition method to transform isolated tasks into complex multi-step reasoning scenarios, allowing for a more accurate evaluation of model capabilities (see the first sketch following this summary) [11]

Group 3: Key Findings
- A significant performance drop was observed in top models when faced with long-chain reasoning tasks, indicating a "reasoning cliff" where even advanced models struggle [16]
- The benchmark includes six representative datasets covering various reasoning tasks, such as mathematical reasoning and code generation [15]

Group 4: Mechanisms and Bottlenecks
- Three key bottlenecks were identified in current LRMs: limited effective reasoning length, localized reflection mechanisms, and imbalanced thinking-budget allocation [20][23]
- The analysis revealed that all models experienced significant performance declines as the number of interdependent problems increased, with larger models showing more resilience [21]

Group 5: Training and Performance Improvement
- R-HORIZON training demonstrated a dual performance enhancement, improving both long-chain task performance and single-problem accuracy [30][33]
- The training process led to more efficient reasoning lengths and better token-budget allocation across multi-step problems, addressing previous imbalances (the second sketch below illustrates one way to measure this) [34][35]

Group 6: Future Directions
- The launch of R-HORIZON marks a paradigm shift in LRM research, focusing on the extent of reasoning capabilities rather than just problem-solving ability [39]
- The framework is open-sourced, inviting collaboration from global researchers to advance the development of next-generation reasoning models [40]
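To make the query composition idea under Group 2 concrete, here is a minimal sketch of how isolated problems might be chained so that each step depends on the previous answer, and how accuracy could then be measured as the chain grows to expose a "reasoning cliff". The Problem format, compose_chain helper, and ask_model callable are illustrative assumptions, not the actual R-HORIZON implementation.

```python
# Illustrative sketch of query composition: independent problems are linked so
# that each step consumes the previous step's answer. All names here
# (Problem, compose_chain, ask_model) are assumptions, not R-HORIZON code.
import random
import re
from dataclasses import dataclass

@dataclass
class Problem:
    question: str  # problems after the first should embed a {prev} placeholder
    answer: int    # ground-truth answer, used to wire up the dependent chain

def compose_chain(problems: list[Problem]) -> tuple[str, int]:
    """Turn a list of otherwise independent problems into one multi-step query."""
    parts, prev = [], None
    for i, p in enumerate(problems, start=1):
        q = p.question if prev is None else p.question.format(prev=prev)
        parts.append(f"Step {i}: {q}")
        prev = p.answer  # the next step depends on this value
    parts.append("Answer only the final step.")
    return "\n".join(parts), prev

def reasoning_cliff(pool: list[Problem], ask_model, max_len: int = 5, trials: int = 20) -> dict[int, float]:
    """Accuracy as a function of how many interdependent problems are composed."""
    scores = {}
    for k in range(1, max_len + 1):
        correct = 0
        for _ in range(trials):
            query, expected = compose_chain(random.sample(pool, k))
            reply = ask_model(query)             # ask_model: str -> str, supplied by the caller
            nums = re.findall(r"-?\d+", reply)   # crude parse: take the last integer in the reply
            correct += bool(nums) and int(nums[-1]) == expected
        scores[k] = correct / trials
    return scores
```

In practice the pool would be drawn from datasets such as MATH500 or AIME with each item's answer precomputed, so the composed query stays automatically verifiable no matter how long the chain gets.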
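The imbalanced thinking-budget allocation mentioned under Groups 4 and 5 can also be inspected mechanically. The sketch below assumes the model's trace echoes the same "Step k:" markers used in the composed query; it splits the trace at those markers and counts how many tokens are spent on each sub-problem. The marker convention and whitespace tokenisation are assumptions for illustration only.

```python
# Sketch: estimate how a model's thinking budget is split across the sub-problems
# of a composed chain. Assumes the trace echoes "Step k:" markers; the whitespace
# token count is a crude stand-in for a real tokenizer.
import re

def thinking_budget_per_step(trace: str) -> dict[int, int]:
    """Approximate token count spent on each step of a multi-step reasoning trace."""
    budget: dict[int, int] = {}
    for piece in re.split(r"(?=Step \d+:)", trace):
        m = re.match(r"Step (\d+):", piece)
        if m:
            budget[int(m.group(1))] = len(piece.split())
    return budget

# Example: a trace that front-loads almost all of its tokens on Step 1.
trace = "Step 1: " + "reason " * 50 + "\nStep 2: quick guess\nStep 3: done"
print(thinking_budget_per_step(trace))  # -> {1: 52, 2: 4, 3: 3}
```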
The More Large Models Reflect, the More They Err: Long-Chain Reasoning Worsens Hallucinations Through Self-Persuasion | BUPT
量子位· 2025-07-03 04:26
Core Viewpoint
- The article discusses the phenomenon of "hallucination" in long-chain reasoning models, revealing that as the reasoning chain extends, the rate of hallucinations increases significantly, indicating a critical flaw in the models' ability to self-correct and maintain accuracy [1][3][13]

Group 1: Research Findings
- A research team from Beijing University of Posts and Telecommunications quantitatively demonstrated the "more thinking, more errors" phenomenon through a "thinking-chain audit experiment" [2][3]
- The study found that in long-chain reasoning, reflection does not serve as a correction mechanism but rather legitimizes hallucinations, allowing the model to alter definitions to maintain semantic consistency with user prompts [2][3][13]
- Errors in long-chain reasoning are not isolated incidents but tend to amplify along the reasoning chain, producing a "snowball effect" of inaccuracies [3][4]

Group 2: Methodology
- The research team constructed a controlled knowledge domain based on RFC protocol documents, generating long reasoning chains of 30-60 steps and inserting reflection operations to track confidence changes in real time (a sketch of such an audit loop follows this summary) [7][10]
- The controlled knowledge domain was designed to capture two types of hallucination cases, ensuring reliable reproduction of hallucinations in a controlled environment [9][11]
- The study employed a modeling system that tracks how knowledge is introduced, fed back, and refined across multiple reasoning steps, addressing the challenge of studying hallucination evolution in complex reasoning trajectories [10][12]

Group 3: Experimental Results
- The experiments revealed that when models encounter embedded errors, 55.9% trigger internal knowledge-fabrication processes [20]
- Reflection in long-chain reasoning devolves into a self-persuasion tool, with models reinforcing incorrect answers rather than approaching the truth [21][25]
- An evaluation of seven mainstream detection methods showed that existing interventions cannot fundamentally eliminate hallucination, with the best method achieving only 79% accuracy [27][30]
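A minimal sketch of the kind of thinking-chain audit described under Group 2: a long reasoning trace is grown step by step, a reflection prompt is injected at fixed intervals, and the model's self-reported confidence is logged before and after each reflection, so that rising confidence around an uncorrected error shows up as self-persuasion. The generate_step and self_report_confidence callables are hypothetical wrappers around whatever model API is in use, not the paper's actual experimental code.

```python
# Sketch of a thinking-chain audit: grow a long reasoning trace, inject reflection
# prompts periodically, and log self-reported confidence around each reflection.
# generate_step / self_report_confidence are hypothetical model wrappers, not the
# actual experimental code from the BUPT study.
from typing import Callable

def audit_reasoning_chain(
    question: str,
    generate_step: Callable[[str], str],             # trace so far -> next reasoning step
    self_report_confidence: Callable[[str], float],  # trace so far -> confidence in [0, 1]
    n_steps: int = 40,                               # the study reports chains of 30-60 steps
    reflect_every: int = 10,
) -> list[dict]:
    """Return a log of (step, confidence before reflection, confidence after)."""
    trace = f"Question: {question}\n"
    log = []
    for step in range(1, n_steps + 1):
        trace += generate_step(trace) + "\n"
        if step % reflect_every == 0:
            conf_before = self_report_confidence(trace)
            trace += generate_step(trace + "Reflect on the steps so far and correct any mistakes.\n") + "\n"
            conf_after = self_report_confidence(trace)
            # If confidence keeps rising while the underlying claim stays wrong,
            # reflection is acting as self-persuasion rather than correction.
            log.append({"step": step, "conf_before": conf_before, "conf_after": conf_after})
    return log
```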