Long-Horizon Task Execution Capability
Cambridge Opens the Black Box of Large-Model Failures: Stop Blaming Reasoning, the Errors Are in Execution
36Ke · 2025-10-13 10:46
Core Insights
- The article's core argument is that large models struggle with long-horizon tasks mainly because of weak execution, not weak reasoning [1][6][20].

Group 1: Execution Challenges
- Large models' performance declines as tasks get longer, which indicates that execution stability deserves more attention [6][14].
- Even with improved single-step accuracy, overall task execution can suffer because accuracy drops over many steps, a phenomenon the study refers to as self-conditioning [3][33].
- The researchers emphasize that executing plans reliably is essential, especially as the industry moves toward intelligent agents that handle entire projects rather than isolated problems [4][6].

Group 2: Performance Metrics
- The researchers propose several metrics for evaluating large models: Step Accuracy, Turn Accuracy, Turn Complexity, Task Accuracy, and Horizon Length [7][12].
- The findings indicate that model accuracy tends to decline as the number of steps in a task increases, which is critical for understanding the limits of current models [9][31].
- Larger models tend to maintain higher task accuracy over more rounds, suggesting that scaling up model size can improve execution [32][36].

Group 3: Self-Conditioning Effect
- The self-conditioning effect is identified as a significant cause of declining accuracy in long-horizon tasks: earlier errors make later mistakes more likely [33][35].
- Experiments show that even with perfect knowledge and planning, models can still fail in long-chain tasks because of unstable execution [20][28].
- The research indicates that simply increasing model size does not alleviate self-conditioning, which remains a challenge for long-horizon execution [36][37].

Group 4: Thinking Models
- The article discusses the advantages of "thinking" models, which are more resilient to self-conditioning and can execute longer tasks in a single round [43].
- These models, such as Qwen3 with thinking enabled, execute significantly longer tasks than their non-thinking counterparts [43].
- The findings support the view that reasoning before acting can improve the performance of large models on complex tasks [43].
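
The link between per-step accuracy, task accuracy, and horizon length described in the metrics section can be made concrete with a small calculation. The sketch below is illustrative only and is not taken from the study. It assumes every step succeeds independently with the same probability p, so task accuracy over H steps is p^H and the horizon length at a target success rate s is ln(s) / ln(p). The function names, and the toy self-conditioning model in which each past error lowers the per-step accuracy by a fixed penalty, are hypothetical choices made for illustration.

```python
import math
import random


def task_accuracy(step_accuracy: float, num_steps: int) -> float:
    """Chance of finishing num_steps steps with no error.

    Assumption: every step succeeds independently with the same probability.
    """
    if not 0.0 <= step_accuracy <= 1.0:
        raise ValueError("step_accuracy must be in [0, 1]")
    if num_steps < 0:
        raise ValueError("num_steps must be non-negative")
    return step_accuracy ** num_steps


def horizon_length(step_accuracy: float, success_rate: float = 0.5) -> float:
    """Task length at which task_accuracy falls to success_rate.

    Solves step_accuracy ** H == success_rate for H.
    """
    if not 0.0 < step_accuracy < 1.0:
        raise ValueError("step_accuracy must be strictly between 0 and 1")
    if not 0.0 < success_rate < 1.0:
        raise ValueError("success_rate must be strictly between 0 and 1")
    return math.log(success_rate) / math.log(step_accuracy)


def simulate_step_accuracy(
    base_accuracy: float,
    penalty_per_error: float,
    num_steps: int,
    trials: int = 2000,
    seed: int = 0,
) -> list[float]:
    """Toy self-conditioning model (hypothetical, not the study's setup).

    Each earlier error lowers the chance of getting the next step right by
    penalty_per_error. Returns the mean accuracy observed at each step index.
    """
    rng = random.Random(seed)
    correct_counts = [0] * num_steps
    for _ in range(trials):
        errors_so_far = 0
        for step in range(num_steps):
            # Accuracy at this step depends on how many errors came before.
            p = max(0.0, base_accuracy - penalty_per_error * errors_so_far)
            if rng.random() < p:
                correct_counts[step] += 1
            else:
                errors_so_far += 1
    return [count / trials for count in correct_counts]


if __name__ == "__main__":
    for p in (0.99, 0.999):
        print(p, task_accuracy(p, 100), horizon_length(p))
    curve = simulate_step_accuracy(0.99, 0.02, num_steps=200)
    print(curve[0], curve[-1])
```

Under the independence assumption, a model that is 99% accurate per step completes a 100-step task only about 37% of the time, and its horizon length at a 50% success rate is roughly 69 steps. Raising step accuracy to 99.9% stretches that horizon to roughly 693 steps, which is one way to see why small gains in single-step accuracy matter for long tasks. The toy simulation adds the opposite pressure: once errors feed back into later steps, accuracy at late steps falls below the starting value.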