谁说Scaling Law到头了？新研究：每一步的微小提升会带来指数级增长

Core Insights - The Scaling Law is being questioned due to perceived diminishing returns in model training, but recent research suggests that small improvements in accuracy can lead to exponential growth in task completion length, which may hold more economic value in real-world applications [1][2][4] Group 1: Research Findings - A recent paper from Cambridge University indicates that while there are diminishing returns in metrics like test loss, the real-world value of large language models (LLMs) often comes from their ability to complete longer tasks [2][4] - The paper highlights that the long-term execution of tasks has been a significant weakness in deep learning, with LLMs struggling to perform complex, lengthy tasks despite improvements in reasoning capabilities [4][6] - The authors propose that the failures in long tasks are primarily due to execution challenges rather than reasoning or planning limitations, emphasizing the need for more focus on execution capabilities in LLM research [6][20] Group 2: Experimental Insights - The study measures LLMs' long-horizon execution capabilities by isolating execution from planning and knowledge retrieval, revealing that larger models can significantly increase the number of successful execution rounds [6][23][25] - The concept of self-conditioning is introduced, where the model's performance deteriorates as it builds on its previous errors, leading to a decline in accuracy over multiple rounds [8][26][30] - The research shows that while increasing model size improves task execution, it does not alleviate the self-conditioning effect, which remains a challenge for LLMs in long-term tasks [27][30] Group 3: Implications for Investment - The findings suggest that the economic value of LLMs may not be accurately reflected in short-task benchmarks, as the ability to complete longer tasks is a more reliable indicator of their potential [18][20] - The paper encourages further investment in scaling models, as the ability to perform longer tasks could justify continued financial commitment despite short-term performance metrics suggesting stagnation [10][18] - The research calls for the design of new benchmarks that better assess the execution depth of models, highlighting a potential area for future investment and development in the AI sector [10][18]