Who says the Scaling Law has run its course? New research: tiny per-step improvements compound into exponential growth
机器之心· 2025-09-16 04:01
Core Viewpoint
- The article revisits the debate over diminishing returns from scaling in AI, particularly for large language models (LLMs). It presents a new perspective: even though single-step accuracy improves more slowly, those incremental gains compound into exponential growth in the length of tasks a model can complete, which may carry greater economic value in real-world applications [1][3].

Group 1: Scaling Law and Economic Value
- The scaling law may show diminishing returns on metrics like test loss, yet the real-world value of LLMs often comes from completing longer tasks. Larger models compound small improvements in single-step accuracy into exponential increases in achievable task length (a minimal numeric sketch of this compounding follows this summary) [3][6].
- The paper "The Illusion of Diminishing Returns: Measuring Long Horizon Execution in LLMs" argues that an AI agent's economic value derives from the length of tasks it can complete, not from short-task benchmarks that may suggest progress has stagnated [5][19].

Group 2: Long-Horizon Execution Challenges
- Long-horizon task execution has historically been a significant weakness of deep learning models. The paper notes that while LLMs have improved at complex reasoning, they still struggle to execute long tasks reliably [6][11].
- The authors argue that failures in long-horizon execution are often misattributed to deficiencies in reasoning or planning, when execution itself remains a critical and under-researched challenge [7][22].

Group 3: Self-Conditioning Effect
- The study identifies a self-conditioning effect: in long tasks, the per-step error rate rises as earlier mistakes accumulate in the context, so errors compound. This contrasts with human performance, where practice typically leads to improvement [9][30].
- Scaling model size does not by itself mitigate the self-conditioning effect, which can degrade performance over extended tasks [29][32].

Group 4: Impact of Thinking Models
- Recent thinking models can correct for the self-conditioning limitation, enabling much longer single-round task execution. For instance, the thinking version of GPT-5 can execute over 1,000 steps, far surpassing competitors [10][36].
- The research emphasizes reasoning before action: models that use thinking chains execute longer tasks more reliably than those that do not [36][37].

Group 5: Experimental Insights
- Experiments show that increasing model size significantly raises the number of rounds a model can execute successfully, demonstrating a clear scaling trend [27][28].
- While larger models improve task execution, they still face self-conditioning, which remains a critical area for future research [29][37].
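To make the compounding argument concrete, here is a minimal sketch under one simplifying assumption: each step succeeds independently with a constant per-step accuracy. The paper's self-conditioning finding means real per-step error actually grows with context, so treat this as intuition for the exponent, not as the paper's exact model.

```python
import math

def horizon_length(step_accuracy: float, success_target: float = 0.5) -> float:
    """Longest task (in steps) completed with probability >= success_target,
    assuming every step succeeds independently with the same accuracy."""
    return math.log(success_target) / math.log(step_accuracy)

# Halving the per-step error rate roughly doubles the achievable task length.
for acc in (0.99, 0.995, 0.999):
    print(f"per-step accuracy {acc:.3f} -> ~{horizon_length(acc):.0f}-step tasks at 50% success")
```

On these numbers, moving per-step accuracy from 99% to 99.9% (a seemingly marginal gain) lifts the 50%-success horizon from roughly 69 steps to roughly 693 steps, which is the sense in which small single-step improvements translate into exponential growth in completable task length.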
From SEAL adaptive learning to DFT reward correction, how much has LLM generalization substantively improved?
机器之心· 2025-09-07 01:30
Core Insights
- The article surveys the challenges and advances in the generalization capabilities of large language models (LLMs), covering strategies for improving them such as adaptive fine-tuning and dynamic gradient adjustment [7][11].

Group 1: Generalization in LLMs
- Generalization in AI refers to a model's ability to apply learned knowledge to new, unseen scenarios, as distinct from mere memorization of training data [8].
- Recent studies suggest that as models grow in complexity and scale, the meaning of "generalization" is being questioned, with some arguing that apparent generalization may be a form of data memorization rather than true abstraction [9][10].
- Research shows that increasing model size can improve performance on reasoning tasks while also strengthening memorization of factual knowledge, raising doubts about the true nature of generalization [9][10].

Group 2: CoT Reasoning and Its Limitations
- Chain-of-Thought (CoT) reasoning has been criticized as fragile: performance drops sharply when tests fall outside the training distribution, suggesting reliance on memory rather than genuine logical reasoning (a toy illustration of this distribution-shift diagnostic follows this summary) [10].
- Some experts argue that what looks like generalization may simply reflect training data that covers the test scenarios well enough, challenging the notion of true generalization [10].

Group 3: Research Trends and Focus Areas
- The volume of LLM-related research has surged, with a nearly sixfold increase in relevant studies from 2022 to 2025, concentrated on reasoning, generalization, and model safety [11].
- Recent work has shifted from examining only data distribution and model size toward training strategies, model update mechanisms, and data design as levers for improving generalization [11].
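As a purely illustrative sketch of the distribution-shift diagnostic these critiques rely on, the toy "model" below is a hypothetical lookup table, not any real LLM or API: a memorizer scores perfectly on prompts it has seen and near chance on prompts outside its training distribution.

```python
import random

def accuracy(model_fn, dataset):
    """Fraction of (question, gold answer) pairs the model answers correctly."""
    return sum(model_fn(q) == gold for q, gold in dataset) / len(dataset)

# Toy stand-in for a model that memorizes its training set: it recalls seen
# prompts exactly and guesses on anything unseen. Hypothetical, for illustration.
train_set = {f"What is {a} + {b}?": str(a + b) for a in range(10) for b in range(10)}

def memorizing_model(question: str) -> str:
    if question in train_set:           # in-distribution: recall
        return train_set[question]
    return str(random.randint(0, 200))  # out-of-distribution: guess

in_dist = list(train_set.items())[:50]
out_of_dist = [(f"What is {a} + {b}?", str(a + b)) for a in range(50, 60) for b in range(50, 55)]

print("in-distribution accuracy    :", accuracy(memorizing_model, in_dist))
print("out-of-distribution accuracy:", accuracy(memorizing_model, out_of_dist))
```

The toy only shows that a large in-distribution vs out-of-distribution gap is consistent with memorization; the cited CoT studies apply the same comparison to reasoning benchmarks with held-out problem structures.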
Jinqiu Select | Physical Intelligence co-founder: real data is irreplaceable for AI training
锦秋集· 2025-07-22 15:04
Core Viewpoint
- Over-reliance on alternative data sources can severely limit a model's ultimate capabilities; true breakthroughs must be built on real data [1][10].

Group 1: The Dilemma of Alternative Data
- Robotics researchers often seek cheaper substitutes for real data because collection costs are high, which compromises model performance [2][3].
- Common substitutes include simulation training, learning from human videos, and handheld devices that mimic robot actions, but each ultimately weakens the model's true potential [3][4].

Group 2: The Intersection Dilemma
- Data collection inevitably embeds human judgment, and avoiding real data narrows the space of problem-solving approaches [4][6].
- As models grow stronger, they become better at distinguishing alternative data from real data, shrinking the intersection of behaviors that transfer effectively [6][7].

Group 3: The Importance of Real Data
- Attempting to bypass real data leads to a "spork" outcome in which neither alternative data nor real data is used effectively [10][11].
- Building robust robot models that generalize well requires real data, which can then be complemented with diverse auxiliary data sources [11][12].

Group 4: The "Spork" Phenomenon
- The "spork" concept applies across AI research: attempts to splice manual design onto learning systems tend to create performance bottlenecks [13].
Under a probabilistic and statistical mechanism, does LLM reasoning really "understand the world"?
机器之心· 2025-06-21 06:32
Group 1
- The article asks whether LLMs (large language models) truly "understand the world" or whether their reasoning is merely a form of pattern matching, reflecting an ongoing industry debate about the nature of LLM reasoning capabilities [1][3][4].
- It references a paper from Apple suggesting that current reasoning models do not genuinely think but instead engage in pattern matching, which has reignited discussion in the AI community [3][4].
- Various researchers emphasize that true reasoning requires understanding causal relationships, indicating that LLMs lack the causal framework needed for deep and flexible reasoning [5][6][7].

Group 2
- The article also examines why enterprises are increasing spending on generative AI, noting a shift from building in-house solutions to purchasing third-party AI applications [1][2].
- It outlines an evaluation framework for selecting AI models, including the key factors that drive procurement decisions against the backdrop of traditional software purchasing practices [1][2].