Reinforcement Learning
Anthropic Experts on Reinforcement Learning Breakthroughs, the Compute Race, and the Road to AGI | Jinqiu Select
锦秋集· 2025-05-25 04:19
Core Insights
- AI is predicted to complete the workload of a junior engineer by 2026, marking a significant shift from code assistance to programming partnership [1][3]
- The rapid advances are driven by reinforcement learning, particularly in programming and mathematics, where clear success criteria exist [3][5]
- As AI becomes a powerful multiplier, the key question shifts from "how to find work" to "what to change with tenfold leverage" [4][30]

Group 1: AI Development Trajectory
- AI development shows an accelerating trend, with milestones from GPT-4 in March 2023 to the o1 model in September 2024, which strengthens reasoning capabilities [1][3]
- Programming is leading AI advances thanks to immediate feedback loops and high-quality training data [1][3]
- The expected "18-24 month capability doubling" pattern points to a critical juncture in AI development, aligning with predictions for 2026 [1][3]

Group 2: Reinforcement Learning and AI Capabilities
- Reinforcement learning is identified as the key to AI breakthroughs, moving from reinforcement learning from human feedback (RLHF) to reinforcement learning with verifiable rewards (RLVR) [3][8] (a code sketch of such a verifiable reward follows this summary)
- The quality of the feedback loop is crucial: clear reward signals determine the upper limit of AI capability [8][10]
- AI's rapid progress in verifiable fields like programming contrasts with its difficulties in subjective areas like literature [9][10]

Group 3: Future Predictions and Challenges
- By 2026, AI is expected to autonomously handle complex tasks such as Photoshop effects and flight bookings, shifting the focus to efficient deployment of multiple agents [21][22]
- The bottleneck for AI deployment will be the ability to verify and validate the performance of multiple agents [23][24]
- AI-driven tax automation is acknowledged as plausible, with basic operations expected by 2026, though full autonomy remains uncertain [22][25]

Group 4: Strategic Considerations for AI
- The next decade is critical for achieving AGI breakthroughs, with a significant focus on computational resources and infrastructure [32][34]
- Countries must redefine strategic resource allocation, treating computational capacity as a new form of wealth [27][28]
- Balancing risk and reward in AI development requires large-scale resource allocation to preserve future strategic options [27][28]

Group 5: Mechanistic Interpretability and AI Understanding
- Mechanistic interpretability aims to reverse-engineer neural networks to understand their core computations, revealing complex internal processes [38][39]
- Models can exhibit surprising behaviors, such as "pretending to compute," highlighting the need for a deeper understanding of what AI actually does [39][40]
- Ensuring AI aligns with human values and understanding its decision-making processes remain critical areas of research [42][45]
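To make the RLHF-to-RLVR distinction in Group 2 concrete, here is a minimal sketch of a verifiable reward for a programming task, assuming the task ships unit tests as plain `assert` statements; the function names are illustrative, not taken from the article.

```python
import os
import subprocess
import tempfile
import textwrap

def verifiable_reward(candidate_code: str, test_code: str, timeout_s: int = 10) -> float:
    """Reward 1.0 if the model's code passes the task's unit tests, else 0.0.

    Unlike an RLHF preference score, this signal is computed by executing the
    program, so it cannot be satisfied by merely sounding plausible.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(textwrap.dedent(candidate_code) + "\n\n" + textwrap.dedent(test_code))
        path = f.name
    try:
        result = subprocess.run(["python", path], capture_output=True, timeout=timeout_s)
        return 1.0 if result.returncode == 0 else 0.0
    except subprocess.TimeoutExpired:
        return 0.0  # non-terminating or too-slow solutions earn no reward
    finally:
        os.unlink(path)
```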
Thinking with Images Alone: Reinforcement Learning Forges a New Paradigm for Reasoning Models, with Maxed-Out Planning in Complex Scenes
机器之心· 2025-05-25 03:51
In recent years, LLMs and their multimodal extensions (MLLMs) have shown steadily improving reasoning ability across many tasks. However, existing MLLMs rely primarily on text as the medium for expressing and constructing the reasoning process, even when the input is visual.

[Figure: a typical MLLM architecture.]

This paradigm requires the model to first "translate" or "map" visual information into textual descriptions or internal text-like tokens, and only then apply the language model's text-based reasoning (a schematic code sketch follows below). The conversion inevitably risks losing or weakening the rich detail, spatial relationships, and dynamic features inherent in visual information, producing the so-called "modality gap". This gap not only limits the model's fine-grained perception of the visual world, it also hampers effective planning in complex visual scenes.

For example, a model may recognize objects in an image and describe relatively simple spatial relations between them, yet when extreme localization precision is required, or when it must deeply understand and predict highly complex, dynamic, or implicit interactions between objects rather than merely identify surface phenomena, its performance can still be limited by the detail lost when visual information is textualized.

A research team from Cambridge, University College London, and Google argues that language is not necessarily the most natural or effective modality for reasoning, especially in tasks involving spatial and geometric information. Motivated by this, the team proposes a ...
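As a rough illustration of the text-mediated pipeline criticized above, the sketch below shows the conventional MLLM structure in which visual features are projected into the language model's token space; the class names and dimensions are hypothetical placeholders, not the team's actual architecture.

```python
import torch
import torch.nn as nn

class TextMediatedMLLM(nn.Module):
    """Schematic of the conventional MLLM pipeline: visual features are projected
    into the language model's token space, and all subsequent reasoning happens
    over that text-like sequence."""

    def __init__(self, vision_encoder: nn.Module, llm: nn.Module,
                 vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.vision_encoder = vision_encoder             # e.g. a ViT producing patch features
        self.projector = nn.Linear(vision_dim, llm_dim)  # maps patches into "text-like" tokens
        self.llm = llm                                   # text-only reasoning backbone

    def forward(self, image: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        patches = self.vision_encoder(image)             # (B, num_patches, vision_dim)
        visual_tokens = self.projector(patches)          # (B, num_patches, llm_dim)
        # Whatever visual detail is not captured by these projected tokens is lost here:
        # the LLM reasons only over the concatenated token sequence.
        inputs = torch.cat([visual_tokens, text_embeds], dim=1)
        return self.llm(inputs)
```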
Is the GRPO Used by DeepSeek Really That Special? A Long-Form Analysis of Four Notable Papers
机器之心· 2025-05-24 03:13
Core Insights
- The article surveys recent advances in reasoning models, focusing on GRPO and its improved variants, and highlights how quickly reinforcement-learning-based reasoning is evolving [1][2][3]

Group 1: Key Papers and Models
- Kimi k1.5 is a newly released reasoning model trained with reinforcement learning, emphasizing long-context extension and improved policy optimization [10][17]
- Open-Reasoner-Zero is the first complete, open reproduction of reinforcement learning training directly on a base model, with notable results [34][36]
- DAPO explores modifications to GRPO to better suit reasoning training and presents a large-scale open-source LLM reinforcement learning system [48][54]

Group 2: GRPO and Its Characteristics
- GRPO is closely related to PPO (Proximal Policy Optimization) and similar to RLOO (REINFORCE Leave-One-Out); notably, many leading research efforts do not use GRPO at all [11][12][9]
- The core takeaway is that current RL algorithms are highly similar in implementation: GRPO is popular but not fundamentally revolutionary [15][6] (see the sketch after this summary)
- GRPO does include clever modifications aimed specifically at reasoning training rather than traditional RLHF, centered on generating multiple answers per reasoning prompt [13][12]

Group 3: Training Techniques and Strategies
- Kimi k1.5's training involves supervised fine-tuning (SFT) and emphasizes behavior patterns such as planning, evaluation, reflection, and exploration [23][24]
- Its curriculum starts with simpler tasks and gradually increases difficulty, akin to how humans learn [27][28]
- The paper stresses the importance of data distribution and prompt quality for effective reinforcement learning [22][41]

Group 4: DAPO Improvements
- DAPO introduces two distinct clipping hyperparameters to improve learning dynamics and efficiency [54][60]
- It uses dynamic sampling, removing prompts whose sampled rewards are all identical (flat) from the batch to speed up learning [63]
- It proposes token-level loss rather than per-response loss to better manage learning dynamics and avoid problems with long responses [64][66]

Group 5: Dr. GRPO Modifications
- Dr. GRPO modifies GRPO to improve learning dynamics, achieving stronger performance with shorter generations [76][79]
- Its changes normalize advantages across all tokens in a response, which helps manage the learning signal effectively [80][81]
- The paper highlights that high-quality data engineering can absorb much of the effect of these changes, underscoring the need for a balanced distribution of problem difficulty [82][89]
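Since the article's point is that GRPO is a small delta on PPO-style training, a minimal sketch of the group-relative advantage and the shared clipped objective may help; the `token_level` flag contrasts DAPO's token-level loss with GRPO's per-response normalization, and `clip_high > clip_low` corresponds to DAPO's decoupled clipping. This is an illustrative reconstruction, not code from any of the four papers.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """GRPO's group-relative advantage: sample G answers per prompt, score each,
    and normalize rewards within the group (no learned value network, unlike PPO).

    rewards: (G,) scalar rewards for the G sampled answers to one prompt.
    Returns one advantage per answer, later broadcast to every token of that answer.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def clipped_policy_loss(logp_new: torch.Tensor, logp_old: torch.Tensor,
                        advantages: torch.Tensor, mask: torch.Tensor,
                        clip_low: float = 0.2, clip_high: float = 0.2,
                        token_level: bool = True) -> torch.Tensor:
    """PPO-style clipped surrogate shared by GRPO/DAPO-like variants.

    logp_*: (G, T) per-token log-probs; advantages: (G,) per-answer advantages;
    mask: (G, T) marks real (non-padding) tokens.
    token_level=True averages over all tokens in the batch (DAPO-style);
    False normalizes per response first (original GRPO-style).
    """
    ratio = torch.exp(logp_new - logp_old)
    adv = advantages.unsqueeze(-1)                   # same advantage for every token of an answer
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - clip_low, 1 + clip_high) * adv
    per_token = -torch.minimum(unclipped, clipped) * mask
    if token_level:
        return per_token.sum() / mask.sum()
    return (per_token.sum(dim=-1) / mask.sum(dim=-1)).mean()
```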
Four Turing Award Winners at the Helm: The 2025 Zhiyuan Conference Reveals New Paths for AI Evolution
量子位· 2025-05-23 06:14
Core Viewpoint
- The 2025 Zhiyuan Conference will focus on the intersection of deep learning and reinforcement learning, showcasing advancements in AI and fostering discussions among leading researchers and industry experts [2][3][5]

Group 1: Conference Overview
- The conference will take place on June 6-7, 2025, in Beijing, China, and is recognized as a premier academic summit in artificial intelligence [3]
- It has attracted 12 Turing Award winners since its inception in 2019 and engages over 200 experts from more than 30 countries, with a total audience of 500,000 professionals [3]
- The event will feature nearly 20 thematic forums covering topics such as deep reasoning models, multimodal models, embodied intelligence, and AI for Science [4]

Group 2: Key Themes and Discussions
- The conference will explore four main themes: foundational theories, application exploration, industrial innovation, and sustainable development [4]
- Significant topics include the rise of reasoning models, the acceleration of open-source ecosystems, and the rapid evolution of embodied intelligence [2][4]
- A special "Large Model Industry CEO Forum" will feature CEOs from leading AI companies discussing the evolution and innovation paths of large models [5]

Group 3: Special Activities
- The "InnoVibe Co-Creation Space" will allow authors of popular AI papers to share their latest research, aiming to empower the new generation of AI talent [5]
- An AI interactive exhibition area will let attendees experience cutting-edge AI technologies firsthand [5]
- The conference aims to bridge theoretical advancements and real-world challenges, fostering collaboration between academia and industry [5]
We Had GPT Play Werewolf and It Especially Liked Killing Players 0 and 1. Why?
Hu Xiu· 2025-05-23 05:32
Core Viewpoint
- The discussion highlights the potential dangers and challenges posed by AI, emphasizing the need for awareness and proactive measures on AI safety.

Group 1: AI Safety Concerns
- AI has inherent issues such as hallucinations and biases, which require serious consideration despite the perception that the risks are distant [10][11]
- Adversarial examples pose significant risks: slight alterations to inputs can lead AI to make dangerous decisions, such as misinterpreting traffic signs [17][37] (a sketch of such an attack follows this summary)
- While adversarial examples are a real concern, many AI applications implement robust detection mechanisms to mitigate the risks [38]

Group 2: AI Bias
- AI bias is a prevalent issue, illustrated by incidents where AI mislabels individuals based on race or gender, with significant social implications [40][45]
- Root causes include overconfidence in model predictions and the influence of training data, which often reflects societal biases [64][72]
- Efforts to mitigate bias through data manipulation have limited effectiveness, as societal structures and language usage continue to shape AI outcomes [90][91]

Group 3: Algorithmic Limitations
- AI algorithms primarily learn correlations rather than causal relationships, which can lead to flawed decision-making [93][94]
- Training data that lacks comprehensive representation can exacerbate biases and inaccuracies in AI outputs [132]

Group 4: Future Directions
- Value alignment becomes crucial as AI systems grow more capable, requiring a deeper understanding of human values so that AI actions align with societal norms [128][129]
- Research into scalable oversight and superalignment is ongoing, aiming to develop frameworks that enhance AI's compatibility with human values [130][134]
- The importance of AI safety is increasingly recognized, with initiatives being established to integrate it into public policy discussions [137][139]
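Adversarial examples of the kind described in Group 1 are typically constructed with gradient-based attacks; since the discussion does not name a specific method, the snippet below sketches the classic Fast Gradient Sign Method (FGSM) purely as an illustration.

```python
import torch
import torch.nn.functional as F

def fgsm_perturbation(model, image: torch.Tensor, label: torch.Tensor,
                      epsilon: float = 0.01) -> torch.Tensor:
    """Fast Gradient Sign Method: a tiny, near-imperceptible nudge to the input
    that can flip a classifier's decision, the "slightly altered traffic sign"
    failure mode described above.
    """
    image = image.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(image), label)   # model is any classifier returning logits
    loss.backward()
    adversarial = image + epsilon * image.grad.sign()  # step in the direction that raises the loss
    return adversarial.clamp(0, 1).detach()            # keep pixel values valid
```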
Four Turing Award Winners at the Helm: The 2025 Zhiyuan Conference Reveals New Paths for AI Evolution
机器之心· 2025-05-23 04:17
In 2006, Geoffrey Hinton of the University of Toronto and colleagues proposed layer-wise pre-training, breaking through the technical bottleneck of training deep neural networks and laying the foundation for the revival of deep learning.

Reinforcement learning, the paradigm in which an agent learns by interacting with its environment, has core ideas that predate the rise of deep learning. DeepMind's DQN in 2013 achieved an early combination of deep learning and reinforcement learning, and AlphaGo's success in 2016 pushed the fusion of the two into public view, sharply raising attention to this intersection.

In the history of AI, connectionism (represented by neural networks) and behaviorism (represented by reinforcement learning) arose from different theoretical lineages, but their technical intersection was evident early on. The two lines originally grew independently; today they are interweaving and converging into a common foundation for the next generation of general artificial intelligence.

On June 6, the discussion of deep learning and reinforcement learning will continue at the 2025 Zhiyuan Conference (June 6-7, Beijing, China), a dialogue across time likened to "two stars converging", reviewing the past and jointly probing the ultimate answers to the riddle of intelligence. At the same time, the rise of reasoning models, the acceleration of the open-source ecosystem, and the flourishing of embodied intelligence have become 2025 ...
Lilian Weng's Long-Form Essay in 5 Minutes: How Do Large Models Think?
Hu Xiu· 2025-05-22 09:54
Core Insights
- The article surveys the latest paradigms in AI, particularly "test-time compute" and how large language models (LLMs) can enhance their reasoning capabilities through various methods [3][12][26]

Group 1: AI Paradigms
- The blog systematically organizes the latest paradigms in AI, emphasizing "test-time compute" [3]
- LLMs exhibit similarities to human thought processes, drawing parallels with Daniel Kahneman's "Thinking, Fast and Slow" [4][5]
- Reasoning in LLMs can be likened to human cognition: "System 1" represents quick, intuitive responses, while "System 2" denotes slower, analytical thinking [6][7]

Group 2: Enhancing Reasoning in LLMs
- Chain of Thought (CoT) lets models allocate variable computational resources based on problem complexity, which is particularly beneficial for complex reasoning tasks [9]
- Reinforcement learning (RL) has been scaled up for reasoning, with significant changes initiated by OpenAI's developments [14]
- The training of models such as DeepSeek R1 involves parallel sampling and sequential improvement, enhancing LLMs' reasoning capabilities [15][16] (see the sampling sketch after this summary)

Group 3: External Tool Utilization
- Using external tools during reasoning can improve efficiency and accuracy, for example by employing code interpreters for complex calculations [19]
- OpenAI's recent models, o3 and o4-mini, emphasize tool usage, marking a paradigm shift in AI development [20][21]

Group 4: Future Research Directions
- Open questions include improving RNNs to dynamically adjust computation layers and enhancing Transformer architectures for better reasoning [28]
- Another challenge is training models to generate human-readable CoTs that accurately reflect their reasoning while avoiding reward hacking [29][30]
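A minimal sketch of the parallel-sampling ideas mentioned above: best-of-N selection with a verifier, and self-consistency voting as a verifier-free variant. `generate`, `score`, and `extract_answer` are stand-ins for the sampler, verifier, and answer parser, not names from the essay.

```python
from collections import Counter

def best_of_n(generate, score, question: str, n: int = 16) -> str:
    """Parallel sampling: draw n candidate reasoning chains and keep the one the
    verifier scores highest."""
    candidates = [generate(question) for _ in range(n)]
    return max(candidates, key=score)

def self_consistency(generate, extract_answer, question: str, n: int = 16) -> str:
    """Majority vote over final answers, a verifier-free form of parallel sampling
    often used for math problems."""
    answers = [extract_answer(generate(question)) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]
```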
Lilian Weng's (翁荔) Latest Long-Form Essay: Why We Think
量子位· 2025-05-18 05:20
Core Insights
- The article discusses "test-time compute" and "chain-of-thought" (CoT) as methods for significantly enhancing model performance in artificial intelligence [1][2][6]

Group 1: Motivation and Theoretical Background
- Allowing models to think longer before answering, through various methods, can make them more capable and help overcome current limitations [2][8]
- The core idea is deeply related to human thinking: people need time to analyze complex problems, in line with Daniel Kahneman's dual-system theory from "Thinking, Fast and Slow" [10][11]
- By consciously slowing down and reflecting, models can engage in more rational decision-making, akin to human System 2 thinking [11][12]

Group 2: Computational Resources and Model Architecture
- Deep learning views neural networks as having access to a budget of computation and storage per forward pass, optimized through gradient descent [13]
- In Transformer models, the compute (FLOPs) per generated token is roughly twice the number of parameters; sparse models such as Mixture-of-Experts (MoE) use only a fraction of the parameters in each forward pass [13] (a worked example follows this summary)
- CoT lets a model perform more computation per token depending on the difficulty of the problem, enabling a variable compute budget [13][18]

Group 3: CoT and Learning Techniques
- Early CoT improvements involved generating intermediate steps for mathematical problems; subsequent research showed that reinforcement learning can significantly strengthen CoT reasoning [19][20]
- Supervised learning on human-written reasoning paths, together with appropriate prompts, greatly improves the mathematical ability of instruction-tuned models [21][23]
- The success-rate gains from CoT prompting on mathematical problems are more pronounced in larger models [23]

Group 4: Sampling and Revision Techniques
- The fundamental goal of test-time computation is to adaptively modify the model's output distribution during reasoning [24]
- Parallel sampling is straightforward but limited by the model's ability to produce a correct solution in one pass, while sequential revision requires careful execution to avoid introducing errors [24][25]
- Combining both methods yields the best results: simpler problems benefit from sequential revision alone, while more complex problems perform best with a mix of both approaches [24][25]

Group 5: Advanced Techniques and Future Directions
- Algorithms such as Best-of-N and beam search are employed to steer the search toward high-scoring samples [29][30]
- The RATIONALYST system synthesizes rationales from vast unannotated data, providing implicit and explicit guidance for generating reasoning steps [32][33]
- Future challenges include enhancing computational efficiency, integrating self-correction mechanisms, and ensuring the reliability of reasoning outputs [47][50]
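A quick worked illustration of the "FLOPs per token is roughly 2 x active parameters" rule of thumb from Group 2; the model sizes and the 5% MoE activation fraction below are hypothetical, chosen only to show the arithmetic.

```python
def flops_per_token(total_params: float, active_fraction: float = 1.0) -> float:
    """Rule of thumb: each generated token costs roughly 2 * (parameters used in
    the forward pass) FLOPs. For a Mixture-of-Experts model only a fraction of
    the parameters is active per token."""
    return 2 * total_params * active_fraction

# Hypothetical dense 70B model: about 1.4e11 FLOPs per generated token.
dense = flops_per_token(70e9)
# Hypothetical 670B-parameter MoE with ~5% of parameters active: about 6.7e10 FLOPs.
sparse = flops_per_token(670e9, active_fraction=0.05)
print(f"dense: {dense:.2e} FLOPs/token, sparse MoE: {sparse:.2e} FLOPs/token")
```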
Just In: Peking University Alumna Lilian Weng's Latest Blog Post, Why We Think
机器之心· 2025-05-18 04:25
Core Insights
- The article discusses advances in using "thinking time" during model inference to enhance the reasoning capabilities of AI models such as GPT, Claude, and Gemini [2][3][16]

Group 1: Thinking Mechanisms
- "Thinking time" is analogous to human cognitive processes, where complex problems require reflection and analysis before arriving at a solution [6]
- Daniel Kahneman's dual-process theory divides human thinking into fast (System 1) and slow (System 2) modes, emphasizing the importance of slower, deliberate thought for accurate decision-making [12]

Group 2: Computational Resources
- In deep learning, a neural network can be characterized by the computation and storage it uses in each forward pass, which shapes its performance [8]
- Model performance can be improved by allowing more computation during inference, particularly through strategies like Chain of Thought (CoT) prompting [8][18]

Group 3: Chain of Thought (CoT) and Learning Strategies
- CoT prompting significantly raises the success rate on mathematical problems, with larger models benefiting more from extended "thinking time" [16]
- Early research focused on supervised learning from human-written reasoning paths, later evolving into reinforcement learning strategies that improve CoT reasoning [14][41]

Group 4: Test-Time Computation Strategies
- Two main strategies for improving generation quality are parallel sampling and sequential revision, each with distinct advantages and challenges [19][20]
- Parallel sampling is straightforward but relies on the model producing a correct answer in one pass, while sequential revision allows targeted corrections but is slower [20][21]

Group 5: Reinforcement Learning Applications
- Recent studies successfully employ reinforcement learning to strengthen reasoning in language models, particularly on STEM-related tasks [41][46]
- Training often involves a cold-start phase followed by reasoning-oriented reinforcement learning, optimizing performance through structured feedback [42][43]

Group 6: External Tools and Integration
- External tools, such as code interpreters or APIs, can enhance reasoning by offloading certain computational tasks [52][56]
- The ReAct method combines external actions with the reasoning trajectory, allowing models to fold external knowledge into their inference path [56][57] (a minimal loop is sketched after this summary)

Group 7: Model Interpretability and Trustworthiness
- CoT aids interpretability by making model behavior observable and monitorable [59]
- However, CoT outputs are not always faithful: biases and errors can affect the reliability of the stated reasoning [62][64]

Group 8: Adaptive Computation and Token Utilization
- Adaptive computation time lets models dynamically adjust the number of computation steps during inference, enhancing their reasoning capabilities [81]
- Special tokens, such as thinking tokens, can provide additional processing time and improve performance on complex tasks [85][89]
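As a rough sketch of the ReAct-style loop mentioned in Group 6, the following interleaves free-form thoughts with tool calls and feeds each observation back into the reasoning trajectory. `llm` and the entries of `tools` are stand-in callables, and the "Thought/Action/Final Answer" text format is one common convention rather than a fixed API.

```python
def react_loop(llm, tools: dict, question: str, max_steps: int = 5) -> str:
    """Minimal ReAct-style loop: the model alternates between emitting reasoning
    text and requesting external actions, whose observations are appended to the
    growing trajectory before the next step."""
    trajectory = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(trajectory)                  # model emits Thought / Action / or Final Answer
        trajectory += step + "\n"
        if "Final Answer:" in step:
            return step.split("Final Answer:", 1)[1].strip()
        if "Action:" in step:
            call = step.split("Action:", 1)[1].strip()
            tool_name, _, arg = call.partition(" ")
            observation = tools.get(tool_name, lambda a: "unknown tool")(arg)
            trajectory += f"Observation: {observation}\n"
    return trajectory  # fall back to the raw trajectory if no final answer emerged
```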
New Research from Tongyi Lab: Large Models "Play" the Search Engine Themselves, Boosting Reasoning Without a Search API
量子位· 2025-05-17 03:50
Core Insights
- The article introduces ZeroSearch, an open-source reinforcement learning framework developed by Alibaba's Tongyi Lab, which enhances the search capabilities of large language models (LLMs) without relying on real search engines [4][19]

Group 1: Challenges in Current Approaches
- Real search engines return documents of unpredictable quality, introducing noise and instability into training [2]
- Reinforcement learning (RL) training requires frequent queries, leading to significant API costs that limit scalability [3]

Group 2: ZeroSearch Solution
- ZeroSearch eliminates interaction with real search engines, avoiding API costs and making large-scale RL training economically feasible [19][36]
- The framework lets LLMs evolve their own search skills through a simulated search environment and progressive noise-resistant training [6][19]

Group 3: Training Methodology
- Lightweight fine-tuning turns an LLM into a "search engine simulator" capable of generating both useful results and noisy distractors from minimal labeled data [7][10]
- A curriculum-based noise schedule starts with high-quality documents and gradually mixes in noise, improving training stability and effectiveness [12][14] (a schedule sketch follows this summary)

Group 4: Performance Metrics
- Experimental results indicate that a 3-billion-parameter LLM is enough to significantly improve search capabilities while saving on API costs [5]
- ZeroSearch outperforms existing methods on both single-hop and multi-hop question answering, demonstrating superior retrieval capability [25][26]

Group 5: Compatibility with RL Algorithms
- ZeroSearch is compatible with various RL algorithms, including Proximal Policy Optimization (PPO) and Group Relative Policy Optimization (GRPO), providing flexibility in training strategy [19][20]
- GRPO shows better training stability, while PPO offers more flexibility on certain tasks, indicating that ZeroSearch can adapt to different algorithmic needs [21][34]

Group 6: Future Implications
- ZeroSearch's approach addresses the cost and stability issues of current methods, paving the way for future advances in intelligent retrieval systems [37]
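A small sketch of the curriculum-based noise idea described in Group 3, assuming a linear ramp purely for illustration (the paper's exact schedule may differ); `gen_useful` and `gen_noisy` stand in for the fine-tuned simulator's two generation modes.

```python
import random

def noise_probability(step: int, total_steps: int,
                      p_start: float = 0.0, p_end: float = 0.5) -> float:
    """Curriculum schedule: early in training the simulated search engine returns
    mostly clean documents; the share of noisy ones rises as training proceeds."""
    frac = min(step / max(total_steps, 1), 1.0)
    return p_start + (p_end - p_start) * frac

def simulated_search(query: str, step: int, total_steps: int,
                     gen_useful, gen_noisy, k: int = 5) -> list:
    """Stand-in for ZeroSearch's fine-tuned search simulator: for each of the k
    returned documents, decide whether to emit a useful or a noisy one."""
    p_noise = noise_probability(step, total_steps)
    return [gen_noisy(query) if random.random() < p_noise else gen_useful(query)
            for _ in range(k)]
```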