强化学习
Search documents
边学边练,推理觉醒:LUFFY让强化学习即学即用!
机器之心· 2025-05-05 03:40
破解 "只学不练" 与 "只练不学" 的难题 想象你准备参加一场高水平的数学竞赛。如果你只是反复背诵往年题目的标准答案,从不亲自动手解题,那么一旦遇到新题型,很可能束手无策;反过来,如果 你闭门造车,只凭自己反复试错而从不参考老师和高手的解题经验,进步又会异常缓慢。这就好比 AI 模型 训练中长期存在的两种极端: 「 模仿学习 」 只顾照搬 示范却缺乏自我实践, 「强化学习 」 一味自我探索却不借鉴现有经验。 这两种 「只学不练 」 和 「只练不学 」 的策略各有弊端:前者往往学得快但 泛化差 ,后者可能探索勤但 效率低 。那么,有没有两全其美的办法,让模型既能借 鉴高手经验又能保持自主探索?最近,上海 AI 实验室联合西湖大学、南京大学和香港中文大学的研究团队提出了一种全新的强化学习范式: LUFFY(Learning to reason Under oFF-policY guidance) 。 论文链接:https://arxiv.org/abs/2504.14945 代码仓库:https://github.com/ElliottYan/LUFFY 图表 1. 在六项竞赛级数学推理基准上的整体表现。在 A ...
梁文锋和杨植麟再“撞车”
虎嗅APP· 2025-05-04 08:29
Core Viewpoint - The article discusses the competitive landscape of large model development in China, focusing on the advancements and challenges faced by companies like DeepSeek and Kimi, as well as the impact of larger tech firms like Alibaba and Tencent on the market [2][4][12]. Group 1: Model Developments - DeepSeek launched its new model, DeepSeek-Prover-V2, with a parameter scale of 671 billion, significantly larger than the previous version's 7 billion, resulting in improved efficiency and accuracy in mathematical tasks [2][9]. - Kimi, developed by the Moonlight team, also released a model for formal theorem proving, with a smaller parameter scale of 1.5 billion and 7 billion, achieving an 80.7% pass rate in miniF2F tests [2][3]. - The evolution of DeepSeek's models is synchronized, with a timeline of updates from Prover series models starting in March 2024 to the latest Prover-V2 in April 2025 [8][9]. Group 2: Competitive Landscape - DeepSeek faces increasing competition from Alibaba's new model Qwen3, which is touted as a hybrid reasoning model with superior performance despite having only one-third the parameters of DeepSeek's R1 model [14][15]. - Kimi has seen rapid growth, reaching 20 million monthly active users within a year, but is now challenged by Tencent's Yuanbao, which has surpassed Kimi in user numbers due to aggressive marketing [12][13]. - The article highlights the need for multiple leading models in the Chinese market, suggesting that competition and innovation should be encouraged rather than focusing on a single dominant player [14][15]. Group 3: Future Directions - DeepSeek's founder has indicated a focus on three paths for achieving AGI: mathematics and code, multimodal learning, and natural language processing, viewing mathematics as a verifiable system for high intelligence [7]. - The upcoming R2 model is expected to enhance reinforcement learning capabilities, while the V4 model may involve a longer development cycle due to significant changes in pre-training methods [10][11].
机器人领域新突破!顶刊《IJRR》近期重磅论文概述
机器人大讲堂· 2025-05-03 08:04
Group 1 - The article reviews seven selected papers published in the International Journal of Robotics Research, covering various research directions in robotics such as soft actuators, human-robot interaction, dual-arm robots, multi-robot systems, and bipedal locomotion control [1][6][18][27][38][48][58] Group 2 - A new low-profile soft rotary pneumatic actuator was designed, addressing the limitations of traditional soft pneumatic actuators in confined spaces and providing a compact solution for applications in wearable devices and biomedical devices [2][4] - The THÖR-MAGNI dataset was introduced to overcome data bottlenecks in social navigation and human-robot interaction, featuring multi-modal data collection and extensive scene coverage [6][11][14] - The FMB benchmark was developed to standardize robotic manipulation research, offering a diverse set of objects and a modular framework for imitation learning [18][20][24][26] - A framework for dual-arm robots to manipulate deformable linear objects in constrained environments was proposed, achieving high success rates in complex tasks [27][30][31][34] - A real-time planning method for large heterogeneous multi-robot systems was introduced, significantly improving computational efficiency and robustness in dynamic environments [38][40][45] - A survey on communicating robot learning during human-robot interaction highlighted the importance of a closed-loop communication framework to enhance collaboration [48][50][53][55] - Reinforcement learning was applied to bipedal locomotion control, demonstrating significant advancements in adaptability and robustness in complex environments [58][60][62]
OpenAI最新技术报告:GPT-4o变谄媚的原因万万没想到
量子位· 2025-05-03 04:05
一水 发自 凹非寺 量子位 | 公众号 QbitAI GPT-4o更新后"变谄媚"?后续技术报告来了。 OpenAI一篇新鲜出炉的认错小作文,直接引来上百万网友围观。 CEO奥特曼也做足姿态,第一时间转发小作文并表示: (新报告) 揭示了GPT-4o更新失败是因为什么,从中OpenAI学到了什么,以及我们将会采取的应对措施是什么。 概括而言,最新报告提到,大约一周前的bug原来出在了"强化学习"身上—— 上次更新 引入了一个基于用户反馈的额外奖励信号 ,即对ChatGPT的点赞或点踩。 虽然这个信号通常很有用,但可能使模型逐渐倾向于做出更令人愉快的回应。 此外,尽管还没有明确证据,但 用户记忆在某些情况下也可能加剧奉承行为的影响。 一言以蔽之,OpenAI认为一些单独看可能对改进模型有益的举措,结合起来后却共同导致了模型变得"谄媚"。 而在看到这篇报告后,目前大多数网友的反应be like: (你小汁) 认错态度不错~ 甚至有人表示,这算得上OpenAI过去几年里最详细的报告了。 具体咋回事儿?接下来一起吃瓜。 完整事件回顾 4月25日,OpenAI对GPT-4o进行了一次更新。 在官网的更新日志中,当时提到 ...
DeepSeek新数学模型刷爆记录!7B小模型自主发现671B模型不会的新技能
量子位· 2025-05-01 03:53
DeepSeek放大招!新模型专注数学定理证明,大幅刷新多项高难基准测试。 在普特南测试上, 新模型 DeepSeek-Prover-V2 直接把记录刷新到 49道 。 目前的 第一名 在657道题中只做出 10道 题,为Kimi与 AIME2024冠军团队Numina 合作成果 Kimina-Prover 。 而未针对定理证明优化的 DeepSeek-R1只做出 1道 。 让还没发布的R2更令人期待了。 | 657) | | --- | | (out of | | Lean | | मै | Model | num- | | | --- | --- | --- | --- | | | | solved | compute | | 1 | Kimina-Prover-7B-Distill♥ | 10 | pass@192 | | 2 | Self-play Theorem Prover♥ | 8 | pass@3200 | | 3 | Goedel-Prover-SFT♥ | 7 | pass@512 | | 4 | ABEL | 7 | pass@596 | | 5 | InternLM2.5-StepPr ...
后训练时代如何延续Scaling Law?这是你该读的LLM后训练综述
机器之心· 2025-05-01 02:11
Core Insights - The article discusses the significance of post-training techniques such as fine-tuning and reinforcement learning (RL) in enhancing the capabilities of large language models (LLMs) [1][2][5]. Summary by Sections Overview of LLM Post-Training - A recent review report on LLM post-training has gained positive feedback, compiling a resource library of related papers and tools that has received over 700 stars [2]. - The review includes contributions from institutions like UAE University of Artificial Intelligence, University of Central Florida, Google DeepMind, and the University of Oxford, covering techniques to enhance LLMs through RL, supervised fine-tuning, and evaluation benchmarks [2]. Challenges in LLMs - Despite advancements, LLMs face issues such as generating misleading content (referred to as "hallucinations") and maintaining logical consistency in longer conversations [5]. - The reasoning capabilities of LLMs are debated, as they operate on implicit statistical patterns rather than explicit logical reasoning, which can lead to difficulties in simple logical tasks [5]. Training Phases of LLMs - The training process of LLMs is divided into two main phases: pre-training and post-training. Pre-training focuses on next-token prediction using large datasets, while post-training involves multiple rounds of fine-tuning and alignment to improve model behavior and reduce biases [6]. Fine-Tuning Techniques - Fine-tuning is essential for adapting pre-trained LLMs to specific tasks, enhancing their performance in areas like sentiment analysis and medical diagnosis. However, it carries risks of overfitting and high computational costs [7][10]. - Efficient techniques like Low-Rank Adaptation (LoRA) and adapters can reduce computational overhead while allowing models to specialize in specific tasks [10]. Reinforcement Learning in LLMs - RL is introduced to improve LLM adaptability through dynamic feedback and optimization of sequential decisions. This differs from traditional RL settings, as LLMs select tokens from a vast vocabulary rather than a limited action space [9][11]. - The feedback in language-based RL is often sparse and subjective, relying on heuristic evaluations rather than clear performance metrics [13]. Scaling Techniques - Scaling is crucial for enhancing LLM performance and efficiency, though it presents significant computational challenges. Techniques like Chain-of-Thought (CoT) reasoning and search-based methods help improve multi-step reasoning and factual accuracy [14][15]. - Despite advancements, challenges such as diminishing returns and increased inference time remain, necessitating targeted strategies for efficient deployment [15]. Evaluation Benchmarks - Various benchmarks have been proposed to assess the performance of LLM post-training, ensuring a comprehensive understanding of their strengths and limitations across different tasks [46]. - These benchmarks play a vital role in improving response accuracy, robustness, and ethical compliance during the post-processing phase [46]. Future Directions - The article highlights the growing interest in RL for optimizing LLMs since 2020, emphasizing the need for interactive methods and robust reward modeling to address challenges like reward hacking [52]. - Key areas for future research include personalized and adaptive LLMs, process versus outcome reward optimization, and the integration of dynamic reasoning frameworks to enhance model performance in complex queries [53].
从论文中积累复现 R1 的 insight
理想TOP2· 2025-04-30 13:04
Core Viewpoint - The article discusses advancements in reinforcement learning (RL) techniques for large language models (LLMs), emphasizing the need for improved algorithms, reward design, and training strategies to enhance reasoning capabilities and model performance. Group 1: Algorithm Improvements - Current algorithms have significant room for improvement, with the introduction of Dr. GRPO addressing issues in GRPO related to response length bias and problem difficulty bias, leading to better token efficiency and reasoning performance [3][4]. - The DAPO method is proposed to tackle entropy collapse and sample efficiency issues in GRPO and PPO, enhancing training stability and efficiency through techniques like Clip-Higher and dynamic sampling [6]. Group 2: Training Strategies - Larger training batch sizes (e.g., TBS = 1024) enhance training efficiency and stability, while on-policy strategies are more advantageous than off-policy ones for model exploration [6]. - Increasing rollout times (e.g., n = 64) improves training outcomes, encouraging longer responses, and a dynamic annealing strategy for KL penalty is recommended to balance exploration and stability [6]. Group 3: Reward Design - Early reward design flaws led to various reward hacking behaviors, necessitating a refined reward system that includes format and answer rewards to constrain model behavior and avoid cheating [6]. - The relationship between response length and reasoning ability is not causal; longer responses may provide more exploration space but do not directly enhance reasoning performance [6]. Group 4: Generalization and Learning - RL is more effective than supervised fine-tuning (SFT) in promoting generalization across tasks, suggesting that reasoning can be a universal capability stimulated by specific tasks [7][9]. - Combining rule-based rewards with reward model-based rewards is beneficial, especially in tasks without clear answers, to enhance learning and mitigate reward hacking [9].
新势力 AI 大模型全对比:小鹏野心、理想务实、蔚来追赶
2 1 Shi Ji Jing Ji Bao Dao· 2025-04-29 12:07
Core Insights - The rapid development of AI models, particularly in the automotive sector, is highlighted by the emergence of large-scale models like Xiaopeng's 720 billion parameter model and Li Auto's 22 billion parameter MindVLA model, indicating a competitive race among new automotive players [1][2][21] - Xiaopeng's strategy focuses on cloud-based model training and distillation to overcome limitations in on-vehicle computing power, while Li Auto emphasizes practical applications with its VLA model [2][12][21] - NIO appears to lag behind in the AI model race, having not made significant advancements since the introduction of its NWM model, which is still not widely deployed [4][18][21] Xiaopeng's AI Strategy - Xiaopeng is developing a "world base model" that utilizes a large language model (LLM) backbone and extensive multimodal driving data, aiming for a comprehensive understanding and interaction with the physical world [1][8] - The "cloud model factory" allows for rapid iteration cycles of about five days, leveraging powerful AI infrastructure and data processing capabilities [2][13] - Xiaopeng's approach includes reinforcement learning to enhance the model's ability to handle extreme scenarios, which is crucial for autonomous driving [9][17] Li Auto's Approach - Li Auto's MindVLA model is designed to interact with the physical world, similar to robotics, and is deployed directly on vehicles [2][14] - The company has successfully implemented an end-to-end system that has been emulated by other automakers, showcasing its leadership in the field [14][15] - Li Auto's focus on practical applications and user feedback is evident in its development of a model that aligns with human driving behavior [17][21] NIO's Position - NIO's NWM model aims to enhance spatial understanding and predictive capabilities but has faced delays in large-scale deployment due to organizational changes and regulatory challenges [4][18] - The company is leveraging a "crowd intelligence" approach, utilizing data from its fleet to improve model training and safety features [20][21] - Despite slower progress, NIO emphasizes safety and has implemented advanced safety features, positioning itself as a cautious player in the competitive landscape [20][21] Industry Trends - The automotive industry is witnessing a shift from traditional mapping to end-to-end AI models, with companies exploring various technical paths to enhance autonomous driving capabilities [4][5] - The performance of language models is showing diminishing returns as parameter sizes increase, prompting a move towards multimodal models by major tech players [4][5] - The competition among Xiaopeng, Li Auto, and NIO reflects broader trends in the industry, where technological ambition, practical application, and safety considerations are critical for success [21]
对谈 Pokee.ai 朱哲清:强化学习做核心,Agent 的少数派造法
晚点LatePost· 2025-04-29 08:43
可能是更高效、更便宜的 Agent 实现路径。 文 丨 孙海宁 编辑 丨 程曼祺 主流 AI Agent 都把大语言模型(LLM,或者它的多模态版本)当作 "大脑",靠一个或几个 LLM 编 排工作、调用工具。但也有另一条路:Agent 规划、作业靠不依赖自然语言的强化学习模型,LLM 只 充当 Agent 和人类的 "交互层"。 不一样的想法,来自去年 10 月成立,至今只有 4 个正式员工的 Pokee.ai。 Pokee.ai 创始人朱哲清有十余年强化学习研究、落地经验。2017 年起,从杜克大学计算机科学专业毕 业的朱哲清,一边在斯坦福大学攻读强化学习方向博士学位,师从 Benjamin Van Roy;一边在 Meta 工作,曾任 Meta"应用强化学习" 部门负责人,他用强化学习算法改善内容推荐系统,把上任前只剩 3 人,一度要关停的部门扩张至 10 余人,为 Meta 增收 5 亿美元。 靠 LLM 规划、决策,是个自然而主流的想法。OpenAI Operator 和网页交互、操作电脑的能力基于 GPT-4o 模型,Manus 完成任务则是靠 Claude 3.5 Sonnet 模型做长程规划。 ...
四个理工男“硬刚”妇科诊断推理大模型,更小参数量实现更高准确率
Tai Mei Ti A P P· 2025-04-29 02:22
Core Insights - The article discusses the "resource misalignment battle" in the AI sector, where large companies focus on parameter upgrades while smaller startups target niche markets that larger firms overlook [1] - The medical industry is highlighted as a high-risk area with stringent accuracy requirements, making it difficult for general models to meet specific needs [1] - There is a growing recognition among AI companies of the importance of specialized models in vertical fields, particularly in healthcare [1] Industry Analysis - The medical field requires vertical models to achieve higher accuracy, with general models only reaching a passing score [1][2] - The relationship between general and vertical models is likened to that of a medical student and a specialized doctor, emphasizing the need for extensive practical experience [2] - Companies like 壹生检康 are focusing on developing specialized models to address the limitations of general models in specific medical scenarios [4][5] Model Development - 壹生检康 has been developing a gynecological vertical model, selecting a 32B parameter model as the optimal balance between computational resources and response effectiveness [5][7] - The training process involved multiple rounds, with the first round yielding a 50% accuracy rate, which improved to 77.1% after addressing data imbalance issues [6][13] - The final model demonstrated superior performance compared to existing models, particularly in diagnosing specific gynecological conditions [13][14] Application and Impact - The gynecological model aims to provide precise and professional services to end-users, addressing common health issues faced by young women [18] - The model is also designed to empower healthcare providers in resource-limited settings, enabling them to offer reliable gynecological consultations [18] - The use of reinforcement learning is suggested as a future direction to enhance the model's capabilities and expand its application to other medical fields [19]