Reinforcement Learning
How to Sustain the Scaling Law in the Post-Training Era? This Is the LLM Post-Training Survey You Should Read
机器之心· 2025-05-01 02:11
Core Insights
- The article discusses the significance of post-training techniques such as fine-tuning and reinforcement learning (RL) in enhancing the capabilities of large language models (LLMs) [1][2][5].

Summary by Sections

Overview of LLM Post-Training
- A recent review report on LLM post-training has gained positive feedback, compiling a resource library of related papers and tools that has received over 700 stars [2].
- The review includes contributions from institutions like UAE University of Artificial Intelligence, University of Central Florida, Google DeepMind, and the University of Oxford, covering techniques to enhance LLMs through RL, supervised fine-tuning, and evaluation benchmarks [2].

Challenges in LLMs
- Despite advancements, LLMs face issues such as generating misleading content (referred to as "hallucinations") and maintaining logical consistency in longer conversations [5].
- The reasoning capabilities of LLMs are debated: they operate on implicit statistical patterns rather than explicit logical reasoning, which can lead to failures even on simple logical tasks [5].

Training Phases of LLMs
- The training process of LLMs is divided into two main phases: pre-training and post-training. Pre-training focuses on next-token prediction over large datasets, while post-training involves multiple rounds of fine-tuning and alignment to improve model behavior and reduce biases [6].

Fine-Tuning Techniques
- Fine-tuning is essential for adapting pre-trained LLMs to specific tasks, enhancing their performance in areas like sentiment analysis and medical diagnosis. However, it carries risks of overfitting and high computational cost [7][10].
- Parameter-efficient techniques like Low-Rank Adaptation (LoRA) and adapters reduce computational overhead while still allowing models to specialize in specific tasks [10].

Reinforcement Learning in LLMs
- RL is introduced to improve LLM adaptability through dynamic feedback and optimization of sequential decisions. This differs from traditional RL settings, as LLMs select tokens from a vast vocabulary rather than a limited action space [9][11].
- Feedback in language-based RL is often sparse and subjective, relying on heuristic evaluations rather than clear performance metrics [13].

Scaling Techniques
- Scaling is crucial for enhancing LLM performance and efficiency, though it presents significant computational challenges. Techniques like Chain-of-Thought (CoT) reasoning and search-based methods help improve multi-step reasoning and factual accuracy [14][15].
- Despite advancements, challenges such as diminishing returns and increased inference time remain, necessitating targeted strategies for efficient deployment [15].

Evaluation Benchmarks
- Various benchmarks have been proposed to assess the performance of LLM post-training, ensuring a comprehensive understanding of their strengths and limitations across different tasks [46].
- These benchmarks play a vital role in improving response accuracy, robustness, and ethical compliance during the post-processing phase [46].

Future Directions
- The article highlights the growing interest in RL for optimizing LLMs since 2020, emphasizing the need for interactive methods and robust reward modeling to address challenges like reward hacking [52].
- Key areas for future research include personalized and adaptive LLMs, process versus outcome reward optimization, and the integration of dynamic reasoning frameworks to enhance model performance on complex queries [53].
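As a concrete illustration of the parameter-efficient fine-tuning the summary mentions, the low-rank update behind LoRA can be sketched in a few lines of NumPy. This is a minimal sketch of the idea, not any library's implementation; the layer sizes and rank below are arbitrary assumptions, not values from the survey.

```python
import numpy as np

# LoRA freezes the pretrained weight W and learns a low-rank update B @ A,
# so only r * (d_in + d_out) parameters are trained instead of d_in * d_out.
rng = np.random.default_rng(0)
d_in, d_out, r = 512, 512, 8                # hypothetical sizes; r << d_in

W = rng.standard_normal((d_out, d_in))      # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01   # trainable down-projection
B = np.zeros((d_out, r))                    # trainable up-projection, zero init

def lora_forward(x):
    # Effective weight is W + B @ A, but it is never materialized:
    return W @ x + B @ (A @ x)

x = rng.standard_normal(d_in)
print("trainable params:", r * (d_in + d_out), "vs full fine-tune:", d_in * d_out)
```

Because B starts at zero, the adapted model initially reproduces the frozen model exactly, and only the small A/B matrices move during fine-tuning.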
Accumulating Insights for Reproducing R1 from Papers
理想TOP2· 2025-04-30 13:04
Core Viewpoint
- The article discusses advancements in reinforcement learning (RL) techniques for large language models (LLMs), emphasizing the need for improved algorithms, reward design, and training strategies to enhance reasoning capabilities and model performance.

Group 1: Algorithm Improvements
- Current algorithms have significant room for improvement, with the introduction of Dr. GRPO addressing issues in GRPO related to response length bias and problem difficulty bias, leading to better token efficiency and reasoning performance [3][4].
- The DAPO method is proposed to tackle entropy collapse and sample efficiency issues in GRPO and PPO, enhancing training stability and efficiency through techniques like Clip-Higher and dynamic sampling [6].

Group 2: Training Strategies
- Larger training batch sizes (e.g., TBS = 1024) enhance training efficiency and stability, while on-policy strategies are more advantageous than off-policy ones for model exploration [6].
- Increasing rollout counts (e.g., n = 64) improves training outcomes and encourages longer responses, and a dynamic annealing strategy for the KL penalty is recommended to balance exploration and stability [6].

Group 3: Reward Design
- Early reward design flaws led to various reward hacking behaviors, necessitating a refined reward system that includes format and answer rewards to constrain model behavior and avoid cheating [6].
- The relationship between response length and reasoning ability is not causal; longer responses may provide more exploration space but do not directly enhance reasoning performance [6].

Group 4: Generalization and Learning
- RL is more effective than supervised fine-tuning (SFT) in promoting generalization across tasks, suggesting that reasoning can be a universal capability stimulated by specific tasks [7][9].
- Combining rule-based rewards with reward-model-based rewards is beneficial, especially in tasks without clear answers, to enhance learning and mitigate reward hacking [9].
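The group-relative advantage at the core of GRPO, and the simpler centering that Dr. GRPO substitutes for it, can be sketched as follows. The rewards are toy values, and both methods also involve per-token loss terms and clipping not shown here; this is an illustration of the advantage computation only.

```python
import numpy as np

def grpo_advantages(rewards):
    """GRPO: sample a group of responses for one prompt and normalize each
    reward against the group mean and std, so no value network is needed."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

def dr_grpo_advantages(rewards):
    """Dr. GRPO drops the std normalization (and, in the full method, the
    per-token length normalization), reducing the bias toward very easy or
    very hard prompts and toward overly long responses."""
    r = np.asarray(rewards, dtype=float)
    return r - r.mean()

# One group of 8 sampled responses for the same prompt (toy 0/1 rewards):
rewards = [1, 0, 0, 1, 1, 0, 0, 0]
print(grpo_advantages(rewards))
print(dr_grpo_advantages(rewards))
```

In both variants the advantages of a group sum to zero, so correct responses are pushed up exactly as much as incorrect ones are pushed down; the difference is only in how the magnitude scales with group difficulty.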
A Full Comparison of New EV Makers' AI Models: Xiaopeng's Ambition, Li Auto's Pragmatism, NIO Playing Catch-Up
21 Shi Ji Jing Ji Bao Dao· 2025-04-29 12:07
Core Insights
- The rapid development of AI models in the automotive sector is highlighted by the emergence of large-scale models like Xiaopeng's 72 billion parameter model and Li Auto's 22 billion parameter MindVLA model, indicating a competitive race among new automotive players [1][2][21]
- Xiaopeng's strategy focuses on cloud-based model training and distillation to overcome limitations in on-vehicle computing power, while Li Auto emphasizes practical applications with its VLA model [2][12][21]
- NIO appears to lag behind in the AI model race, having not made significant advancements since the introduction of its NWM model, which is still not widely deployed [4][18][21]

Xiaopeng's AI Strategy
- Xiaopeng is developing a "world base model" that uses a large language model (LLM) backbone and extensive multimodal driving data, aiming for comprehensive understanding of and interaction with the physical world [1][8]
- The "cloud model factory" allows for rapid iteration cycles of about five days, leveraging powerful AI infrastructure and data processing capabilities [2][13]
- Xiaopeng's approach includes reinforcement learning to enhance the model's ability to handle extreme scenarios, which is crucial for autonomous driving [9][17]

Li Auto's Approach
- Li Auto's MindVLA model is designed to interact with the physical world, similar to robotics, and is deployed directly on vehicles [2][14]
- The company has successfully implemented an end-to-end system that has been emulated by other automakers, showcasing its leadership in the field [14][15]
- Li Auto's focus on practical applications and user feedback is evident in its development of a model that aligns with human driving behavior [17][21]

NIO's Position
- NIO's NWM model aims to enhance spatial understanding and predictive capabilities but has faced delays in large-scale deployment due to organizational changes and regulatory challenges [4][18]
- The company is leveraging a "crowd intelligence" approach, utilizing data from its fleet to improve model training and safety features [20][21]
- Despite slower progress, NIO emphasizes safety and has implemented advanced safety features, positioning itself as a cautious player in the competitive landscape [20][21]

Industry Trends
- The automotive industry is witnessing a shift from traditional mapping to end-to-end AI models, with companies exploring various technical paths to enhance autonomous driving capabilities [4][5]
- The performance of language models is showing diminishing returns as parameter sizes increase, prompting a move toward multimodal models by major tech players [4][5]
- The competition among Xiaopeng, Li Auto, and NIO reflects broader industry trends, where technological ambition, practical application, and safety considerations are critical for success [21]
A Conversation with Pokee.ai's Zhu Zheqing: Reinforcement Learning at the Core, a Minority Approach to Building Agents
晚点LatePost· 2025-04-29 08:43
It may be a more efficient, cheaper path to building an Agent.

By Sun Haining | Edited by Cheng Manqi

Mainstream AI Agents use a large language model (LLM, or its multimodal version) as the "brain," relying on one or several LLMs to orchestrate work and call tools. But there is another path: the Agent's planning and execution rely on a reinforcement learning model that does not depend on natural language, and the LLM serves only as the "interaction layer" between the Agent and humans.

This unconventional idea comes from Pokee.ai, founded last October and still only four full-time employees strong.

Pokee.ai founder Zhu Zheqing has more than a decade of experience in reinforcement learning research and deployment. After graduating in computer science from Duke University, from 2017 on he pursued a PhD in reinforcement learning at Stanford University under Benjamin Van Roy while also working at Meta, where he headed the "Applied Reinforcement Learning" team. Using RL algorithms to improve the content recommendation system, he grew a department that had shrunk to three people and was once slated for shutdown into one of more than ten, adding $500 million in revenue for Meta.

Relying on LLMs for planning and decision-making is a natural, mainstream idea. OpenAI Operator's ability to interact with web pages and operate a computer is built on the GPT-4o model, and Manus completes tasks by using the Claude 3.5 Sonnet model for long-horizon planning. ...
Four Engineers Take On a Gynecological Diagnostic Reasoning Model, Achieving Higher Accuracy with Fewer Parameters
Tai Mei Ti APP· 2025-04-29 02:22
Core Insights
- The article discusses the "resource misalignment battle" in the AI sector, where large companies focus on parameter upgrades while smaller startups target niche markets that larger firms overlook [1]
- The medical industry is highlighted as a high-risk area with stringent accuracy requirements, making it difficult for general models to meet specific needs [1]
- There is a growing recognition among AI companies of the importance of specialized models in vertical fields, particularly in healthcare [1]

Industry Analysis
- The medical field requires vertical models to achieve higher accuracy, with general models only reaching a passing score [1][2]
- The relationship between general and vertical models is likened to that of a medical student and a specialized doctor, emphasizing the need for extensive practical experience [2]
- Companies like 壹生检康 are focusing on developing specialized models to address the limitations of general models in specific medical scenarios [4][5]

Model Development
- 壹生检康 has been developing a gynecological vertical model, selecting a 32B parameter model as the optimal balance between computational resources and response effectiveness [5][7]
- The training process involved multiple rounds, with the first round yielding a 50% accuracy rate, which improved to 77.1% after data imbalance issues were addressed [6][13]
- The final model demonstrated superior performance compared to existing models, particularly in diagnosing specific gynecological conditions [13][14]

Application and Impact
- The gynecological model aims to provide precise and professional services to end users, addressing common health issues faced by young women [18]
- The model is also designed to empower healthcare providers in resource-limited settings, enabling them to offer reliable gynecological consultations [18]
- Reinforcement learning is suggested as a future direction to enhance the model's capabilities and extend its application to other medical fields [19]
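The article attributes the jump from 50% to 77.1% accuracy to fixing data imbalance but does not describe 壹生检康's actual pipeline. A generic oversampling remedy, with entirely hypothetical labels and data, might look like this:

```python
import random
from collections import Counter, defaultdict

def oversample(examples, label_of, seed=0):
    """Duplicate minority-class examples until every label is as frequent
    as the most common one: a simple remedy for class imbalance."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for ex in examples:
        by_label[label_of(ex)].append(ex)
    target = max(len(items) for items in by_label.values())
    balanced = []
    for label, items in by_label.items():
        balanced.extend(items)
        balanced.extend(rng.choices(items, k=target - len(items)))
    rng.shuffle(balanced)
    return balanced

# Toy training set skewed 5:1 toward one (made-up) diagnosis label:
data = [("case text", "common")] * 5 + [("case text", "rare")]
balanced = oversample(data, label_of=lambda ex: ex[1])
print(Counter(ex[1] for ex in balanced))
```

Alternatives with the same intent include class-weighted losses or collecting more minority-class data; which of these the team actually used is not stated in the article.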
Shanghai Auto Show | Momenta Reaches Strategic Partnerships with Six Major Brands; Cumulative Mass-Production Models Exceed 130
Guan Cha Zhe Wang· 2025-04-29 01:48
Core Insights
- Momenta announced further strategic collaborations with six major brands during the Shanghai Auto Show, including General Motors Buick, FAW Toyota, Honda China, Cadillac, SAIC Audi, and Zhiji [1][3]
- The company has seen a significant increase in the number of mass-produced models delivered, from 1 model in 2022 to 8 in 2023, and is projected to reach 26 models in 2024 [3]
- Momenta's cumulative number of cooperative mass-produced models has exceeded 130, with an accelerating rate of successful deliveries [3]

Delivery and Growth Metrics
- The first 100,000 units equipped with Momenta's technology took two years to achieve, while the second 100,000 units were completed in just six months [3]
- The company expects to complete the third batch of nearly 100,000 units by May of this year [3]

Global Partnerships
- Momenta's partners now include major global automakers such as Honda, Nissan, Chery, Audi, Volkswagen, and Cadillac, indicating a broad market reach [3]

Technological Advancements
- The "Flywheel Model" is a key upgrade in Momenta's algorithm capabilities, with plans to launch the end-to-end Momenta R6 Flywheel Model based on reinforcement learning in the second half of this year [5]
- Momenta's intelligent driving solutions do not require high-precision maps, an advantage for deployment across global markets [5]

Focus on Robotaxi Development
- Momenta is focusing on the development of autonomous Robotaxi services, addressing the challenge of safety standards for large-scale deployment [7]
- The company aims to achieve safety levels for Robotaxi operations that are equivalent to or exceed human driving standards as fleet sizes grow [7]
- The first mass-produced Robotaxi solution is set to launch this year, utilizing existing sensors and computing units to reduce costs [7]
- The initial batch of unmanned Robotaxis is expected to enter trial operations by the end of 2025, offering users automated driving services [7]
Tiny Tic-Tac-Toe Stumps Large Models?? Karpathy Gets Called Out Online by OpenAI
量子位· 2025-04-28 03:43
Kelexi, reporting from Aofeisi
QbitAI | WeChat official account QbitAI

After Pokémon, getting large models to play tic-tac-toe has become the latest popular challenge.

It started when a user on X complained that large models play Pokémon poorly, and Karpathy picked up the post: stop watching Pokémon; it is more interesting to have large models play tic-tac-toe, because they can't.

Karpathy's remark drew a crowd of onlookers. Some were surprised, some analyzed the cause, and others said the classic observation keeps proving itself: tasks that are easy for humans are hard for machines, while tasks that are hard for humans are easy for machines.

Not everyone was convinced, though, including OpenAI's Noam Brown, who said o3 has no problem playing tic-tac-toe and can even play from an image of the board.

Large models take on tic-tac-toe

We also gave it a try, playing against o3 in different ways. The first approach used O and X for the pieces and - for empty cells, feeding the complete board to o3 each turn and asking it to reply in the same format. After thinking for about 12 seconds, o3 first took the center of the board; after our move, o3 thought for another 23 seconds and placed its second X. The next two turns went the same way; in fact, once o3 held two cells on a diagonal it had already locked in the win. Interestingly, though, even after completing a line, o3 did not realize it had already won. ...
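The text protocol used against o3 (O and X for pieces, - for empty cells) and the win check o3 itself missed are straightforward to pin down in code. The board position below is a toy example for illustration, not the exact game from the article:

```python
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),   # rows
         (0, 3, 6), (1, 4, 7), (2, 5, 8),   # columns
         (0, 4, 8), (2, 4, 6)]              # diagonals

def render(board):
    """Serialize a 9-cell board ('X'/'O'/'-') into the 3-line text format."""
    return "\n".join("".join(board[i:i + 3]) for i in (0, 3, 6))

def winner(board):
    """Return 'X' or 'O' if any row, column, or diagonal is complete."""
    for a, b, c in LINES:
        if board[a] != "-" and board[a] == board[b] == board[c]:
            return board[a]
    return None

# A toy position: X holds the center and completes the main diagonal.
board = list("XO-" "OX-" "--X")
print(render(board))
print("winner:", winner(board))
```

Checking eight fixed lines after every move is all it takes; the episode's point is that a model can produce legal, even strong moves while skipping exactly this kind of terminal-state check.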
Major Release | Fudan's "Large Language Models: From Theory to Practice (2nd Edition)" Fully Upgraded, Focused on the AI Frontier
机器之心· 2025-04-28 01:26
Published by 机器之心 (机器之心 editorial department)

"Large Language Models: From Theory to Practice (2nd Edition)" is a professional technical book that weighs theory and practice equally, and an indispensable knowledge reference for the AI era. Anyone can find their own growth path in it.

Today, as the AI wave sweeps the globe, large language models are driving technological progress and industrial transformation at an unprecedented pace. From ChatGPT to applications across industries, LLMs have not only reshaped human-computer interaction but also become a key technology driving academic research and industrial innovation. Facing this rapidly evolving technical landscape, systematically understanding its theoretical foundations and mastering its core algorithms and engineering practice has become required study for every AI practitioner, researcher, and university student.

In September 2023, the Fudan University research team of Zhang Qi, Gui Tao, Zheng Rui, and Huang Xuanjing officially released "Large Language Models: From Theory to Practice" to global academia and industry. In just two years, large language models have made important progress in theoretical research, pre-training methods, post-training techniques, and interpretability. Industry research on LLMs has deepened, gradually revealing many characteristics that differ from traditional deep learning and natural language processing paradigms. For example, a large language model can learn from just 60 examples and display strong question-answering ability, showing astonishing generalization. However, the book's authors have also found that large language models have a degree of fragility. For example, in a model with 13 billion parameters ...
In Depth | Tsinghua Yao Class Standout and OpenAI's Yao Shunyu: AI's Second Half Shifts from "Algorithm Competition" to "Defining Utility," Rebuilding Evaluation Frameworks to Turn Technical Capability into Real-World Value
Z Potentials· 2025-04-25 03:05
Core Insights
- The article discusses the transition of AI from a phase focused on model innovation and benchmark testing to a new phase emphasizing problem definition and evaluation [3][23][30]
- It highlights the importance of reinforcement learning achieving generalization capabilities, allowing it to tackle diverse tasks previously thought to be unrelated [3][4][21]

Group 1: AI's First Half
- The first half of AI's development was characterized by significant breakthroughs in training methods and models, such as Transformer and GPT-3, which focused on improving model performance on benchmarks [4][5][7]
- The emphasis was on creating new models rather than defining tasks, leading to a cycle of developing increasingly difficult benchmarks that could be solved with existing methods [7][8][23]

Group 2: Breakthrough Formula
- The effective formula for AI's success combines large-scale language pre-training, scaling (data and compute), and the integration of reasoning and action [9][14]
- The realization that prior knowledge is crucial for generalization has shifted the focus from solely algorithm development to understanding the environment and prior knowledge [15][21]

Group 3: Transition to the Second Half
- The second half of AI will focus on redefining evaluation frameworks and creating new assessment methods that reflect real-world applications rather than just benchmark performance [26][27][29]
- The industry faces the "utility problem," where existing evaluation frameworks do not align with real-world tasks, necessitating a reevaluation of how AI's effectiveness is measured [27][29]

Group 4: Future Directions
- The new game in AI's second half involves leveraging existing formulas to solve real-world tasks while innovating new components to enhance those formulas [32]
- Companies will need to form new hypotheses that challenge existing paradigms to achieve significant breakthroughs and build products worth billions or trillions [30][32]
Zhuoyu Technology Integrates the Tongyi Large Model, Jointly Building an End-to-End World Model
阿里云· 2025-04-24 09:13
Core Insights
- The article highlights the collaboration between Zhuoyu Technology and Alibaba Cloud, focusing on the integration of the Tongyi large model and the development of an end-to-end world model [1][2]
- Zhuoyu's end-to-end world model incorporates reinforcement learning and chain reasoning technology, enhancing safety in urban navigation and enabling personalized driving styles and natural language interaction [2]

Summary by Sections

**Integration with Alibaba Cloud**
- Zhuoyu Technology has fully migrated its core business systems, including big data and intelligent manufacturing, to Alibaba Cloud [1]
- The company has established a GPU resource pool on the Alibaba Cloud PAI platform to meet the high computational demands of its model training [2]

**Model Training Efficiency**
- The training method combines pre-training and post-training, improving training efficiency by more than 50% compared with single GPU clusters [2]
- GPU utilization has been raised above 95% thanks to the serverless capabilities of the Alibaba Cloud PAI platform, which simplifies cluster operations and keeps the training process fully observable [2]

**Development Acceleration**
- In research and development, Zhuoyu has integrated Tongyi Lingma and Tongyi Qianwen to accelerate development, reaching a code adoption rate of 29% [2]