Large Model Post-Training
Professor Peng Yijie's group at Peking University proposes RiskPO, reshaping large model post-training with risk-measure optimization
机器之心· 2025-10-15 02:54
[Figure: learning performance on AIME 2024]

Technical overview: breaking the impasse with risk measures, combining MVaR with a bundling strategy. To fix the flaws of conventional mean-based optimization, the Peking University team proposes RiskPO. Its core idea is to build risk aversion (risk-averse optimization) into the training objective, replacing "chase the overall mean" with "focus on the left tail of the reward distribution (the hard tasks)", which steers the model to break through its reasoning bottlenecks at the root.

The work comes from Professor Peng Yijie's research group at Peking University; the first author is Ren Tao, and the other authors include Jiang Jinyang and Yang Hui.

Research background and challenges: large model post-training is stuck in a "mean trap", and reasoning ability struggles to break through. With reinforcement learning (RL) now the core tool for post-training, reinforcement learning with verifiable rewards (RLVR), built on objective binary feedback (e.g. whether a solution is right or wrong), has quickly become the mainstream recipe for improving reasoning. From math problem solving to code generation, RLVR was supposed to push models past the limits of "sampling answers they already know" toward genuine deep reasoning. In reality, however, mainstream methods represented by GRPO are falling into a "mean-optimization trap".

These mean-based strategies over-focus on high-probability output sequences while ignoring low-probability but information-dense reasoning paths: entropy collapse appears early in training and the model loses its capacity to explore too soon; on hard problems where every sampled answer is wrong, the advantage function goes straight to zero and the model cannot learn anything on exactly its weakest spots. The net result is that the model appears to improve on Pass@1 ...
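As a rough illustration of the risk-averse idea described above, the sketch below aggregates a group of sampled rewards so that the hardest outcomes dominate the signal instead of being averaged away. It is a minimal sketch of left-tail weighting in general, not the exact RiskPO/MVaR objective; the quantile level `alpha`, the `tail_weight`, and the example rewards are assumptions.

```python
import numpy as np

def mean_objective(rewards):
    """Standard group objective: the plain mean of sampled rewards."""
    return float(np.mean(rewards))

def left_tail_objective(rewards, alpha=0.25, tail_weight=3.0):
    """Illustrative risk-averse objective (not the exact RiskPO/MVaR definition).

    Rewards at or below the alpha-quantile (the "hard" outcomes) are
    up-weighted, so prompts the model keeps failing dominate the signal
    instead of being averaged away.
    """
    rewards = np.asarray(rewards, dtype=float)
    threshold = np.quantile(rewards, alpha)              # left-tail cutoff (VaR-like)
    weights = np.where(rewards <= threshold, tail_weight, 1.0)
    return float(np.sum(weights * rewards) / np.sum(weights))

# Example: a group of verifiable (0/1) rewards for one hard prompt.
group_rewards = [0, 0, 1, 0, 0, 0, 1, 0]
print(mean_objective(group_rewards))       # 0.25: the plain mean hides the failures
print(left_tail_objective(group_rewards))  # 0.1: the failed samples dominate the objective
```

In a full policy-gradient loop, such a tail-weighted aggregate would replace the plain group mean when forming the training signal, so gradient updates concentrate on prompts the model still fails.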
NeurIPS 25 | An upgraded GRPO is here: GVPO reworks the large model post-training paradigm
机器之心· 2025-10-14 02:06
Large model post-training is becoming a key link in the evolution of AI. From the earliest SFT (supervised fine-tuning) to the recently popular GRPO, one main thread runs throughout: how to give large models stronger reasoning ability and better alignment with human preferences while remaining stable and efficient. However, although GRPO has shone in projects such as DeepSeek-R1, its training instability and hyperparameter sensitivity have kept limiting large-scale adoption.

Now the 作业帮 (Zuoyebang) team, together with the Hong Kong University of Science and Technology (Guangzhou), has proposed a new method at NeurIPS 2025: GVPO (Group Variance Policy Optimization). GVPO resolves GRPO's stability problem by avoiding importance sampling, provides a theoretical guarantee of a unique optimal solution, and outperforms existing methods across the board in experiments.

Paper title: GVPO: Group Variance Policy Optimization for Large Language Model Post-Training

GVPO design motivation. Inspired by DPO, the research team wanted to reuse, in the GRPO setting (multiple samples per prompt), the analytical solution of reward maximization under a KL constraint: $R_{\the ...
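For reference, the truncated formula at the end of this summary is presumably the standard closed-form solution of KL-constrained reward maximization that DPO-style derivations use; in its commonly written form,

$$\pi^{*}(y\mid x)=\frac{1}{Z(x)}\,\pi_{\mathrm{ref}}(y\mid x)\exp\!\left(\frac{R(x,y)}{\beta}\right),
\qquad
R(x,y)=\beta\log\frac{\pi^{*}(y\mid x)}{\pi_{\mathrm{ref}}(y\mid x)}+\beta\log Z(x),$$

where $\pi_{\mathrm{ref}}$ is the reference policy, $\beta$ the KL coefficient, and $Z(x)$ the per-prompt partition function. How GVPO exploits this identity within a group of samples per prompt is detailed in the paper itself.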
Real AI competitiveness lies hidden in the "post-training" step of large models
量子位· 2025-10-13 08:47
Core Insights
- The article emphasizes the importance of Post-Training as a transformative approach in AI, moving beyond simple model optimization to creating specialized intelligent engines tailored to specific business needs [1][4]
- The evolution of Post-Training technology is highlighted, showcasing a shift from Supervised Fine-Tuning (SFT) to Reinforcement Learning (RL) methodologies, which better align with complex business requirements [2][4]

Summary by Sections

Post-Training Evolution
- The initial approach in the industry was SFT, which allowed models to learn specific domain knowledge and dialogue styles [2]
- However, SFT was insufficient for teaching models complex value judgments and strategic choices, which are critical in real business scenarios [3]
- The focus has shifted to RL, evolving from human-dependent methods (RLHF) to automated systems (RLVR) and the innovative use of Natural Language Rewards [4][5]

Implementation Pathway
- The article outlines a four-step pathway for enterprises to implement Post-Training effectively, addressing challenges such as data quality, high labeling costs, and defining reward signals [5][8]
- Successful case studies from companies like Zhihu, AutoHome, and Weibo illustrate practical applications of these steps, showcasing improvements in data quality and model performance [7][8]

Step 1: Data Preparation
- High-quality data is identified as the cornerstone of successful Post-Training, with companies spending 60-70% of their time on data preparation [10]
- Zhihu and AutoHome have developed methods to enhance data quality through pre-labeling and structured data utilization, respectively [11][13]

Step 2: Model Selection
- Choosing the right base model is crucial, with many companies opting for the Tongyi Qianwen series due to its performance and support for Post-Training [14][16]
- The model's architecture and open-source ecosystem facilitate easier implementation of Post-Training techniques [15][18]

Step 3: Reward Mechanism Design
- The design of a reward mechanism is essential for aligning model outputs with business objectives, transitioning from human feedback to automated verification systems (a minimal sketch of such a verifiable reward follows this summary) [24][25]
- Companies like Yingmi Fund are exploring ways to integrate expert decision-making frameworks into their models to enhance performance [26]

Step 4: Evaluation System
- A robust evaluation system is necessary to measure the effectiveness of Post-Training, with Yingmi Fund developing benchmarks to assess model performance in real-world scenarios [27][28]
- Successful implementations have led to significant improvements in model accuracy and business outcomes, as seen in the cases of Baifeng Cloud and Quark [30][32]

Conclusion
- The article concludes that the true competitive advantage in AI lies in how companies leverage their unique data and business insights through Post-Training to create proprietary intelligent engines [32]
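The "automated verification" direction in Step 3 can be made concrete with a rule-based reward in the RLVR spirit. This is a minimal sketch, not any particular company's system; the `Answer:` extraction pattern and the 0/1 reward values are assumptions.

```python
import re

def verifiable_reward(model_output: str, reference_answer: str) -> float:
    """Rule-based binary reward: 1.0 if the extracted final answer matches
    the reference exactly, else 0.0.

    Assumes responses end with a line like "Answer: <value>"; real systems
    use task-specific parsers, unit tests, or execution-based checkers.
    """
    match = re.search(r"Answer:\s*(.+)", model_output)
    if match is None:
        return 0.0  # no parsable final answer, so no reward
    return 1.0 if match.group(1).strip() == reference_answer.strip() else 0.0

# Usage: score sampled responses against a known ground truth.
print(verifiable_reward("Let x = 7, so 6x = 42. Answer: 42", "42"))  # 1.0
print(verifiable_reward("I think the result is 41", "42"))           # 0.0
```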
Explainer: deconstructing large model post-training in one article, the past and present of GRPO and its successors
36Kr· 2025-09-01 04:38
Group 1
- The core concept of the article revolves around the evolution of post-training methods in large language models, particularly focusing on the GRPO algorithm as a significant advancement in reinforcement learning paradigms [2][46].
- GRPO has emerged as a universal reinforcement learning algorithm applicable to a wide range of post-training tasks, with notable improvements over previous methods like PPO [2][48].
- The article discusses the importance of post-training in enhancing the adaptability and flexibility of models, addressing the limitations of pre-training alone [5][46].

Group 2
- The article highlights the transition from PPO to GRPO, emphasizing the reduction of computational costs and memory requirements, making GRPO a more efficient alternative [18][14].
- GRPO's methodology involves using historical performance data to establish a baseline for advantage estimation, eliminating the need for a separate value function (a sketch of this group-relative advantage follows this summary) [16][14].
- Despite its advantages, GRPO still faces stability issues, prompting further research and development of improved algorithms like DAPO and GSPO [19][48].

Group 3
- DAPO, developed by ByteDance and Tsinghua AIR, builds upon GRPO by introducing enhancements such as Clip-Higher and dynamic sampling to improve training efficiency [20][21].
- GSPO represents a significant advancement by shifting the focus from token-level to sequence-level importance sampling, which enhances training stability [28][30].
- GFPO addresses the limitations of GRPO by allowing for the simultaneous optimization of multiple response attributes, thus improving the overall performance of models [33][34].
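To make the Group 2 point about baselines concrete, here is a minimal sketch of the group-relative advantage commonly given for GRPO (reward minus the group mean, scaled by the group standard deviation); the epsilon guard and the example rewards are assumptions.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: normalize each sampled response's reward
    against the statistics of its own group (responses to the same prompt),
    so no separate learned critic is needed as a baseline."""
    rewards = np.asarray(rewards, dtype=float)
    baseline = rewards.mean()        # the group mean plays the critic's role
    scale = rewards.std() + eps      # guard against a zero-spread group
    return (rewards - baseline) / scale

# Example: eight responses sampled for one prompt, scored 0/1 by a verifier.
print(group_relative_advantages([1, 0, 0, 1, 0, 0, 0, 0]))
# If every response gets the same reward (all 0 or all 1), every advantage is
# ~0, which is the "no signal on uniformly failed prompts" issue noted in the
# RiskPO summary above.
```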
Explainer: deconstructing large model post-training in one article, the past and present of GRPO and its successors
机器之心· 2025-09-01 02:49
Core Viewpoint
- The article discusses the evolution and significance of the Group Relative Policy Optimization (GRPO) algorithm in the context of large language models and reinforcement learning, highlighting its advantages and limitations compared to previous methods like Proximal Policy Optimization (PPO) [4][38].

Summary by Sections

Development of Large Language Models
- The rapid advancement of large language models has led to the emergence of various post-training methods, with GRPO being a notable innovation that enhances reinforcement learning paradigms [3][5].

Post-Training and Reinforcement Learning
- Post-training is crucial for refining models' capabilities in specific domains, enhancing adaptability and flexibility to meet diverse application needs [12][11].
- Reinforcement learning, particularly through human feedback (RLHF), plays a vital role in the post-training phase, aiming to optimize model outputs based on user preferences [14][19].

GRPO and Its Advantages
- GRPO eliminates the need for a separate critic model, reducing memory and computational costs significantly compared to PPO, which requires dual networks [30][35].
- The GRPO framework utilizes historical performance data to establish a baseline for evaluating model improvements, thus simplifying the training process [34][35].

Comparison of GRPO and PPO
- GRPO offers substantial improvements in memory requirements and training speed, making it a more efficient choice for large language model training [37].
- Despite its advantages, GRPO still faces stability issues similar to those of PPO, particularly in smaller-scale reinforcement learning tasks [39].

Recent Innovations: DAPO, GSPO, and GFPO
- DAPO introduces enhancements to GRPO, such as Clip-Higher and dynamic sampling, to address practical challenges encountered during training [41][42].
- GSPO advances the methodology by shifting the focus from token-level to sequence-level importance sampling, significantly improving training stability (a sketch contrasting the two kinds of ratio follows this summary) [48][49].
- GFPO allows for simultaneous optimization of multiple response attributes, addressing limitations of GRPO related to scalar feedback and multi-round reasoning tasks [61][63].

Conclusion
- The evolution of post-training methods, from PPO to GRPO and beyond, illustrates a clear trajectory in optimizing large language models, with GRPO serving as a pivotal point for further advancements in the field [81][82].
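The token-level versus sequence-level distinction in the GSPO bullet can be sketched as follows. The sketch follows the commonly cited length-normalized (geometric-mean) form of the sequence ratio rather than any single paper's exact notation, and the per-token log-probabilities are illustrative.

```python
import numpy as np

def token_level_ratios(logp_new, logp_old):
    """PPO/GRPO-style importance weights: one ratio per token."""
    return np.exp(np.asarray(logp_new) - np.asarray(logp_old))

def sequence_level_ratio(logp_new, logp_old):
    """GSPO-style weight (as commonly presented): a single ratio for the whole
    response, the length-normalized (geometric-mean) likelihood ratio, which
    damps the variance that any single token can inject into the update."""
    diff = np.asarray(logp_new) - np.asarray(logp_old)
    return float(np.exp(diff.mean()))

# Illustrative per-token log-probs of one response under the new and old policy.
logp_old = [-1.2, -0.8, -2.0, -1.5]
logp_new = [-1.0, -0.9, -1.2, -1.5]

print(token_level_ratios(logp_new, logp_old))   # per-token ratios; one spikes near 2.2
print(sequence_level_ratio(logp_new, logp_old)) # one smoothed ratio (~1.25) for the sequence
```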