Large Model Post-Training
Run a full reinforcement learning pipeline for 8 yuan: Luchen Cloud (潞晨云) reshapes the fine-tuning race, 1 algorithm engineer = 1 infra team
量子位· 2026-01-07 05:17
Yunzhong, from Aofeisi. QbitAI | WeChat official account QbitAI. In the second half of the large-model race, the fight has moved from brute-force pre-training to the post-training battlefield. The Luchen Cloud fine-tuning SDK is now officially open: the first fully open, Tinker-compatible Serverless fine-tuning platform in China. Built on the Tinker SDK open-sourced by Thinking Machines Lab, it has a single goal: to give complex, expensive reinforcement learning an industrial-grade solution with a real cost advantage. Embracing post-training and RL: decoupling the algorithm layer from the underlying compute architecture. Since OpenAI o1's breakthrough in reasoning, an industry consensus has formed. Whether it is o1's reasoning gains or the performance leap DeepSeek-R1 achieved through reinforcement learning (RL), the signal is clear: what sets a model's ceiling is no longer raw compute stacking but more precise fine-tuning and RL iteration. In other words, capability gains no longer come purely from piling up parameters in the pre-training stage; post-training, and especially reinforcement learning, is becoming the core battleground that decides a model's practical value. The reality, however, is harsh: complex distributed infrastructure, steep GPU rental costs, and tedious architecture tuning stand like walls that keep countless algorithm engineers out of the "alchemy lab". Now that wall is being torn down. Taking De ...
Tinker, the first startup product of OpenAI's former CTO, is now fully upgraded and open here, with free credits up for grabs
机器之心· 2026-01-07 05:16
Published by Machine Heart (机器之心). The Luchen Cloud fine-tuning SDK is fully open starting today; the first 150 users who register through the dedicated link receive a 30 yuan Token credit: https://cloud.luchentech.com/account/signup?invitation_code=JQZX When Thinking Machines Lab (TML), founded by former OpenAI CTO Mira Murati, used Tinker to abstract large-model training into a set of basic primitives such as forward backward and optimizer step, separating algorithm design from the distributed training infrastructure and turning "training" a large model into simple "function calls", the industry entered an upgrade from "workshop-style alchemy" to "industrialized fine-tuning". The Luchen Cloud fine-tuning SDK is now officially open: built on the Tinker SDK open-sourced by Thinking Machines Lab, it is the first Serverless fine-tuning platform in China that is both Tinker-compatible and fully open, offering an industrial-grade, cost-advantaged solution for complex and expensive reinforcement learning. Developers do not need to stockpile GPUs, and the whole rollout→reward→update pipeline is priced per Token, so that every ...
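To make the "function call" framing concrete, here is a minimal Python sketch of one rollout→reward→update cycle expressed through Tinker-style primitives. The client object and the method signatures (sample, forward_backward, optim_step) are assumptions chosen to mirror the primitives named above, not the actual Tinker or Luchen Cloud SDK surface; consult the official documentation for the real API.

```python
def rl_finetune_step(client, prompts, reward_fn, learning_rate=1e-5):
    """One rollout -> reward -> update cycle, written as primitive calls.

    `client` stands in for a Serverless fine-tuning handle. The method
    names mirror the primitives mentioned in the article, but these
    signatures are illustrative assumptions, not the real SDK.
    """
    rollouts = client.sample(prompts)                                # rollout: generate one response per prompt
    rewards = [reward_fn(p, r) for p, r in zip(prompts, rollouts)]   # reward: score each response locally
    client.forward_backward(prompts, rollouts, rewards)              # accumulate gradients on the service side
    client.optim_step(learning_rate=learning_rate)                   # update: apply a single optimizer step
    return rewards
```

Under an abstraction like this the developer never touches GPUs or distributed training code directly, which is what per-Token pricing of the rollout→reward→update loop is billing for.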
Huatai Securities Morning Briefing - 20251204
HTSC· 2025-12-04 01:43
Group 1: Macroeconomic Insights
- The Japanese central bank's potential interest rate hike in December could lead to an increase in government bond yields, influenced by high inflation and upcoming fiscal stimulus [2][3]
- Global macroeconomic and policy expectations have been recalibrated, with service sector PMIs in the US, Europe, and Japan remaining high, while manufacturing PMIs have declined [3]
- The market is experiencing fluctuations in response to the Federal Reserve's interest rate outlook, with mixed performances in US stock indices and a decline in oil prices [3]

Group 2: Fixed Income Analysis
- Cross-period price differences in interest rate derivatives are influenced by the CTD bond's coupon rate, full price, and three-month repo rates, along with market sentiment [4]
- The movement of contracts during the roll period indicates strong participation in positive spreads, leading to an initial increase in cross-period price differences [4]

Group 3: Consumer Sector Opportunities
- The consumer sector is witnessing structural changes driven by technology and innovation, with new consumption trends emerging in areas like trendy toys, beauty products, and ready-to-drink beverages [6]
- Investment strategies should focus on four main themes: the rise of domestic brands, technology-enabled consumption, emotional spending, and undervalued high-dividend blue-chip stocks [6]

Group 4: Aerospace and Defense
- The development of reusable rockets is crucial for reducing costs and increasing capacity in space activities, with companies like SpaceX leading the way [7]
- China's advancements in reusable rocket technology, such as the Zhuque-3 and Long March 12, are expected to enhance space launch capabilities and reduce costs [7]

Group 5: Energy Sector Analysis
- Xin'ao Energy's privatization process is progressing, with key regulatory approvals completed, and the company is showing strong operational performance in natural gas retail [8]
- The company's fundamentals are improving, supported by expanding projects and increasing customer penetration rates, leading to a positive long-term outlook [8]

Group 6: Rating Changes
- Recent adjustments in stock ratings include upgrades for companies like Hayuan Engineering and new buy ratings for firms such as Aerospace Intelligence Manufacturing and BOSS Zhipin, reflecting positive earnings forecasts [9]
Professor Peng Yijie's group at Peking University proposes RiskPO, reshaping large model post-training with risk-measure optimization
机器之心· 2025-10-15 02:54
Core Insights
- The article discusses the limitations of traditional reinforcement learning (RL) methods in enhancing the reasoning capabilities of large models, particularly highlighting the "mean optimization trap" that leads to a lack of exploration and ineffective learning on challenging tasks [4][24].
- A new approach called RiskPO is introduced, which integrates risk-averse principles into the optimization objective, focusing on the left tail of the reward distribution to guide models in overcoming reasoning shortcomings [7][24].

Research Background and Challenges
- The article outlines the challenges faced by large models in post-training, particularly the "mean optimization trap" that results in a loss of exploration ability and ineffective learning on difficult tasks [4][24].
- It emphasizes that existing methods, such as GRPO, have improved short-term metrics but have not expanded the reasoning boundaries necessary for complex tasks [4][24].

Technical Solution Overview
- The RiskPO approach combines "risk measurement" with a bundling strategy to address the shortcomings of traditional mean optimization [6][7].
- The core of this approach is the "Mixed Value at Risk (MVaR)" objective function, which emphasizes low-reward, difficult tasks instead of pursuing the overall mean reward (a simplified sketch of this idea follows this summary) [9][10].

Experimental Results
- The Peking University team demonstrated the effectiveness of RiskPO across various tasks, achieving significant improvements in reasoning capability, particularly on challenging problems [15][18].
- On AIME24, RiskPO outperformed GRPO by nearly 7 percentage points in Pass@32, and on the MATH500 dataset it achieved a Pass@1 score of 81.8%, surpassing GRPO by 2.6 percentage points [15][16].

Theoretical Support and Validation
- The performance improvements of RiskPO are backed by solid theoretical foundations and rigorous ablation studies, showing that risk-averse updates can effectively mitigate entropy collapse [20][21].
- The article highlights that while mean-based metrics may show similar performance early in training, risk-sensitive metrics reveal significant advantages for RiskPO as training progresses [23][24].

Comparison with Alternative Strategies
- A comparison with risk-seeking strategies demonstrated that focusing on easier tasks leads to rapid entropy collapse and stagnation in performance, while risk-averse strategies drive continuous improvement [26][27].
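To make the risk-averse idea concrete, below is a toy Python sketch that forms advantages from one prompt's group of sampled rewards while up-weighting the left (low-reward) tail. The quantile cutoff and the weighting scheme are illustrative assumptions, not RiskPO's exact MVaR objective.

```python
import numpy as np

def left_tail_weighted_advantages(rewards, alpha=0.25, tail_weight=2.0):
    """Toy risk-averse advantage weighting over one prompt's sampled rewards.

    Instead of optimizing the plain group mean (the "mean optimization trap"),
    responses whose rewards fall at or below the alpha-quantile get extra
    weight, so hard prompts keep contributing gradient signal. This is a
    simplified illustration, not the paper's MVaR formulation.
    """
    r = np.asarray(rewards, dtype=np.float64)
    var_alpha = np.quantile(r, alpha)              # VaR-style cutoff for the left tail
    weights = np.where(r <= var_alpha, tail_weight, 1.0)
    return weights * (r - r.mean())
```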
NeurIPS 25 | An advanced version of GRPO arrives: GVPO restructures the large model post-training paradigm
机器之心· 2025-10-14 02:06
Core Viewpoint
- Post-training of large models is becoming a key aspect of AI evolution, focusing on enhancing reasoning capabilities, aligning with human preferences, and maintaining stability and efficiency [1].

Summary by Sections

GVPO Introduction
- The team from Zuoyebang and Hong Kong University of Science and Technology proposed a new method called GVPO (Group Variance Policy Optimization) to address the instability issues of GRPO (Group Relative Policy Optimization) [2].

Design Motivation
- Inspired by DPO (Direct Preference Optimization), the research team aims to maximize rewards under KL constraints in the GRPO scenario, which involves multiple samplings for each prompt [5].

Practical Challenges
- A significant challenge is computing the expectation of the normalizer Z(x) over all possible samples, which is practically intractable. The team found that if the gradient weights of all samples under the same prompt sum to zero, Z(x) cancels out, avoiding this computation (a minimal sketch of the resulting MSE form follows this summary) [6].

Key Advantages of GVPO
1. **Unique Optimal Solution Guarantee**: GVPO's MSE form comes with a strict mathematical proof that it achieves a unique optimal solution when R_θ equals R, ensuring algorithmic effectiveness and stability [13].
2. **No Need for Importance Sampling**: GVPO's optimal solution places minimal restrictions on the sampling distribution, allowing off-policy training without the instability commonly associated with importance sampling [14].

Analytical Perspectives
- GVPO can be understood from three complementary analytical perspectives, each corresponding to an equivalent loss function:
1. **Negative Log-Likelihood Perspective (NLL)**: GVPO's loss function can be viewed as a weighted negative log-likelihood, allowing flexible integration of historical and heterogeneous data sources [17].
2. **Mean Squared Error Perspective (MSE)**: The optimization goal is to minimize the deviation between implicit and actual rewards, ensuring convergence to a unique global optimum under KL constraints [18].
3. **Reinforcement Learning Perspective (RL)**: This perspective highlights the three components of the GVPO loss function, emphasizing the balance between actual and predicted rewards [19].

Experimental Results
- In mathematical reasoning tasks, GVPO outperformed GRPO and its improved version Dr.GRPO across five benchmark tests, significantly enhancing the base model's performance [21].
- Ablation studies indicate GVPO's insensitivity to the hyperparameter β and its good scalability with increased sampling numbers, allowing smaller models to match larger ones [23].

Significance and Future Prospects
- GVPO represents a paradigm shift in post-training, moving from experience-driven approaches to those with theoretical guarantees, enhancing stability, flexibility, and efficiency in large model training [25][26].
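Reading the MSE perspective literally, a minimal PyTorch sketch of a group-centered loss for a single prompt might look like the following. The tensor shapes, the β coefficient, and the exact centering are assumptions based on the summary above, not the authors' reference implementation.

```python
import torch

def gvpo_style_loss(logp_theta, logp_ref, rewards, beta=0.1):
    """Group-centered MSE between implicit and actual rewards (one prompt).

    logp_theta, logp_ref: (G,) summed log-probabilities of G sampled
    responses under the current policy and the reference policy.
    rewards: (G,) scalar rewards for the same responses.

    Centering both the implicit reward beta * log(pi_theta / pi_ref) and the
    actual reward within the group makes the per-prompt normalizer Z(x) drop
    out, which is the zero-sum gradient-weight trick described above.
    """
    implicit = beta * (logp_theta - logp_ref)
    centered_implicit = implicit - implicit.mean()
    centered_reward = rewards - rewards.mean()
    return ((centered_implicit - centered_reward) ** 2).mean()
```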
Real AI competitiveness is hidden in the "post-training" step of large models
量子位· 2025-10-13 08:47
Core Insights
- The article emphasizes the importance of Post-Training as a transformative approach in AI, moving beyond simple model optimization to creating specialized intelligent engines tailored to specific business needs [1][4]
- The evolution of Post-Training technology is highlighted, showcasing a shift from Supervised Fine-Tuning (SFT) to Reinforcement Learning (RL) methodologies, which better align with complex business requirements [2][4]

Summary by Sections

Post-Training Evolution
- The industry's initial approach was SFT, which allowed models to learn specific domain knowledge and dialogue styles [2]
- However, SFT was insufficient for teaching models complex value judgments and strategic choices, which are critical in real business scenarios [3]
- The focus has shifted to RL, evolving from human-dependent methods (RLHF) to automated systems (RLVR) and the innovative use of Natural Language Rewards [4][5]

Implementation Pathway
- The article outlines a four-step pathway for enterprises to implement Post-Training effectively, addressing challenges such as data quality, high labeling costs, and defining reward signals [5][8]
- Successful case studies from companies like Zhihu, AutoHome, and Weibo illustrate practical applications of these steps, showcasing improvements in data quality and model performance [7][8]

Step 1: Data Preparation
- High-quality data is identified as the cornerstone of successful Post-Training, with companies spending 60-70% of their time on data preparation [10]
- Zhihu and AutoHome have developed methods to enhance data quality through pre-labeling and structured data utilization, respectively [11][13]

Step 2: Model Selection
- Choosing the right base model is crucial, with many companies opting for the Tongyi Qianwen series due to its performance and support for Post-Training [14][16]
- The model's architecture and open-source ecosystem facilitate easier implementation of Post-Training techniques [15][18]

Step 3: Reward Mechanism Design
- The design of a reward mechanism is essential for aligning model outputs with business objectives, transitioning from human feedback to automated verification systems (a minimal sketch of such a verifiable reward follows this summary) [24][25]
- Companies like Yingmi Fund are exploring ways to integrate expert decision-making frameworks into their models to enhance performance [26]

Step 4: Evaluation System
- A robust evaluation system is necessary to measure the effectiveness of Post-Training, with Yingmi Fund developing benchmarks to assess model performance in real-world scenarios [27][28]
- Successful implementations have led to significant improvements in model accuracy and business outcomes, as seen in the cases of Baifeng Cloud and Quark [30][32]

Conclusion
- The article concludes that the true competitive advantage in AI lies in how companies leverage their unique data and business insights through Post-Training to create proprietary intelligent engines [32]
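As a concrete, deliberately simple example of the automated verification behind RLVR-style reward design, the sketch below scores a model output by checking its extracted final answer against a reference. The "####" answer delimiter is a hypothetical output convention chosen for this illustration; the point is that the reward is a programmatic check rather than a human preference label, so it can run at training scale.

```python
import re

def verifiable_reward(model_output: str, reference_answer: str) -> float:
    """Rule-based reward: 1.0 if the extracted final answer matches, else 0.0."""
    match = re.search(r"####\s*(.+)", model_output)   # hypothetical "#### <answer>" convention
    if not match:
        return 0.0
    return 1.0 if match.group(1).strip() == reference_answer.strip() else 0.0
```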
Explainer: Deconstructing large model post-training in one article, and the past and present of GRPO and its successors
36Ke· 2025-09-01 04:38
Group 1
- The core of the article is the evolution of post-training methods for large language models, focusing on the GRPO algorithm as a significant advancement in reinforcement learning paradigms [2][46].
- GRPO has emerged as a general-purpose reinforcement learning algorithm applicable to a wide range of post-training tasks, with notable improvements over previous methods like PPO [2][48].
- The article discusses the importance of post-training in enhancing the adaptability and flexibility of models, addressing the limitations of pre-training alone [5][46].

Group 2
- The article highlights the transition from PPO to GRPO, emphasizing the reduction of computational cost and memory requirements that makes GRPO a more efficient alternative [18][14].
- GRPO estimates the advantage against a baseline computed from the group of sampled responses itself, eliminating the need for a separate value function (a minimal sketch follows this summary) [16][14].
- Despite its advantages, GRPO still faces stability issues, prompting further research and improved algorithms such as DAPO and GSPO [19][48].

Group 3
- DAPO, developed by ByteDance and Tsinghua AIR, builds on GRPO with enhancements such as Clip-Higher and dynamic sampling to improve training efficiency [20][21].
- GSPO represents a significant advancement by shifting the focus from token-level to sequence-level importance sampling, which enhances training stability [28][30].
- GFPO addresses the limitations of GRPO by allowing the simultaneous optimization of multiple response attributes, improving overall model performance [33][34].
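For reference, the group-relative baseline works roughly as in the Python sketch below: each response sampled for a prompt is scored against the statistics of its own group, so no learned critic is required. Exact normalization details vary across implementations; this is the commonly described recipe, not any one paper's code.

```python
import numpy as np

def grpo_group_advantages(rewards, eps=1e-6):
    """Group-relative advantage estimation in the spirit of GRPO.

    rewards: rewards of G responses sampled for the same prompt. Each
    response's advantage is its reward standardized against the group,
    which acts as the baseline in place of a value function.
    """
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)
```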
Explainer: Deconstructing large model post-training in one article, and the past and present of GRPO and its successors
机器之心· 2025-09-01 02:49
Core Viewpoint
- The article discusses the evolution and significance of the Group Relative Policy Optimization (GRPO) algorithm in the context of large language models and reinforcement learning, highlighting its advantages and limitations compared to previous methods like Proximal Policy Optimization (PPO) [4][38].

Summary by Sections

Development of Large Language Models
- The rapid advancement of large language models has led to the emergence of various post-training methods, with GRPO standing out as an innovation that enhances reinforcement learning paradigms [3][5].

Post-Training and Reinforcement Learning
- Post-training is crucial for refining a model's capabilities in specific domains, enhancing adaptability and flexibility to meet diverse application needs [12][11].
- Reinforcement learning, particularly from human feedback (RLHF), plays a vital role in the post-training phase, aiming to optimize model outputs based on user preferences [14][19].

GRPO and Its Advantages
- GRPO eliminates the need for a separate critic model, significantly reducing memory and computational cost compared to PPO, which requires two networks [30][35].
- GRPO derives the baseline for evaluating improvements from the group of responses sampled for each prompt, simplifying the training process [34][35].

Comparison of GRPO and PPO
- GRPO offers substantial improvements in memory requirements and training speed, making it a more efficient choice for large language model training [37].
- Despite its advantages, GRPO still faces stability issues similar to those of PPO, particularly in smaller-scale reinforcement learning tasks [39].

Recent Innovations: DAPO, GSPO, and GFPO
- DAPO introduces enhancements to GRPO, such as Clip-Higher and dynamic sampling, to address practical challenges encountered during training [41][42].
- GSPO advances the methodology by shifting importance sampling from the token level to the sequence level, significantly improving training stability (a sketch of the sequence-level ratio follows this summary) [48][49].
- GFPO allows simultaneous optimization of multiple response attributes, addressing GRPO's limitations with scalar feedback and multi-round reasoning tasks [61][63].

Conclusion
- The evolution of post-training methods, from PPO to GRPO and beyond, shows a clear trajectory in optimizing large language models, with GRPO serving as a pivotal point for further advances in the field [81][82].
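To illustrate the token-level versus sequence-level distinction, the sketch below forms one importance ratio per sampled response from summed token log-probabilities, length-normalized in the exponent. This is a reading of the sequence-level idea stated as an assumption, not the official GSPO implementation, and the clipping step that would normally follow is omitted.

```python
import torch

def sequence_level_ratios(logp_new, logp_old, lengths):
    """One importance ratio per response, rather than one per token.

    logp_new, logp_old: (B,) summed token log-probs of each sampled response
    under the updated policy and the behavior policy. lengths: (B,) token
    counts used to length-normalize the log-ratio before exponentiation.
    """
    return torch.exp((logp_new - logp_old) / lengths)
```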