Large Model Post-Training
Run a Full Reinforcement Learning Pipeline for 8 Yuan: Luchenyun Reshapes the Fine-Tuning Race, 1 Algorithm Engineer = 1 Infra Team
量子位· 2026-01-07 05:17
Core Viewpoint
- The article discusses the shift in large model training from brute-force pre-training to post-training, emphasizing the importance of fine-tuning and reinforcement learning (RL) in enhancing model performance [1][2]

Group 1: Post-Training and Reinforcement Learning
- The industry consensus is that breakthroughs in large model capabilities now rely more on post-training, particularly RL, than on pre-training parameter accumulation alone [7]
- DeepSeek-R1's improvement on the AIME mathematical reasoning benchmark, with pass@1 rising from 15.6% to 77.9% through RL, exemplifies RL's potential to deliver significant capability leaps with limited data [7]

Group 2: Challenges in Algorithm Engineering
- Algorithm engineers face significant challenges from complex distributed infrastructure, high GPU rental costs, and intricate architecture tuning, which hinder access to advanced training environments [3][9]
- Tinker aims to simplify the training process by providing a standard API that decouples algorithm design from infrastructure, allowing developers to focus on data and loss-function definitions [10]

Group 3: Efficiency and Cost Structure
- The Luchenyun Fine-Tuning SDK allows a single algorithm engineer to replace a large infrastructure team, significantly enhancing productivity by simplifying the training process [12][16]
- The SDK's serverless architecture introduces a pay-per-token billing model, charging users only for effective computation tokens used during prefill, sampling, and training, eliminating costs from idle GPU time [26][29]

Group 4: Practical Applications and User Experience
- The SDK supports use cases including academic research, startup MVP validation, and industrial applications, enabling users to conduct experiments without the burden of resource management [32][35][37]
- Users can train large models with familiar Python syntax, and the SDK provides a seamless experience from installation to execution, lowering the barrier to entry for complex model training [39][41]

Group 5: Future of AI Infrastructure
- The ultimate goal of AI infrastructure is "zero cognitive load": developers only describe data and algorithms, while all operational complexity is managed by the system [42]
- As GPU idle costs approach zero and environment setup times shrink, the efficiency of application innovation will be maximized, pushing the limits of computational capability [43]
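The decoupling the article describes, where the developer supplies only data and a loss function while the service handles distributed execution, can be illustrated with a toy primitive-style loop. The client class and the method names (`forward_backward`, `optim_step`) below are hypothetical stand-ins, not the SDK's confirmed API:

```python
# Hypothetical sketch of a primitive-style fine-tuning loop: the developer
# defines data and loss; a remote service owns gradients and optimizer state.
# MockTrainingClient and its method names are illustrative, not the real API.
class MockTrainingClient:
    """Stands in for a remote training service."""

    def __init__(self):
        self.steps = 0

    def forward_backward(self, batch, loss_fn):
        # A real service would compute the loss and accumulate gradients
        # server-side; here we just evaluate the user-defined loss locally.
        return loss_fn(batch)

    def optim_step(self):
        # A real service would apply the accumulated gradient update here.
        self.steps += 1


def sft_loss(batch):
    # Toy stand-in for the per-example loss a user would define (e.g. a
    # token-level cross-entropy); averages precomputed example losses.
    return sum(ex["nll"] for ex in batch) / len(batch)


client = MockTrainingClient()
data = [{"nll": 0.9}, {"nll": 1.1}]
for _ in range(3):
    loss = client.forward_backward(data, sft_loss)
    client.optim_step()
```

The point of the shape, per the article, is that nothing in the loop mentions GPUs, parallelism, or checkpoint plumbing.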
Tinker, the First Startup Product of OpenAI's Former CTO, Is Now Fully Upgraded and Open Here, with Freebies to Claim
机器之心· 2026-01-07 05:16
Core Insights
- The article discusses the launch of the Luchenyun Fine-tuning SDK, which is based on the Tinker SDK from Thinking Machines Lab, marking a shift from "craft-style" model training to "industrialized fine-tuning" [1][3][26]
- The SDK allows developers to focus on algorithm design while abstracting away the complexities of distributed training infrastructure, enabling a more efficient and cost-effective approach to fine-tuning large models [4][6][26]

Group 1: Technological Advancements
- The Tinker SDK simplifies the training process by providing standard APIs for the core training functions, allowing developers to define data and loss functions without worrying about infrastructure [4][6]
- The SDK supports both supervised fine-tuning (SFT) and complex reinforcement learning (RL) pipelines, enabling users to construct training flows from atomic functions [8][24]

Group 2: Cost Structure and Efficiency
- The Luchenyun SDK adopts a serverless architecture with a pay-per-token pricing model: users pay only for effective computation tokens used during prefill, sampling, and training, while other processes are free [14][18]
- This pricing model significantly reduces budget wasted on non-productive time, as users are no longer charged for GPU usage during data loading or debugging [14][18]

Group 3: User Experience and Accessibility
- The SDK provides a seamless experience, allowing users to work in familiar environments like Jupyter Notebook with standard Python syntax, enhancing productivity [8][10]
- An intelligent queue ensures tasks are executed promptly, with no charges during waiting periods, optimizing resource utilization [12]

Group 4: Target Users and Applications
- The SDK caters to various user groups, including researchers who can conduct experiments without worrying about infrastructure, and startups that require rapid MVP validation [19][20]
- In industrial applications, the SDK allows engineers to define loss logic and reinforcement learning reward functions, providing complete control over model training [21]

Group 5: Future Outlook
- The article emphasizes that post-training is evolving from an academic niche into a mainstream engineering focus, aiming for a "zero cognitive load" experience for developers [26]
- The Luchenyun Fine-tuning SDK is now fully open for use, with promotional offers for early adopters, indicating a push for widespread adoption [27][28]
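The pay-per-token arithmetic described above is simple to sketch: only prefill, sampling, and training tokens are billed, so idle GPU time contributes nothing to the total. All prices below are made-up illustrative numbers, not Luchenyun's actual rates:

```python
# Back-of-the-envelope model of pay-per-token billing. Prices are invented
# for illustration; only the three billable stages named in the article count.
PRICE_PER_TOKEN = {"prefill": 1e-6, "sample": 3e-6, "train": 6e-6}


def job_cost(token_counts):
    """Sum cost over billable stages; anything not listed (queueing,
    data loading, debugging) costs nothing under this model."""
    return sum(PRICE_PER_TOKEN[stage] * n for stage, n in token_counts.items())


# A run that prefills 200k tokens, samples 50k, and trains on 120k:
cost = job_cost({"prefill": 200_000, "sample": 50_000, "train": 120_000})
```

Under a per-GPU-hour model, the same run would also pay for every idle minute between those stages; here those minutes simply never enter the sum.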
HTSC Morning Brief - 20251204
HTSC· 2025-12-04 01:43
Group 1: Macroeconomic Insights
- The Japanese central bank's potential interest rate hike in December could lead to an increase in government bond yields, influenced by high inflation and upcoming fiscal stimulus [2][3]
- Global macroeconomic and policy expectations have been recalibrated, with service-sector PMIs in the US, Europe, and Japan remaining high while manufacturing PMIs have declined [3]
- The market is fluctuating in response to the Federal Reserve's interest rate outlook, with mixed performances in US stock indices and a decline in oil prices [3]

Group 2: Fixed Income Analysis
- Cross-period price differences in interest rate derivatives are influenced by the CTD bond's coupon rate, full price, and three-month repo rates, along with market sentiment [4]
- The movement of contracts during the roll period indicates strong participation in positive spreads, leading to an initial increase in cross-period price differences [4]

Group 3: Consumer Sector Opportunities
- The consumer sector is witnessing structural changes driven by technology and innovation, with new consumption trends emerging in areas like trendy toys, beauty products, and ready-to-drink beverages [6]
- Investment strategies should focus on four main themes: the rise of domestic brands, technology-enabled consumption, emotional spending, and undervalued high-dividend blue-chip stocks [6]

Group 4: Aerospace and Defense
- The development of reusable rockets is crucial for reducing costs and increasing capacity in space activities, with companies like SpaceX leading the way [7]
- China's advancements in reusable rocket technology, such as the Zhuque-3 and Long March 12, are expected to enhance space launch capabilities and reduce costs [7]

Group 5: Energy Sector Analysis
- Xin'ao Energy's privatization process is progressing, with key regulatory approvals completed, and the company is showing strong operational performance in natural gas retail [8]
- The company's fundamentals are improving, supported by expanding projects and increasing customer penetration rates, leading to a positive long-term outlook [8]

Group 6: Rating Changes
- Recent adjustments in stock ratings include upgrades for companies like Hayuan Engineering and new buy ratings for firms such as Aerospace Intelligence Manufacturing and BOSS Zhipin, reflecting positive earnings forecasts [9]
Professor Peng Yijie's Group at Peking University Proposes RiskPO, Reshaping Large Model Post-Training with Risk-Measure Optimization
机器之心· 2025-10-15 02:54
Core Insights
- The article discusses the limitations of traditional reinforcement learning (RL) methods in enhancing the reasoning capabilities of large models, particularly the "mean optimization trap" that leads to a lack of exploration and ineffective learning on challenging tasks [4][24]
- A new approach called RiskPO is introduced, which integrates risk-averse principles into the optimization objective, focusing on the left tail of the reward distribution to guide models in overcoming reasoning shortcomings [7][24]

Research Background and Challenges
- Large models in post-training face the "mean optimization trap," which results in a loss of exploration ability and ineffective learning on difficult tasks [4][24]
- Existing methods, such as GRPO, have improved short-term metrics but have not expanded the reasoning boundaries necessary for complex tasks [4][24]

Technical Solution Overview
- RiskPO combines risk measurement with a bundling strategy to address the shortcomings of traditional mean optimization [6][7]
- Its core is the Mixed Value-at-Risk (MVaR) objective function, which replaces the pursuit of overall mean reward with an emphasis on low-reward, difficult tasks [9][10]

Experimental Results
- The Peking University team demonstrated the effectiveness of RiskPO across various tasks, achieving significant improvements in reasoning capability, particularly on challenging problems [15][18]
- On the AIME24 benchmark, RiskPO outperformed GRPO by nearly 7 percentage points in Pass@32, and it achieved a Pass@1 of 81.8% on the MATH500 dataset, surpassing GRPO by 2.6 percentage points [15][16]

Theoretical Support and Validation
- RiskPO's performance improvements are backed by solid theoretical foundations and rigorous ablation studies, showing that risk-averse updates can effectively mitigate entropy collapse [20][21]
- While mean-based metrics show similar performance early in training, risk-sensitive metrics reveal significant advantages for RiskPO as training progresses [23][24]

Comparison with Alternative Strategies
- A comparison with risk-seeking strategies demonstrated that focusing on easier tasks leads to rapid entropy collapse and stagnation, while risk-averse strategies drive continuous improvement [26][27]
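The left-tail focus described above can be made concrete with a CVaR-style average over the worst fraction of a group's rewards. This is an illustrative sketch in the spirit of RiskPO's risk-averse idea, not the paper's exact MVaR formula:

```python
# Illustrative left-tail objective: average only the worst alpha-fraction of
# rewards, so gradient signal concentrates on the problems the model fails.
# This is a CVaR-style simplification, not RiskPO's actual MVaR objective.
def left_tail_mean(rewards, alpha=0.25):
    """Mean of the lowest alpha-fraction of rewards (at least one sample)."""
    k = max(1, int(len(rewards) * alpha))
    worst = sorted(rewards)[:k]
    return sum(worst) / k


# Eight sampled responses: the plain mean (0.625) looks healthy, but the
# left tail exposes the unsolved cases the mean objective would ignore.
rewards = [0.0, 0.0, 0.1, 0.9, 1.0, 1.0, 1.0, 1.0]
tail = left_tail_mean(rewards, alpha=0.25)
```

Optimizing such a tail objective keeps pressure on hard instances even after the easy ones saturate, which is the mechanism the article credits for avoiding the mean optimization trap.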
NeurIPS 25 | An Upgraded GRPO Arrives: GVPO Restructures the Large Model Post-Training Paradigm
机器之心· 2025-10-14 02:06
Core Viewpoint
- Post-training of large models is becoming a key aspect of AI evolution, focusing on enhancing reasoning capabilities, aligning with human preferences, and maintaining stability and efficiency [1]

Summary by Sections

GVPO Introduction
- The team from Zuoyebang and the Hong Kong University of Science and Technology proposed a new method, GVPO (Group Variance Policy Optimization), to address the instability issues of GRPO (Group Relative Policy Optimization) [2]

Design Motivation
- Inspired by DPO (Direct Preference Optimization), the research team aims to maximize reward under a KL constraint in the GRPO setting, which involves multiple samplings per prompt [5]

Practical Challenges
- A significant challenge is computing the expectation of the partition term Z(x) over all possible samples, which is practically intractable. The team found that ensuring the gradient weights of all samples under the same prompt sum to zero allows Z(x) to cancel out, avoiding this computation entirely [6]

Key Advantages of GVPO
1. **Unique Optimal Solution Guarantee**: GVPO's MSE form carries a strict mathematical proof that it achieves a unique optimal solution when R_θ equals R, ensuring the algorithm's effectiveness and stability [13]
2. **No Need for Importance Sampling**: GVPO's optimal solution places minimal restrictions on the sampling distribution, allowing off-policy training without the instability commonly associated with importance sampling [14]

Analytical Perspectives
- GVPO can be understood from three complementary analytical perspectives, each corresponding to an equivalent loss function:
1. **Negative Log-Likelihood Perspective (NLL)**: GVPO's loss function can be viewed as a weighted negative log-likelihood, allowing flexible integration of historical and heterogeneous data sources [17]
2. **Mean Squared Error Perspective (MSE)**: the optimization goal is to minimize the deviation between implicit and actual rewards, ensuring convergence to a unique global optimum under the KL constraint [18]
3. **Reinforcement Learning Perspective (RL)**: this perspective highlights the three components of the GVPO loss function, emphasizing the balance between actual and predicted rewards [19]

Experimental Results
- On mathematical reasoning tasks, GVPO outperformed GRPO and its improved variant Dr.GRPO across five benchmark tests, significantly enhancing the base model's performance [21]
- Ablation studies indicate GVPO is insensitive to the hyperparameter β and scales well as the number of samples increases, allowing smaller models to match larger ones [23]

Significance and Future Prospects
- GVPO represents a paradigm shift in post-training, moving from experience-driven approaches to ones with theoretical guarantees, enhancing the stability, flexibility, and efficiency of large model training [25][26]
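The MSE perspective above can be sketched numerically. Taking the implicit reward as β·log(π_θ/π_ref), centering both implicit and actual rewards within a prompt's sample group makes the per-sample weights sum to zero, which is how the intractable Z(x) term drops out of the gradient. The function below is a hypothetical reconstruction of that idea from the summary, not the paper's exact loss:

```python
# Sketch of a group-centered MSE loss in the spirit of GVPO's description:
# match centered implicit rewards beta*log(pi_theta/pi_ref) to centered
# actual rewards within one prompt's sample group. Illustrative only.
def gvpo_style_loss(logp_theta, logp_ref, rewards, beta=0.1):
    implicit = [beta * (lt - lr) for lt, lr in zip(logp_theta, logp_ref)]

    def center(xs):
        m = sum(xs) / len(xs)
        return [x - m for x in xs]

    # Centering makes the per-sample weights sum to zero within the group,
    # so the partition term Z(x) cancels and never needs to be computed.
    ci, cr = center(implicit), center(rewards)
    return sum((a - b) ** 2 for a, b in zip(ci, cr)) / len(rewards)


# Two samples for one prompt: identical policies give zero implicit reward,
# so the loss reduces to the variance of the centered actual rewards.
loss = gvpo_style_loss([-1.0, -2.0], [-1.0, -2.0], [1.0, 0.0])
```

The MSE form also makes the uniqueness claim intuitive: the loss is zero exactly when the centered implicit rewards match the centered actual rewards.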
Real AI Competitiveness Hides in the "Post-Training" Step of Large Models
量子位· 2025-10-13 08:47
Core Insights
- The article emphasizes the importance of Post-Training as a transformative approach in AI, moving beyond simple model optimization to creating specialized intelligent engines tailored to specific business needs [1][4]
- The evolution of Post-Training technology is highlighted, showcasing a shift from Supervised Fine-Tuning (SFT) to Reinforcement Learning (RL) methodologies, which better align with complex business requirements [2][4]

Summary by Sections

Post-Training Evolution
- The initial approach in the industry was SFT, which allowed models to learn specific domain knowledge and dialogue styles [2]
- However, SFT was insufficient for teaching models the complex value judgments and strategic choices that are critical in real business scenarios [3]
- The focus has shifted to RL, evolving from human-dependent methods (RLHF) to automated systems (RLVR) and the innovative use of Natural Language Rewards [4][5]

Implementation Pathway
- The article outlines a four-step pathway for enterprises to implement Post-Training effectively, addressing challenges such as data quality, high labeling costs, and defining reward signals [5][8]
- Successful case studies from companies like Zhihu, AutoHome, and Weibo illustrate practical applications of these steps, showcasing improvements in data quality and model performance [7][8]

Step 1: Data Preparation
- High-quality data is the cornerstone of successful Post-Training, with companies spending 60-70% of their time on data preparation [10]
- Zhihu and AutoHome have developed methods to enhance data quality through pre-labeling and structured data utilization, respectively [11][13]

Step 2: Model Selection
- Choosing the right base model is crucial, with many companies opting for the Tongyi Qianwen series due to its performance and support for Post-Training [14][16]
- The model's architecture and open-source ecosystem facilitate easier implementation of Post-Training techniques [15][18]

Step 3: Reward Mechanism Design
- The design of a reward mechanism is essential for aligning model outputs with business objectives, transitioning from human feedback to automated verification systems [24][25]
- Companies like Yingmi Fund are exploring ways to integrate expert decision-making frameworks into their models to enhance performance [26]

Step 4: Evaluation System
- A robust evaluation system is necessary to measure the effectiveness of Post-Training, with Yingmi Fund developing benchmarks to assess model performance in real-world scenarios [27][28]
- Successful implementations have led to significant improvements in model accuracy and business outcomes, as seen in the cases of Baifeng Cloud and Quark [30][32]

Conclusion
- The true competitive advantage in AI lies in how companies leverage their unique data and business insights through Post-Training to create proprietary intelligent engines [32]
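The shift from human feedback to automated verification (RLVR) that Step 3 describes comes down to replacing a learned preference model with a programmatic check. A minimal sketch, with a made-up checker for a task whose answer can be verified exactly:

```python
# Minimal verifiable-reward sketch: instead of a trained reward model, the
# reward is a deterministic check against a known ground truth. The task
# and checker here are invented for illustration.
def math_reward(model_answer: str, gold_answer: str) -> float:
    """1.0 if the model's final answer matches the gold answer, else 0.0."""
    return 1.0 if model_answer.strip() == gold_answer.strip() else 0.0


r_correct = math_reward("  42 ", "42")   # whitespace is ignored
r_wrong = math_reward("41", "42")
```

Because the signal is exact and free of annotator cost, it scales to millions of rollouts; the trade-off is that it only covers tasks where correctness can be checked programmatically, which is why the article also mentions Natural Language Rewards for fuzzier business objectives.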
Explainer: Deconstructing Large Model Post-Training, the Past and Present of GRPO and Its Successors
36Ke· 2025-09-01 04:38
Group 1
- The core concept of the article is the evolution of post-training methods in large language models, particularly the GRPO algorithm as a significant advancement in reinforcement learning paradigms [2][46]
- GRPO has emerged as a universal reinforcement learning algorithm applicable to a wide range of post-training tasks, with notable improvements over previous methods like PPO [2][48]
- The article discusses the importance of post-training in enhancing the adaptability and flexibility of models, addressing the limitations of pre-training alone [5][46]

Group 2
- The article highlights the transition from PPO to GRPO, emphasizing the reduction in computational cost and memory requirements that makes GRPO a more efficient alternative [18][14]
- GRPO's methodology uses the rewards of a group of sampled responses to establish a baseline for advantage estimation, eliminating the need for a separate value function [16][14]
- Despite its advantages, GRPO still faces stability issues, prompting further research and the development of improved algorithms like DAPO and GSPO [19][48]

Group 3
- DAPO, developed by ByteDance and Tsinghua AIR, builds upon GRPO by introducing enhancements such as Clip-Higher and dynamic sampling to improve training efficiency [20][21]
- GSPO represents a significant advancement by shifting the focus from token-level to sequence-level importance sampling, which enhances training stability [28][30]
- GFPO addresses the limitations of GRPO by allowing simultaneous optimization of multiple response attributes, improving the overall performance of models [33][34]
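The group-relative baseline described above fits in a few lines: rewards for several responses to the same prompt are standardized against the group's own mean and standard deviation, so no learned value function is needed. A minimal illustration, not tied to any particular library:

```python
import statistics

# Sketch of GRPO's group-relative advantage: each sampled response's reward
# is standardized within its own group, replacing PPO's learned critic.
def group_advantages(rewards):
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # avoid division by zero
    return [(r - mean) / std for r in rewards]


# Four rollouts for one prompt, two correct and two wrong:
advs = group_advantages([1.0, 0.0, 1.0, 0.0])
```

Correct responses get positive advantage and wrong ones negative, purely from within-group comparison; this is what removes the critic network and with it roughly half of PPO's memory footprint.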
Explainer: Deconstructing Large Model Post-Training, the Past and Present of GRPO and Its Successors
机器之心· 2025-09-01 02:49
Core Viewpoint
- The article discusses the evolution and significance of the Group Relative Policy Optimization (GRPO) algorithm in the context of large language models and reinforcement learning, highlighting its advantages and limitations compared to previous methods like Proximal Policy Optimization (PPO) [4][38]

Summary by Sections

Development of Large Language Models
- The rapid advancement of large language models has led to the emergence of various post-training methods, with GRPO a notable innovation that enhances reinforcement learning paradigms [3][5]

Post-Training and Reinforcement Learning
- Post-training is crucial for refining models' capabilities in specific domains, enhancing adaptability and flexibility to meet diverse application needs [12][11]
- Reinforcement learning, particularly through human feedback (RLHF), plays a vital role in the post-training phase, aiming to optimize model outputs based on user preferences [14][19]

GRPO and Its Advantages
- GRPO eliminates the need for a separate critic model, reducing memory and computational costs significantly compared to PPO, which requires dual networks [30][35]
- GRPO uses the rewards within each group of sampled responses as the baseline for evaluating model improvements, simplifying the training process [34][35]

Comparison of GRPO and PPO
- GRPO offers substantial improvements in memory requirements and training speed, making it a more efficient choice for large language model training [37]
- Despite its advantages, GRPO still faces stability issues similar to those of PPO, particularly in smaller-scale reinforcement learning tasks [39]

Recent Innovations: DAPO, GSPO, and GFPO
- DAPO introduces enhancements to GRPO, such as Clip-Higher and dynamic sampling, to address practical challenges encountered during training [41][42]
- GSPO advances the methodology by shifting the focus from token-level to sequence-level importance sampling, significantly improving training stability [48][49]
- GFPO allows simultaneous optimization of multiple response attributes, addressing GRPO's limitations with scalar feedback and multi-round reasoning tasks [61][63]

Conclusion
- The evolution of post-training methods, from PPO to GRPO and beyond, illustrates a clear trajectory in optimizing large language models, with GRPO serving as a pivotal point for further advancements in the field [81][82]
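The token-level versus sequence-level distinction in the GSPO summary can be made concrete: PPO/GRPO-style updates weight each token by its own probability ratio, while a sequence-level ratio assigns one weight to the whole response (with length normalization, as GSPO's description suggests). A sketch with made-up log-probabilities:

```python
import math

# Contrast of token-level and sequence-level importance ratios, the shift
# the GSPO summary describes. Log-probabilities below are invented numbers.
def token_ratios(logp_new, logp_old):
    """One ratio per token: exp(logp_new_t - logp_old_t)."""
    return [math.exp(n - o) for n, o in zip(logp_new, logp_old)]


def sequence_ratio(logp_new, logp_old, length_normalized=True):
    """One ratio for the whole response; length-normalizing the log-diff
    keeps the ratio's scale comparable across responses of different length."""
    diff = sum(logp_new) - sum(logp_old)
    if length_normalized:
        diff /= len(logp_new)
    return math.exp(diff)


# Two tokens whose per-token shifts cancel: token ratios swing in opposite
# directions, while the sequence-level ratio stays at 1.
tr = token_ratios([-1.0, -1.0], [-1.5, -0.5])
sr = sequence_ratio([-1.0, -1.0], [-1.5, -0.5])
```

Averaging the log-ratio over the sequence damps the per-token variance that the summaries blame for GRPO's instability, which is the intuition behind the claimed stability gain.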