Workflow
SFT
icon
Search documents
SFT的本质,其实是在优化RL目标的下界...
自动驾驶之心· 2025-10-22 00:03
作者 | 欲壑难填@知乎 转自 | SFT 其实在优化 RL 目标的下界 原文链接: https://zhuanlan.zhihu.com/p/1950847739404456574 点击下方 卡片 ,关注" 大模型之心Tech "公众号 戳我-> 领取大模型巨卷干货 本文只做学术分享,如有侵权,联系删文 ,欢迎添加小助理微信AIDriver004做进一步咨询 TL;DR:本文推导出在稀疏奖励的情况下, 标准 SFT 的训练目标其实是 RL 目标的一个(松的)下界,为了收紧这个下界同时 保持训练稳定,作者引入了一个桥梁分布 来进行调节。最终在形式上得到了一个重要性加权版本的 SFT 目标。 论文链接:https://arxiv.org/abs/2507.12856 SFT 的优化目标是 RL 的下界 在 SFT 的设定下,我们只有 "好的" 回复数据。从 RL 的视角来看,这可以理解为我们有一个打分函数 能够区分出好的回 复和差的回复,并据此构建一个奖励函数 ,只对打分值为正的样本给出奖励值 1,其他样本奖励值均为 首先,我们通过目标函数的推导,将 SFT 和 RL 联系起来。 RL 策略梯度算法中,训练策略 ...
后训练的「分」与「合」,SFT&RL 大一统才是正解?
机器之心· 2025-09-14 01:30
Group 1 - The article discusses the limitations of the traditional "SFT followed by RL" paradigm in post-training for AI models, suggesting a unified approach that combines both methods [7][9][10] - It highlights the importance of post-training in aligning the model's capabilities with human values and preferences, addressing the challenges of "catastrophic forgetting" and overfitting associated with SFT [8][11][12] - The emerging trend in the industry is to explore a unified framework for post-training that leverages the strengths of both SFT and RL, rather than treating them as separate processes [10][15][17] Group 2 - The article evaluates the competitive landscape of AI hardware among major players like Meta, OpenAI, Apple, and Google, questioning whether AI hardware will become a new essential or merely a passing trend [2] - It raises questions about the user experience with AI hardware, such as whether it will truly replace traditional devices or simply serve as an additional feature [2][3] - The potential for innovative AI hardware forms to integrate seamlessly into daily life is explored, along with the implications for user interaction and technology adoption [2][3] Group 3 - The article examines the role of generative AI in search, debating whether it will serve as a replacement for traditional search engines or act as a growth engine for expanding user queries and intentions [3] - It discusses how multimodal interactions and conversational AI are redefining task completion for users, potentially enhancing the value of advertising and commercial opportunities [3] - Google's strategy of gradually integrating AI capabilities into its products, rather than waiting for full technological maturity, reflects a proactive approach to product development and market positioning [3]
大模型微调到底有没有技术含量,或者说技术含量到底有多大?
自动驾驶之心· 2025-08-10 23:32
Core Viewpoint - The article emphasizes the importance of individual approaches and methodologies in the field of large language models (LLMs), particularly in the context of fine-tuning and data quality, suggesting that the technical depth of work in this area is highly dependent on personal engagement and practices [5][16]. Data Work - Method 1 involves inheriting training data from colleagues without checking data quality, which may lead to suboptimal results [7]. - Method 2 suggests downloading open-source data to create a "system + query + answer" dataset [8]. - Method 3 focuses on generating data using GPT-4, emphasizing the diversity of prompts and the importance of data quality checks [8]. - Method 4 advocates using user interaction logs to drive data construction, analyzing user feedback to improve answer quality [9]. - Method 5 recommends breaking down complex tasks at the data level to enhance model performance [9]. Training Code - Method 1 involves inheriting training code and making minimal modifications [11]. - Method 2 encourages a thorough understanding of training code parameters and their implications [11]. - Method 3 promotes questioning and improving training code, such as optimizing speed and framework choices [12]. Experimental Analysis - Method 1 suggests running prepared evaluation sets and addressing data quality issues based on results [14]. - Method 2 involves analyzing bad cases from models to identify underlying issues and designing experiments to validate findings [14]. - Method 3 emphasizes the relationship between model results, data quality, and training methods, advocating for a comprehensive analysis of training logs and evaluation results [15]. Community and Collaboration - The article highlights the establishment of a large community focused on various aspects of autonomous driving technology, including large models and multi-sensor fusion, with nearly 4,000 members and over 300 companies and research institutions involved [18].
OpenThoughts: Data Recipes for Reasoning Models — Ryan Marten, Bespoke Labs
AI Engineer· 2025-07-19 21:10
[Music] I'm Ryan. I'm a founding engineer at Bespoke Labs. And today I'm going to talk to you about Open Thoughts, which is our project to create the best open-source reasoning data sets.And I'll be switching tack a little bit from our earlier discussions on reasoning and RL and focus on the reasoning part and you'll see why. So just so we're on the same page, we've talked a lot about reasoning, but what's actually going on here. So I like this graph from JSON which shows this incredible performance that's ...