Reward Functions
The essence of SFT is, in fact, optimizing a lower bound of the RL objective...
自动驾驶之心· 2025-10-22 00:03
Core Insights
- The article establishes that, under sparse rewards, the training objective of supervised fine-tuning (SFT) is a loose lower bound of the reinforcement learning (RL) objective, and introduces a bridge distribution to tighten this lower bound while maintaining training stability [1][9][23].

Group 1: Relationship Between SFT and RL
- The training objective of RL policy gradient algorithms is defined first, and SFT and RL are then linked through a derivation of this objective [4][3].
- SFT operates on a fixed set of labeled data, in contrast to RL's online sampling, which optimizes the policy model using reward values [5][9].
- The article shows that SFT's optimization objective can be viewed as a lower bound of the RL objective, which explains why SFT training is effective at all [9][23]. (A reconstruction of this derivation is sketched below.)

Group 2: Importance Sampling and Adjustments
- Importance sampling is applied to move the RL training objective from online to offline sampling [6][11].
- A key finding is that the SFT lower bound may become looser as training progresses, so adjustments are needed to keep the bound tight [9][11].
- An auxiliary distribution is introduced to adjust the SFT training objective, yielding a tighter lower bound while preserving training stability [11][12].

Group 3: Properties of iw SFT
- The iw SFT formulation incorporates a freely adjustable weight coefficient that allows the lower bound to be tightened [11][13].
- The choice of auxiliary distribution is critical: it should stay close to the reference distribution so the bound remains tight without sacrificing stability [13][14].
- Two methods for constraining the importance weights are proposed: clipping them, and smoothing them to reduce variance [14][15]. (An illustrative loss sketch is given below.)

Group 4: Practical Implications
- A multi-armed bandit example illustrates the advantages of iw SFT, showing how it can exploit information from negative samples to improve policy convergence [18][19][20]. (A toy bandit sketch follows below.)
- The overall conclusion emphasizes the importance of understanding the relationship between SFT and RL, and how these adjustments can enhance training outcomes [23].
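As a hedged reconstruction of the derivation the article summarizes (notation is mine; the source may define things differently): with a binary reward R(y) ∈ {0,1}, the SFT dataset can be read as samples from the success-filtered distribution q(y) ∝ π_ref(y)R(y), and Jensen's inequality turns the RL objective into an SFT-shaped lower bound.

```latex
% RL objective rewritten over the success-filtered data distribution q
J(\theta) \;=\; \mathbb{E}_{y\sim \pi_\theta}\!\left[R(y)\right]
          \;=\; Z\,\mathbb{E}_{y\sim q}\!\left[\tfrac{\pi_\theta(y)}{\pi_{\mathrm{ref}}(y)}\right],
\qquad Z := \mathbb{E}_{y\sim\pi_{\mathrm{ref}}}\!\left[R(y)\right],
\quad  q(y) := \tfrac{\pi_{\mathrm{ref}}(y)\,R(y)}{Z}

% Jensen's inequality on the concave log gives an SFT-shaped lower bound
\log J(\theta) \;\ge\;
  \underbrace{\mathbb{E}_{y\sim q}\!\left[\log \pi_\theta(y)\right]}_{\text{SFT log-likelihood on successful samples}}
  \;-\; \mathbb{E}_{y\sim q}\!\left[\log \pi_{\mathrm{ref}}(y)\right] \;+\; \log Z
```

Only the first term depends on θ, so maximizing the SFT likelihood maximizes a lower bound of log J(θ); the Jensen gap is what the bridge/auxiliary distribution of Groups 2-3 is introduced to shrink.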
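The digest does not reproduce the exact iw SFT loss, so the following is an illustrative PyTorch-style sketch under my own assumptions: the SFT log-likelihood is reweighted by a detached importance ratio π_θ/π_ref, and the two constraints mentioned above appear as clipping plus a smoothing that I implement as tempering with an exponent alpha < 1. The function name, clip range, and smoothing form are mine, not the paper's.

```python
import torch

def iw_sft_loss(logp_theta: torch.Tensor,
                logp_ref: torch.Tensor,
                clip_range: tuple = (0.5, 2.0),
                alpha: float = 1.0) -> torch.Tensor:
    """Illustrative importance-weighted SFT loss (not the paper's exact form).

    logp_theta : per-example log-prob of the labeled response under pi_theta
    logp_ref   : per-example log-prob under the reference policy pi_ref
    clip_range : bounds for clipping the importance weight
    alpha      : tempering exponent in (0, 1]; alpha < 1 smooths the weights
                 toward 1 and reduces their variance
    """
    with torch.no_grad():
        w = torch.exp(alpha * (logp_theta - logp_ref))  # (pi_theta / pi_ref)^alpha
        w = w.clamp(*clip_range)                        # clipped importance weight
    # Plain SFT is recovered when every weight equals 1.
    return -(w * logp_theta).mean()
```

Setting alpha = 1 with a wide clip range keeps the raw importance weights; pushing alpha toward 0 or tightening the clip range collapses the loss back to plain SFT, which is the stability/tightness trade-off the article describes.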
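The article's bandit experiment is only described, not reproduced here, so this is a self-contained toy of my own (hypothetical arm probabilities and hyperparameters) that merely shows where clipped importance weights enter an offline update on success-filtered bandit data; it is not a reproduction of the article's results.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 3-armed bandit: only arm 0 yields reward 1, the others yield 0.
true_reward = np.array([1.0, 0.0, 0.0])

# Behaviour/reference policy that generated the offline dataset.
pi_ref = np.array([0.2, 0.3, 0.5])
arms = rng.choice(3, size=2000, p=pi_ref)
positives = arms[true_reward[arms] > 0]    # SFT-style data: successful pulls only

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def train(use_iw: bool, steps: int = 300, lr: float = 0.5,
          clip: tuple = (0.5, 2.0)) -> np.ndarray:
    """Maximize the (optionally importance-weighted) log-likelihood of
    successful arms under a softmax policy with logits theta."""
    theta = np.zeros(3)
    for _ in range(steps):
        pi = softmax(theta)
        grad = np.zeros(3)
        for a in positives:
            w = np.clip(pi[a] / pi_ref[a], *clip) if use_iw else 1.0
            grad += w * (np.eye(3)[a] - pi)   # gradient of log softmax(theta)[a]
        theta += lr * grad / len(positives)
    return softmax(theta)

print("plain SFT policy:", np.round(train(False), 3))
print("iw SFT policy   :", np.round(train(True), 3))
```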
We spoke with three university professors about the increasingly serious problem of AI hallucinations
36Ke· 2025-07-15 03:23
Group 1
- The recent incident involving DeepSeek highlights the issue of AI hallucinations: the model fabricated events and referenced non-existent legal documents, raising concerns about rising hallucination rates in AI models [1][2].
- OpenAI's o3 model has shown a significant increase in hallucination rate, with 33% of responses exhibiting hallucinations, nearly double that of its predecessor o1; rates are even higher in other models, such as o4-mini at 48% [1][2].
- The phenomenon of hallucinations is linked to over-optimization in reinforcement learning (RL): models may produce correct answers through flawed reasoning processes, leading to a disconnect between output and logical reasoning [2][3].

Group 2
- Experts suggest that the increase in hallucinations is indicative of a broader difficulty in specifying what humans truly want from AI, as models optimized for specific tasks may neglect the quality of their reasoning processes [3][4].
- The reinforcement learning paradigm primarily rewards final outcomes, which can lead models to develop incorrect but efficient strategies, contributing to the hallucination phenomenon [3][4].
- Current reinforcement learning methods, such as GRPO, have not effectively addressed the need for regularization of the reasoning process, resulting in models that may produce correct answers while lacking logical coherence [4][5].

Group 3
- The design of reward functions in reinforcement learning remains a critical challenge, as it is difficult to create effective supervisory signals for the reasoning processes of large models [6][7].
- More sophisticated reward models are needed that provide feedback on the reasoning process itself, rather than solely on the final output, to mitigate hallucination issues [5][6]. (A hedged sketch contrasting outcome-only and process-shaped rewards follows this section.)
- Exploring non-scalar feedback mechanisms, such as language-based feedback, could enhance training by allowing models to adjust based on qualitative assessments rather than just numerical rewards [7][8].

Group 4
- Current benchmarks for evaluating model reasoning capabilities are limited, as they often rely on fixed datasets that do not capture the flexibility of large language models [9][10].
- The ability of models to generalize and perform well on varied tasks is still under scrutiny, with evidence suggesting that many models rely heavily on memorization rather than true reasoning [10][11].
- Future advancements in model training will require a focus on dynamic interaction with complex environments, to foster genuine learning and reasoning capabilities beyond mere imitation of human behavior [15][16].
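Since the professors' contrast between outcome-only rewards and process-level feedback is abstract, here is a minimal hedged sketch of the two signal shapes; the step_scorer verifier, the blend weight beta, and the blending rule are my own illustrative assumptions, not anything proposed in the article.

```python
from typing import Callable, List

def outcome_reward(answer: str, reference: str) -> float:
    """Outcome-only reward: 1 if the final answer matches the reference,
    0 otherwise. This sparse signal pays 'right answer, flawed reasoning'
    exactly as much as sound reasoning."""
    return 1.0 if answer.strip() == reference.strip() else 0.0

def process_shaped_reward(steps: List[str], answer: str, reference: str,
                          step_scorer: Callable[[str], float],
                          beta: float = 0.5) -> float:
    """Illustrative shaped reward: blend the outcome score with the average
    score a (hypothetical) step-level verifier assigns to each reasoning
    step. The verifier and the weight beta are assumptions of this sketch,
    not a method described in the article."""
    outcome = outcome_reward(answer, reference)
    process = sum(step_scorer(s) for s in steps) / max(len(steps), 1)
    return (1.0 - beta) * outcome + beta * process
```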