Policy Gradient Algorithms
The essence of SFT: it is actually optimizing a lower bound of the RL objective...
自动驾驶之心· 2025-10-22 00:03
Core Insights
- The article establishes that, under sparse rewards, the training objective of Supervised Fine-Tuning (SFT) is a loose lower bound of the Reinforcement Learning (RL) objective, and introduces a bridge distribution to tighten this lower bound while maintaining training stability [1][9][23].

Group 1: Relationship Between SFT and RL
- The training objective of RL policy gradient algorithms is defined, and the derivation of this objective is what links SFT and RL [4][3].
- SFT operates on a fixed set of labeled data, in contrast to RL's online sampling, which optimizes the policy model based on reward values [5][9].
- The article shows that SFT's optimization goal can be viewed as a lower bound of the RL objective, which explains why SFT training is effective to some degree [9][23]. (A sketch of the lower-bound argument is given after this summary.)

Group 2: Importance Sampling and Adjustments
- Importance sampling is applied to move the RL training objective from online to offline sampling [6][11].
- A key finding is that the lower bound optimized by SFT may become looser as training progresses, so adjustments are needed to tighten it [9][11].
- An auxiliary distribution is introduced to adjust the SFT training objective, yielding a tighter lower bound while preserving training stability [11][12].

Group 3: Properties of iw SFT
- The iw SFT formulation incorporates a weight coefficient that can be freely adjusted, which allows the lower bound to be tightened [11][13].
- The choice of the auxiliary distribution is critical: it should stay close to the reference distribution so that the bound remains tight while training stays stable [13][14].
- Two methods for constraining the importance weights are proposed: clipping the weights, and smoothing them to reduce variance [14][15]. (See the loss sketch after this summary.)

Group 4: Practical Implications
- The advantages of iw SFT are illustrated with a multi-armed bandit example, showing how negative-sample information can be exploited to improve policy convergence [18][19][20]. (A toy simulation follows this summary.)
- The overall conclusion emphasizes the importance of understanding the relationship between SFT and RL, and how these adjustments can improve training outcomes [23].
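The lower-bound relationship described in Groups 1 and 2 follows a standard importance-sampling-plus-Jensen argument. The sketch below is a reconstruction under the usual assumptions (binary sparse reward, curated data distribution q supported on reward-1 samples); the article's exact notation and conditioning on prompts may differ.

```latex
% Sketch of the SFT-as-lower-bound argument (assumed notation).
% \pi_\theta: policy, q: distribution of the curated SFT data,
% R(y) \in \{0,1\}: sparse reward, with q supported on samples where R(y)=1.
\begin{align}
J(\theta) &= \mathbb{E}_{y \sim \pi_\theta}\!\left[ R(y) \right]
           = \mathbb{E}_{y \sim q}\!\left[ \frac{\pi_\theta(y)}{q(y)}\, R(y) \right]
           && \text{(importance sampling)} \\
\log J(\theta) &\ge \mathbb{E}_{y \sim q}\!\left[ \log \pi_\theta(y) + \log R(y) - \log q(y) \right]
           && \text{(Jensen's inequality)} \\
          &= \underbrace{\mathbb{E}_{y \sim q}\!\left[ \log \pi_\theta(y) \right]}_{\text{SFT objective}}
             + \text{const. w.r.t. } \theta
\end{align}
% The Jensen gap depends on how far \pi_\theta drifts from q, which is why the
% bound can loosen during training; inserting a bridge (auxiliary) distribution
% between q and \pi_\theta is what tightens it.
```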
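Group 3 describes an importance-weighted SFT loss whose weights are constrained by clipping and smoothing. The following is a minimal PyTorch-style sketch under assumed names and shapes: logp_policy and logp_data are per-sequence log-probabilities under the current policy and the data/auxiliary distribution, and the detach, smoothing exponent, and clip threshold are illustrative choices, not the article's exact scheme.

```python
import torch

def iw_sft_loss(logp_policy: torch.Tensor,
                logp_data: torch.Tensor,
                clip: float = 5.0,
                smooth: float = 0.5) -> torch.Tensor:
    """Importance-weighted SFT loss (sketch, assumed formulation).

    logp_policy: log pi_theta(y) per sequence, requires grad.
    logp_data:   log q(y) under the data / auxiliary distribution (no grad).
    clip:        upper bound on the importance weight (assumed hyperparameter).
    smooth:      exponent in (0, 1] that flattens the weights to reduce variance
                 (assumed smoothing scheme).
    """
    # Importance weight w = pi_theta(y) / q(y), treated as a constant so the
    # gradient flows only through the log-likelihood term (weighted SFT).
    log_w = logp_policy.detach() - logp_data
    # Smoothing: raise w to a power < 1 (scale log w) to damp extreme weights.
    log_w = smooth * log_w
    # Clipping: cap the weight so single samples cannot dominate the batch.
    w = torch.exp(log_w).clamp(max=clip)
    # Weighted negative log-likelihood over the curated (reward-1) samples.
    return -(w * logp_policy).mean()


# Toy usage with random per-sequence log-probabilities.
if __name__ == "__main__":
    logp_policy = torch.randn(8, requires_grad=True)
    logp_data = torch.randn(8)
    loss = iw_sft_loss(logp_policy, logp_data)
    loss.backward()
    print(float(loss))
```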
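Group 4 illustrates iw SFT on a multi-armed bandit. The toy simulation below is a hypothetical reconstruction of that kind of comparison, not the article's experiment: plain SFT fits only the reward-1 arms in an offline log, while the weighted variant here is an off-policy REINFORCE-style update with importance weights and a mean-reward baseline, used as a generic stand-in that lets reward-0 samples push probability away from bad arms. The setup (3 arms, success probabilities, learning rate) is invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical bandit: 3 arms with these success probabilities (invented).
p_true = np.array([0.2, 0.5, 0.8])
K = len(p_true)

# Offline log collected by a uniform behaviour policy q.
n = 2000
arms = rng.integers(0, K, size=n)
rewards = (rng.random(n) < p_true[arms]).astype(float)
q = np.full(K, 1.0 / K)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def train(use_negatives: bool, steps: int = 500, lr: float = 0.5):
    """Offline update of a softmax policy over arms.

    use_negatives=False: SFT-style update, fit only the reward-1 samples.
    use_negatives=True:  importance-weighted update with a reward baseline,
                         so reward-0 samples carry a negative weight
                         (illustrative stand-in for the article's iw SFT).
    """
    theta = np.zeros(K)
    onehot = np.eye(K)[arms]                      # (n, K) indicator of logged arm
    baseline = rewards.mean() if use_negatives else 0.0
    for _ in range(steps):
        pi = softmax(theta)
        if use_negatives:
            w = pi[arms] / q[arms]                # importance weights pi/q
            coef = w * (rewards - baseline)       # negatives push mass away
        else:
            coef = rewards                        # zero for negative samples
        # Gradient of sum_i coef_i * log pi(a_i) for a softmax policy.
        grad = (coef[:, None] * (onehot - pi)).mean(axis=0)
        theta += lr * grad
    return softmax(theta)

print("SFT on positives only:", np.round(train(False), 3))
print("importance-weighted  :", np.round(train(True), 3))
```

In this toy run the SFT-only policy merely mirrors the distribution of positive samples across arms, whereas the weighted update concentrates most of its probability on the best arm, which is the qualitative behaviour the article attributes to using negative-sample information.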