Three-Gate Theory
Papers from these top researchers at Meta: read one, and there is one fewer left.
36Kr · 2025-11-17 09:52
Core Insights

- The article examines a puzzling phenomenon in reinforcement learning (RL) training of large models: substantial performance gains occur despite minimal parameter changes [1][3].

Group 1: Research Findings

- The paper analyzes the training dynamics of reinforcement learning with verifiable rewards (RLVR), debunking the view that its sparse parameter updates are merely superficial; instead, the sparsity reflects a consistent optimization bias built into RLVR [3][5].
- The research introduces a framework called the Three-Gate Theory to explain how RLVR steers parameter updates toward specific regions of the parameter space [5][7].

Group 2: Parameter Update Characteristics

- The study highlights an apparent paradox: RL training achieves large performance gains with sparse parameter updates, in contrast to the dense updates produced by supervised fine-tuning (SFT) [5][6].
- Update sparsity in RL training ranges from 36% to 92%, while SFT sparsity falls between 0.6% and 18.8%, a substantial difference in update density (see the measurement sketch below) [5][6].

Group 3: Three-Gate Theory Components

- The first gate, KL Anchoring, keeps RL updates from drifting far from the model's original output distribution, so movement in parameter space stays small (see the loss sketch below) [8].
- The second gate, Model Geometry, holds that RL updates favor low-curvature directions of the optimization landscape, preserving the model's original weight structure (see the subspace-overlap sketch below) [9].
- The third gate, Precision, notes that the limited precision of bfloat16 can mask small updates in RL, making them appear sparse (see the bfloat16 sketch below) [11].

Group 4: Implications for Parameter-Efficient Fine-Tuning

- The findings suggest that many parameter-efficient fine-tuning (PEFT) methods from the SFT era, particularly those built on sparse or low-rank priors, do not transfer well to RLVR [17].
- The study indicates that updating non-principal, low-amplitude weights aligns better with RLVR's optimization trajectory, whereas methods such as PiSSA may add no benefit and can cause instability [17].
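To make the sparsity numbers in Group 2 concrete, the sketch below counts the fraction of parameters left exactly unchanged after training, which is one natural reading of "update sparsity". This is an illustrative reconstruction, not the paper's measurement code; `base_model` and `tuned_model` stand for any pair of checkpoints with matching parameter names.

```python
import torch

def update_sparsity(base_model: torch.nn.Module, tuned_model: torch.nn.Module) -> float:
    """Fraction of parameters that are bit-identical before and after training."""
    unchanged, total = 0, 0
    base_params = dict(base_model.named_parameters())
    for name, p_tuned in tuned_model.named_parameters():
        p_base = base_params[name]
        unchanged += (p_tuned.data == p_base.data).sum().item()
        total += p_tuned.numel()
    return unchanged / total
```

Under this definition, a higher value means a sparser update, so the article's claim is that RL checkpoints score far higher than SFT checkpoints from the same base model.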
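The first gate, KL Anchoring, is the standard device of penalizing divergence from a frozen reference model during RL. The sketch below shows a generic KL-regularized policy-gradient loss in that spirit; the function name, tensor shapes, and the `kl_coeff` value are assumptions for illustration, not details from the paper.

```python
import torch
import torch.nn.functional as F

def kl_anchored_loss(policy_logits, ref_logits, advantages, actions, kl_coeff=0.05):
    """Policy-gradient loss with a KL penalty toward the frozen reference model."""
    log_probs = F.log_softmax(policy_logits, dim=-1)       # (batch, seq, vocab)
    ref_log_probs = F.log_softmax(ref_logits, dim=-1)

    # Log-probability of the sampled tokens under the current policy.
    action_log_probs = log_probs.gather(-1, actions.unsqueeze(-1)).squeeze(-1)

    # KL(policy || reference) per position, summed over the vocabulary.
    kl = (log_probs.exp() * (log_probs - ref_log_probs)).sum(dim=-1)

    # Maximize advantage-weighted log-probability, minimize drift from the reference.
    return -(advantages * action_log_probs).mean() + kl_coeff * kl.mean()
```

The anchoring term is what keeps the policy's output distribution, and hence its weights, close to the starting point.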
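For the second gate and the Group 4 claim that RLVR tends to move non-principal, low-amplitude weights, one rough probe is to project a layer's weight update onto the top-k singular subspace of the original weight matrix and measure how much of the update's energy lands there. The helper below is a hypothetical illustration of that idea, not the paper's metric.

```python
import torch

def principal_overlap(w_base: torch.Tensor, w_tuned: torch.Tensor, k: int = 16) -> float:
    """Fraction of the update's energy lying in the base weight's top-k singular subspace."""
    delta = (w_tuned - w_base).float()
    u, s, vh = torch.linalg.svd(w_base.float(), full_matrices=False)
    u_k, v_k = u[:, :k], vh[:k, :].T

    # Project the update onto the span of the top-k left and right singular vectors.
    projected = u_k @ (u_k.T @ delta @ v_k) @ v_k.T
    return (projected.norm() / delta.norm()).item()
```

A value near zero would be consistent with the article's claim that RLVR updates avoid the principal directions of the base weights, which is also why principal-direction methods like PiSSA are said to fit it poorly.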
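The third gate is easy to reproduce in isolation: bfloat16 keeps only 7 mantissa bits, so an update much smaller than a weight's magnitude can round away entirely, leaving the stored value unchanged. A minimal demonstration:

```python
import torch

w = torch.tensor([1.0], dtype=torch.bfloat16)       # a weight stored in bfloat16
delta = torch.tensor([1e-4], dtype=torch.bfloat16)  # a small RL-style update

updated = w + delta
print(torch.equal(w, updated))        # True: near 1.0 the update is below bfloat16 resolution

# The same update survives when the weight is kept in float32.
w32 = w.to(torch.float32)
print(torch.equal(w32, w32 + 1e-4))   # False: float32 resolves the change
```

This is the sense in which limited precision can make RL updates look sparser than the underlying optimization actually is.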