ICLR 2026 | How far can unsupervised reinforcement learning for large models go? A Tsinghua team offers a systematic answer
机器之心·2026-03-21 03:27

Core Insights

- The article traces the shift in reinforcement learning (RL) from supervised to unsupervised methods, highlighting the limits of purely supervised training as manual labeling grows more expensive and reliable annotations become hard to obtain in specialized fields [3][4].
- Unsupervised RL with internal rewards has shown promise in improving model performance, but it carries inherent limitations that can lead to performance degradation after the initial gains [4][14].
- The research identifies a "pre-training indicator" that can predict a model's trainability before extensive training, which is crucial for allocating RL resources efficiently [4][20].

Group 1: Unsupervised RL Mechanisms

- The article surveys unsupervised RL methods that build rewards from internal signals, grouped into two families: those based on the model's own certainty and those based on ensemble agreement across samples [7][10] (a minimal sketch of both reward types appears after these notes).
- A unified theoretical framework is proposed to explain the mechanism underlying these internal-reward methods, showing that they mainly sharpen the model's existing preferences rather than create new knowledge [10][14].
- Their success therefore hinges on how well model confidence aligns with correctness: models with strong, correct priors can benefit from internal rewards, while models with incorrect priors face a near-inevitable collapse [14][20].

Group 2: Key Findings

- Finding One: The degree of alignment between confidence and correctness determines whether internal-reward methods succeed, and models tend to collapse after a certain point in training [14][16].
- Finding Two: In small-scale training scenarios, internal rewards can still yield stable performance improvements, even when training starts from incorrect initial beliefs [16][17].
- Finding Three: The "Model Collapse Step" metric is introduced as a lightweight indicator of a model's suitability for RL, allowing its performance to be predicted without extensive ground-truth labeling [20][23] (an illustrative probe is sketched below).

Group 3: External Reward Methods

- Finding Four: External-reward methods are identified as a scalable direction for unsupervised RL, exploiting unannotated data and the asymmetry between generation and validation to obtain objective feedback [24][25][27] (see the generation-validation sketch after these notes).
- External rewards verify whether generated answers are correct rather than reinforcing the model's self-confidence, which can lead to more sustainable improvements [27][28].
- Internal and external rewards are framed as complementary tools, with external methods holding the potential to unlock scalable unsupervised RL [29][30].
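To make the two internal-reward families in Group 1 concrete, here is a minimal sketch assuming the common instantiations of these ideas: certainty measured as the mean log-probability the model assigns to its own sampled tokens, and ensemble agreement measured by majority voting across sampled answers. The summary does not give the paper's exact formulas, so both reward functions and all names below are illustrative.

```python
# Hypothetical sketch of the two internal-reward families (certainty-based and
# ensemble-based). The exact definitions used in the paper are not given in the
# summary; mean token log-probability and majority-vote agreement are common
# instantiations, not the authors' code.
from collections import Counter
from typing import List


def certainty_reward(token_logprobs: List[float]) -> float:
    """Certainty-based internal reward: average log-probability the model
    assigns to its own sampled tokens (higher = more confident)."""
    if not token_logprobs:
        return 0.0
    return sum(token_logprobs) / len(token_logprobs)


def ensemble_reward(sampled_answers: List[str]) -> List[float]:
    """Ensemble-based internal reward: each sampled answer is rewarded by how
    often it agrees with the other samples (majority voting / self-consistency)."""
    counts = Counter(sampled_answers)
    n = len(sampled_answers)
    return [counts[a] / n for a in sampled_answers]


if __name__ == "__main__":
    # Toy example: 5 sampled answers to the same prompt.
    answers = ["42", "42", "41", "42", "7"]
    print(ensemble_reward(answers))               # [0.6, 0.6, 0.2, 0.6, 0.2]
    print(certainty_reward([-0.1, -0.3, -0.05]))  # -0.15
```

Note how neither reward consults any ground-truth label, which is what makes these methods "unsupervised" and also why they can only sharpen preferences the model already holds.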
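The summary introduces the "Model Collapse Step" (Finding Three) only as a lightweight, label-free indicator without defining it. The sketch below is one assumed reading, not the paper's metric: run a short unsupervised probe, watch the self-agreement of sampled answers, and record the first step at which output diversity collapses. The hook `sample_answers`, the threshold, and the overall procedure are hypothetical illustrations.

```python
# Purely illustrative "collapse step"-style probe; the paper's actual
# Model Collapse Step metric is not specified in the summary.
from collections import Counter
from typing import Callable, List


def model_collapse_step(
    sample_answers: Callable[[int], List[str]],  # hypothetical hook: answers sampled at step t
    max_steps: int = 200,
    agreement_threshold: float = 0.95,
) -> int:
    """Return the first probe step at which one answer dominates the sample
    pool (self-agreement >= threshold), i.e. diversity has collapsed. A larger
    value would be read as more remaining headroom for unsupervised RL."""
    for step in range(max_steps):
        answers = sample_answers(step)
        if not answers:
            continue
        top_count = Counter(answers).most_common(1)[0][1]
        if top_count / len(answers) >= agreement_threshold:
            return step
    return max_steps  # never collapsed within the probe budget


if __name__ == "__main__":
    # Toy stand-in: the set of distinct answers shrinks as steps increase.
    def fake_sampler(step: int) -> List[str]:
        pool = ["A", "B", "C", "D"]
        k = max(1, 4 - step // 3)
        return [pool[i % k] for i in range(8)]

    print(model_collapse_step(fake_sampler, max_steps=20))  # prints 9
```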
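Finding Four rests on the asymmetry between generating an answer and validating it: verification is much cheaper, so a checker can produce objective feedback on unlabeled data. As a minimal illustration (the task and the 0/1 reward shape are assumptions, not the paper's setup), the checker below rewards a candidate only if it passes a test derived from the problem itself, never consulting the model's confidence.

```python
# Minimal sketch of an external, verification-based reward. The task
# (integer root finding) and reward shape are illustrative assumptions.
from typing import Callable, Optional


def external_reward(problem: Callable[[int], int], candidate: Optional[int]) -> float:
    """Reward 1.0 iff the candidate actually satisfies the problem's check;
    the checker never looks at the model's own confidence."""
    if candidate is None:
        return 0.0
    return 1.0 if problem(candidate) == 0 else 0.0


if __name__ == "__main__":
    # Unlabeled "problem": find an integer root of f(x) = x^2 - 5x + 6.
    f = lambda x: x * x - 5 * x + 6

    # Pretend these guesses came from the policy; only verification sets the reward.
    for guess in [1, 2, 3, 7]:
        print(guess, external_reward(f, guess))  # 1->0.0, 2->1.0, 3->1.0, 7->0.0
```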
