Can a single training example dramatically boost a large model's mathematical reasoning performance?
机器之心 · 2025-05-09 09:02
Core Insights
- The article discusses significant advancements in large language models (LLMs) regarding reasoning capabilities, particularly on complex mathematical tasks, driven by Reinforcement Learning with Verifiable Reward (RLVR) [1][2].

Group 1: Research Findings
- Researchers from the University of Washington and Microsoft found that training on just one example (1-shot RLVR) can significantly enhance model performance across a range of mathematical reasoning tasks [2][3].
- On the MATH500 dataset, 1-shot RLVR improved Qwen2.5-Math-1.5B from 36.0% to 73.6% and Qwen2.5-Math-7B from 51.0% to 79.2%, comparable to results obtained with a larger 1.2k-example dataset [3][13].
- The 1-shot RLVR approach also proved effective on non-mathematical reasoning tasks such as ARC-Easy and ARC-Challenge [5].

Group 2: Methodology and Data Selection
- Training combined a policy gradient loss, a KL divergence loss, and an entropy loss, with the policy gradient loss identified as the primary driver of improvement [7][19] (a minimal sketch of this combination appears after Group 3).
- Researchers used a metric called the historical variance score to prioritize examples from the dataset, although this selection method was not deemed optimal [8][19].
- The findings indicated that 1-shot RLVR generalizes well across mathematical themes, suggesting that a single training example from one topic can improve performance on others [13][16].

Group 3: Observations and Implications
- A saturation-and-generalization phenomenon was observed: training accuracy on the single example quickly approached 100%, yet downstream task performance continued to improve [10][11].
- The study highlighted the importance of encouraging exploration through the entropy loss, which contributed to better 1-shot RLVR performance [20].
- The results support previous conclusions that foundation models used for RLVR often possess latent reasoning capabilities that can be activated with minimal data [22].
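As a rough companion to the methodology summarized in Group 2, the sketch below shows one way a policy gradient term, a KL penalty against a frozen reference policy, and an entropy bonus could be combined into a single RLVR objective, plus one plausible reading of the historical variance score for ranking candidate training examples. The function names, coefficient values, tensor layout, and the exact variance definition are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F


def rlvr_loss(logits, ref_logits, actions, rewards,
              kl_coef=0.01, entropy_coef=0.01):
    """Sketch of a combined RLVR objective (illustrative, not the paper's code).

    logits:     (batch, seq, vocab) current-policy logits for sampled responses
    ref_logits: (batch, seq, vocab) frozen reference-policy logits
    actions:    (batch, seq) sampled token ids
    rewards:    (batch,) verifiable reward per response, e.g. 1.0 if the final
                answer is checked correct, else 0.0
    """
    log_probs = F.log_softmax(logits, dim=-1)
    ref_log_probs = F.log_softmax(ref_logits, dim=-1)

    # Log-probability of the tokens that were actually sampled.
    token_logp = log_probs.gather(-1, actions.unsqueeze(-1)).squeeze(-1)

    # 1) Policy gradient term: reinforce whole responses by their reward.
    pg_loss = -(rewards.unsqueeze(-1) * token_logp).mean()

    # 2) KL penalty: keep the policy close to the reference model.
    kl = (log_probs.exp() * (log_probs - ref_log_probs)).sum(-1).mean()

    # 3) Entropy bonus: subtracted from the loss, so higher entropy
    #    (more exploration) is rewarded.
    entropy = -(log_probs.exp() * log_probs).sum(-1).mean()

    return pg_loss + kl_coef * kl - entropy_coef * entropy


def historical_variance_score(accuracy_history):
    """One plausible reading of the 'historical variance score': rank each
    training example by the variance of its accuracy across checkpoints, so
    examples whose accuracy fluctuates most are prioritized.
    (Assumption: the paper's exact definition may differ.)"""
    acc = torch.tensor(accuracy_history, dtype=torch.float32)
    return acc.var(unbiased=False).item()
```

Consistent with the article's summary, the policy gradient term does most of the work in such a setup, while the entropy bonus helps sustain exploration even after training accuracy on the single example has saturated.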