Soft Rewards

Large-model RL is not just for math and code! A 7B reward model handles medicine, law, economics and every other discipline, and solves problems without chain-of-thought
量子位 · 2025-04-02 07:40
Core Insights
- The article covers new work from Tencent and Soochow University that extends RLVR (reinforcement learning with verifiable rewards) training to disciplines beyond mathematics and coding, including medicine, chemistry, law, psychology, and economics [3][4].

Group 1: Framework and Methodology
- The framework uses a model-based soft reward, which shows significant improvements in generalization, robustness, and scalability over traditional binary rule-based rewards [4]; a minimal sketch contrasting the two appears after this summary.
- The approach builds on the observation that, when tasks have objective reference answers, different large language models show high agreement in their binary correct/incorrect judgments [7].
- The team distilled a 7B reward model from a 72B model (Qwen2.5-72B-Instruct) without requiring domain-specific annotations, relying only on data collected during the online exploration phase [9].

Group 2: Experimental Results
- The study sampled 6,000 questions from ExamQA, covering a wide range of subjects across science, engineering, and the humanities [12].
- The RM-7B reward model delivered superior performance on free-form answer tasks compared with various baselines, including base models, fine-tuned models, and rule-based reinforcement learning [14].
- RM-7B achieved an average score of 62.5 on multi-subject tasks, outperforming other methods in both the binary-reward and soft-reward settings [15].

Group 3: Scalability and Future Research
- The results indicate that model-based rewards scale better as data volume increases, suggesting a more effective way to handle unstructured reference answers [18].
- The authors note that while chain-of-thought (CoT) reasoning is beneficial in many scenarios, whether it is necessary for judging semantic equivalence between a reference answer and a model response remains an open question [16].
- The study imposes no format constraints on reference answers or model responses, which reduces the labor of data standardization, but the role of format-related constraints and rewards needs further examination [17].
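
The model-based soft reward mentioned above is the core technical idea: instead of a rule that returns 1 or 0 based on an exact match with the reference answer, a generative reward model judges whether the response agrees with the reference, and the probability it assigns to the "correct" verdict is used as the reward. The sketch below is a minimal illustration of that contrast, not the authors' code; the judge_verdict_logprobs stub, the prompt wording it describes, and the two-token normalization are assumptions made for the example.

```python
# Minimal sketch contrasting a binary rule-based reward with a
# model-based soft reward for RLVR-style training.
# The reward-model call is stubbed out; in practice it would be a
# generative judge (e.g. a distilled 7B model) scoring a verdict token.

import math
import re


def binary_rule_reward(response: str, reference: str) -> float:
    """Rule-based reward: 1.0 only if the normalized strings match exactly."""
    norm = lambda s: re.sub(r"\s+", " ", s.strip().lower())
    return 1.0 if norm(response) == norm(reference) else 0.0


def soft_model_reward(question: str, response: str, reference: str) -> float:
    """Model-based soft reward: probability the judge assigns to 'correct'.

    Assumes a generative reward model answering a yes/no judgment prompt;
    the reward is P("yes"), recovered from the verdict-token log-probs.
    """
    logp_yes, logp_no = judge_verdict_logprobs(question, response, reference)
    # Softmax over the two verdict tokens -> probability of "correct".
    return math.exp(logp_yes) / (math.exp(logp_yes) + math.exp(logp_no))


def judge_verdict_logprobs(question: str, response: str, reference: str):
    """Hypothetical stand-in for the reward model.

    A real implementation would format a judgment prompt such as
    "Question: ... Reference: ... Response: ... Is the response correct?"
    and read the log-probabilities of the "yes"/"no" tokens from the model.
    Here we fake a judge that is confident only when the match is obvious.
    """
    if response.strip().lower() == reference.strip().lower():
        return (-0.05, -3.0)   # near-certain "yes"
    if reference.strip().lower() in response.strip().lower():
        return (-0.45, -1.0)   # semantically supported answer
    return (-2.5, -0.1)        # likely wrong


if __name__ == "__main__":
    q = "Which organ produces insulin?"
    ref = "The pancreas"
    for ans in ["The pancreas", "Insulin is produced by the pancreas.", "The liver"]:
        print(
            f"{ans!r:45} binary={binary_rule_reward(ans, ref):.1f} "
            f"soft={soft_model_reward(q, ans, ref):.2f}"
        )
```

On the second example the exact-match rule returns 0 even though the answer is semantically correct, while the soft reward stays high; this is the kind of gap on unstructured, free-form reference answers that the article attributes to binary rule-based rewards.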