Meta Superintelligence Labs' New Paper Mired in Controversy! Accused of Ignoring a Large Body of Prior Research

Core Viewpoint
- Meta Superintelligence Labs (MSL) is facing controversy over its second paper, "Language Self-Play For Data-Free Training," which critics say overlooks prior research and offers little that is new [2][25].

Summary by Sections

Overview of the Paper
- The paper's core idea is a method called Language Self-Play (LSP) that lets a large language model improve itself without any additional training data [3][4].
- LSP targets the heavy dependence of large language models on large volumes of high-quality training data, a resource that is ultimately limited [4].

Methodology
- LSP frames learning as a game in which the same language model plays two opposing roles, making data-free training possible [5].
- In this adversarial setup, the Challenger generates ever harder queries or instructions to drive down the Solver's expected reward, while the Solver must understand and answer them to maximize its own reward, much like a minimax game [7].
- Unlike conventional adversarial training between two separate models, LSP has a single language model act as both Challenger and Solver, switching roles via a dedicated Challenger prompt (a minimal sketch of this loop follows at the end of this summary) [8].

Implementation and Challenges
- The work applies GRPO, an existing reinforcement learning technique, to turn the game into a model-training procedure [9].
- The reward mechanism lets the Challenger's questions target the Solver's weaknesses, driving continual improvement [10].
- This zero-sum variant of the method is called Language Self-Play Zero (LSP-Zero) [11].
- LSP-Zero can, however, degenerate: through reward hacking the model drifts toward meaningless content that still scores well [12].

Enhancements
- To counter this, the researchers add a self-assessed quality reward (R_Q) to the LSP algorithm, steering the game toward high-quality interactions so that training remains sustainable (see the second sketch at the end) [13].

Experimental Results
- Experiment 1 compared LSP-Zero and LSP against a traditional data-driven baseline; both self-play methods performed on par with data-driven training and clearly outperformed the original model [18].
- On conversational and open-ended instruction data, LSP outperformed GRPO [18].
- Experiment 2 took a model already trained on data and continued training it with LSP-Zero and LSP, raising the overall win rate from 40.9% to 43.1% [21].
- The gains were especially pronounced on the Vicuna dataset, suggesting that LSP can keep unlocking a model's potential after data-driven training has ended [22][24].

Criticism and Response
- Critics argue that MSL's work overlooks a substantial body of prior research; several researchers say they conducted similar studies that go uncited [25][26].
- The paper has been described as potentially rehashing older work, raising questions about its originality [30].
- As of now, neither MSL nor the authors have responded to the criticism [31].
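The sketch below illustrates the self-play loop described above: one model switches roles through a Challenger prompt, the Solver's answers are scored as a group in GRPO style, and the Challenger receives the negated Solver reward (the zero-sum LSP-Zero signal). The functions `generate` and `score_answer`, the prompt text, and the group size are hypothetical placeholders, not the paper's implementation.

```python
# Minimal sketch of one Language Self-Play (LSP-Zero) round, assuming a single
# model plays both roles. `generate` and `score_answer` are hypothetical
# stand-ins for a real LLM call and a real reward model.

import random
import statistics

CHALLENGER_PROMPT = (
    "You are the Challenger. Write one difficult instruction that is likely "
    "to expose weaknesses in the assistant answering it."
)

def generate(prompt: str) -> str:
    """Hypothetical stand-in for sampling from the shared language model."""
    return f"<model output for: {prompt[:40]}...>"

def score_answer(query: str, answer: str) -> float:
    """Hypothetical reward model; returns a quality score in [0, 1]."""
    return random.random()

def lsp_zero_round(group_size: int = 4):
    # 1) The model, acting as Challenger (role switched via the prompt),
    #    produces a hard query.
    query = generate(CHALLENGER_PROMPT)

    # 2) The same model, acting as Solver, answers the query several times,
    #    giving a group of samples as in GRPO-style training.
    answers = [generate(query) for _ in range(group_size)]
    rewards = [score_answer(query, a) for a in answers]

    # 3) Group-relative advantages: each answer is compared to the mean
    #    reward of its group (the core idea behind GRPO).
    mean_r = statistics.mean(rewards)
    solver_advantages = [r - mean_r for r in rewards]

    # 4) Zero-sum signal: the Challenger is rewarded for lowering the
    #    Solver's expected reward, giving the minimax flavour of LSP-Zero.
    challenger_reward = -mean_r

    return query, answers, solver_advantages, challenger_reward

if __name__ == "__main__":
    q, ans, adv, cr = lsp_zero_round()
    print("query:", q)
    print("solver advantages:", adv)
    print("challenger reward:", cr)
```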
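A second sketch, for the Enhancements point: adding a quality reward R_Q on top of the zero-sum game reward discourages the degenerate, reward-hacked interactions that LSP-Zero can drift into. How R_Q is computed and weighted here (`self_quality_reward`, `quality_weight`) is an assumption for illustration only, not the paper's exact formulation.

```python
# Sketch of the LSP fix for reward hacking: mix a self-assessed quality
# reward R_Q into the game reward. `self_quality_reward` is a hypothetical
# placeholder for the model grading its own (query, answer) interaction.

def self_quality_reward(query: str, answer: str) -> float:
    """Hypothetical self-assessed quality score R_Q in [0, 1]."""
    return 0.5

def lsp_reward(game_reward: float, query: str, answer: str,
               quality_weight: float = 1.0) -> float:
    """LSP reward: zero-sum game reward plus a quality bonus.

    Rewarding interaction quality keeps the game from collapsing into
    adversarial nonsense, because degenerate queries and answers now also
    reduce the reward the roles can earn.
    """
    return game_reward + quality_weight * self_quality_reward(query, answer)
```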