Self-Play
New Breakthrough in Large Model Training: Meta Proposes LSP, Capability Leaps Without Data
36Kr · 2025-09-22 01:48
Core Insights
- The lack of high-quality data has become a bottleneck for the continuous learning and capability enhancement of large language models (LLMs) [1]
- Meta has proposed a new reinforcement learning method called "Language Self-Play" (LSP) that enables models to self-improve without relying on additional data [1][2]

Methodology
- LSP uses a self-play framework, treating the model's capability as performance in a competitive game, so that it can generate stronger strategies by competing against itself [2]
- In the LSP framework, the same pre-trained LLM plays two different roles: the "Challenger" generates challenging queries, while the "Solver" responds to those queries to maximize task rewards (see the self-play sketch after this summary) [3][5]

Technical Innovations
- LSP incorporates two core technologies:
  - Group Relative Policy Optimization (GRPO): the Challenger generates multiple queries and the Solver produces several responses to each, so the group of responses establishes a quality baseline for scoring (a group-relative update sketch also follows this summary) [5]
  - KL divergence regularization: prevents the model from drifting too far from the initial reference model, stabilizing training [5]

Evolution of LSP
- The initial version, LSP-Zero, relied solely on adversarial interaction and ran into problems such as "meaningless adversarial games" [6][7]
- The upgraded LSP introduces a self-reward mechanism in which a reference model scores the quality of each "Challenger query + Solver response" pair, encouraging high-quality interactions [7]

Experimental Validation
- Experiments on the AlpacaEval benchmark showed that LSP and LSP-Zero significantly improved the base model Llama-3.2-3B-Instruct, achieving results comparable to GRPO despite using no training data [10][11]
- LSP outperformed LSP-Zero, particularly on tasks built around conversational prompts, showing its advantage in those scenarios [11][14]

Limitations
- LSP scored slightly below GRPO on the Koala dataset, attributed to the structured style of LSP-generated queries, which did not match the dataset's loose conversational style [16]

Future Implications
- LSP addresses the data-dependency problem in large model training and validates the feasibility of "data-free training," potentially reducing training costs and resource investment [17]
- Once AI can collect its own experiential data, the self-play framework may show significant potential for knowledge expansion [17]
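The summary describes the Challenger/Solver interplay and the self-reward scoring only at a high level. The loop below is a minimal, hypothetical sketch of one LSP-style round, assuming a `generate` callable that wraps the shared pre-trained model under role-specific prompts and a `score` callable that wraps the frozen reference model acting as judge; all names and hyperparameters are illustrative, not Meta's implementation.

```python
from typing import Callable, Dict, List

def lsp_round(
    generate: Callable[[str], str],      # shared pre-trained LLM, prompted per role (assumed wrapper)
    score: Callable[[str, str], float],  # frozen reference model as self-reward judge (assumed wrapper)
    challenger_prompt: str = "Write one challenging user query for an assistant.",
    num_queries: int = 4,
    answers_per_query: int = 4,
) -> List[Dict]:
    """One round of a Language-Self-Play-style loop: the same model, acting as
    the Challenger, writes queries; acting as the Solver, it answers each query
    several times; the reference model scores every (query, answer) pair."""
    records = []
    for _ in range(num_queries):
        query = generate(challenger_prompt)                            # Challenger role
        answers = [generate(query) for _ in range(answers_per_query)]  # Solver role
        rewards = [score(query, a) for a in answers]                   # self-reward signal
        records.append({"query": query, "answers": answers, "rewards": rewards})
    return records
```

The per-query reward lists collected here are exactly what a group-relative update would normalize and optimize against, as sketched next.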
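The summary names GRPO and KL-divergence regularization without giving formulas. Below is a minimal sketch of how a group-relative advantage and a KL penalty toward the reference model are commonly combined in GRPO-style training; the tensor shapes, clipping threshold, and KL coefficient are illustrative assumptions, not values from the LSP paper.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Normalize each response's reward against the other responses sampled
    for the same query (rewards shape: [num_queries, group_size])."""
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + 1e-6)

def grpo_style_loss(
    logp_new: torch.Tensor,   # log-prob of each response under the current policy
    logp_old: torch.Tensor,   # log-prob under the policy that sampled the responses
    logp_ref: torch.Tensor,   # log-prob under the frozen reference model
    rewards: torch.Tensor,
    clip_eps: float = 0.2,
    kl_coef: float = 0.04,
) -> torch.Tensor:
    adv = group_relative_advantages(rewards)
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    policy_loss = -torch.min(ratio * adv, clipped * adv).mean()
    # Estimator of KL(policy || reference); penalizing it keeps the policy
    # close to the initial reference model, as the article describes.
    log_ratio_ref = logp_ref - logp_new
    kl = (torch.exp(log_ratio_ref) - log_ratio_ref - 1.0).mean()
    return policy_loss + kl_coef * kl
```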
Tsinghua & BIGAI Introduce the "Absolute Zero" Training Method: Large Models Unlock Reasoning via Self-Play with Zero External Data
量子位 (QbitAI) · 2025-05-12 04:11
克雷西, QbitAI | WeChat official account QbitAI

Can a pre-trained large model learn to reason through self-play alone, without introducing any external data? Researchers from Tsinghua University, the Beijing Institute for General Artificial Intelligence (BIGAI), and Pennsylvania State University have proposed a training method called "Absolute Zero." By having the model generate and solve its own tasks according to a reasoning objective, the method lets it acquire reasoning ability. In testing, models trained with "Absolute Zero" outperformed models trained on expert-annotated samples. Moreover, although "Absolute Zero" trains only in a code environment, it also yields significant gains in mathematical reasoning. The work sparked discussion on Reddit, where the poster who shared it marveled: has self-evolving AI been unlocked?

The Proposer generates new reasoning tasks, and the Solver solves them. By alternating and coordinating these two roles, the model can autonomously build its own distribution of learning tasks and steadily improve its reasoning ability while solving them. "Absolute Zero" represents every reasoning task uniformly as a (p, i, o) triplet (program, input, output). Here the program is a piece of executable code, the input is that program's input data, and the output is what the program produces for the given input (a minimal verification sketch follows this excerpt).

Self-learning by posing and solving problems

"Absolute Zero" adopts a self-play learning paradigm. In this ...
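The excerpt above describes tasks as (p, i, o) triplets whose ground truth comes from actually executing the program, so no human labels are needed. The following is a minimal sketch of that verification step under those assumptions; the function names and toy program are illustrative, not the paper's code.

```python
import subprocess
import sys

def run_program(program: str, program_input: str, timeout: float = 5.0) -> str:
    """Execute a candidate program in a subprocess and capture its stdout.
    The code executor turns a (program, input) pair into a ground-truth
    output, against which the Solver's prediction can be checked."""
    result = subprocess.run(
        [sys.executable, "-c", program],
        input=program_input,
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.stdout.strip()

def verify_triplet(program: str, program_input: str, predicted_output: str) -> bool:
    """A task is a (p, i, o) triplet; return True when the Solver's predicted
    output matches what the program actually prints for the given input."""
    try:
        return run_program(program, program_input) == predicted_output
    except Exception:
        return False

# Toy example: the Proposer emits a small program and an input,
# and the Solver must predict the output by reasoning about the code.
program = "x = input()\nprint(int(x) * 2 + 1)"
print(verify_triplet(program, "20", "41"))   # True
```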