Self-Play
How can large models generalize to multi-agent reasoning? Tsinghua proposes MARSHAL, a strategy-game self-play scheme
机器之心· 2026-01-09 04:08
Core Insights
- The MARSHAL framework, developed by Tsinghua University and other institutions, uses reinforcement-learning self-play in strategy games to significantly enhance the reasoning capabilities of large models in multi-agent systems [2][7][31]
- The framework addresses two main challenges in multi-agent systems: credit assignment in multi-round interactions and advantage estimation among heterogeneous agents [5][7]
Background and Challenges
- Existing models such as DeepSeek-R1 have demonstrated the value of reinforcement learning with verifiable rewards (RLVR) in single-agent settings, but its application to complex multi-agent interactions is still being explored [5]
- The two core technical challenges identified are:
  1. Credit assignment in multi-round interactions, where existing methods struggle to trace final outcomes back to specific actions [5]
  2. Advantage estimation among heterogeneous agents, which complicates joint training and causes performance volatility [7]
MARSHAL Method Introduction
- MARSHAL builds on the Group Relative Policy Optimization (GRPO) architecture and introduces two key algorithmic improvements, a Turn-level Advantage Estimator and Agent-specific Advantage Normalization, to strengthen multi-agent reasoning (see the sketch after this summary) [12][14]
- The framework was evaluated on six strategy games, three for training and three for testing, covering both competitive and cooperative scenarios [12]
Core Experiments
- MARSHAL-trained expert agents showed a significant performance increase, achieving up to 28.7% higher win rates on the held-out test games [13][19]
- The model also generalized beyond games, with accuracy improvements of 10.0% on AIME and 7.6% on GPQA across reasoning tasks [19][20]
Reasoning Mode Analysis
- Qualitative analysis revealed that game training fostered two emergent capabilities, role awareness and intent recognition, which are crucial for decision-making in uncertain environments [22]
- Quantitative analysis indicated that MARSHAL reduced inter-agent misalignment by 11.5%, improving communication efficiency among agents [24]
Ablation Studies
- Self-play training outperformed fixed-opponent training; models trained against fixed opponents tended to overfit and performed poorly in test scenarios [26]
- The necessity of the Turn-level Advantage Estimator and Agent-specific Advantage Normalization was confirmed, highlighting their roles in handling long-sequence decisions and differing reward distributions across agents [28]
Conclusion
- The MARSHAL framework enhances the reasoning capabilities of large language models in multi-agent systems through self-play in strategy games, indicating potential for broader application in complex multi-agent environments [31][34]
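To make the two algorithmic components concrete, here is a minimal sketch of GRPO-style group-relative advantages extended with turn-level credit assignment and per-agent normalization, as the summary describes them. The function name, data layout, and discounting scheme are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (assumptions, not the MARSHAL code): turn-level returns give each
# turn its own credit; advantages are then normalized within each agent's group so
# heterogeneous reward scales do not dominate the policy update.
import numpy as np

def group_relative_advantages(rollouts, gamma=1.0, eps=1e-8):
    """rollouts: list of episodes; each episode is a list of turns,
    each turn a dict {"agent": str, "reward": float}.
    Returns a dict mapping (episode_idx, turn_idx) -> advantage."""
    turn_records = []  # (episode_idx, turn_idx, agent, turn-level return)
    for e_idx, episode in enumerate(rollouts):
        G, returns = 0.0, [0.0] * len(episode)
        for t in reversed(range(len(episode))):       # turn-level credit assignment
            G = episode[t]["reward"] + gamma * G
            returns[t] = G
        for t, turn in enumerate(episode):
            turn_records.append((e_idx, t, turn["agent"], returns[t]))

    advantages = {}
    for agent in {rec[2] for rec in turn_records}:     # agent-specific normalization
        vals = np.array([rec[3] for rec in turn_records if rec[2] == agent])
        mean, std = vals.mean(), vals.std()
        for rec in turn_records:
            if rec[2] == agent:
                advantages[(rec[0], rec[1])] = (rec[3] - mean) / (std + eps)
    return advantages

# Toy usage: two 3-turn episodes between a "seeker" and a "hider" agent.
rollouts = [
    [{"agent": "seeker", "reward": 0.0}, {"agent": "hider", "reward": 0.0},
     {"agent": "seeker", "reward": 1.0}],
    [{"agent": "seeker", "reward": 0.0}, {"agent": "hider", "reward": 1.0},
     {"agent": "seeker", "reward": 0.0}],
]
print(group_relative_advantages(rollouts))
```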
Vision-Zero: zero-data self-evolution for VLMs! Yiran Chen's team proposes a new zero-supervision training paradigm
机器之心· 2025-10-11 03:29
Core Insights
- The article discusses Vision-Zero, a self-play framework for vision-language models (VLMs) that aims to overcome the limitations of traditional training methods, which rely heavily on human-annotated data and reinforcement-learning rewards [6][7][26]
Background
- VLMs perform impressively on multimodal tasks but face data scarcity caused by high annotation costs, as well as a knowledge ceiling that limits model capability [6]
- Vision-Zero introduces a self-play strategy that lets VLMs generate complex reasoning data autonomously, eliminating the need for manual annotation [6]
Framework Characteristics
- Vision-Zero employs a self-play framework built on social-reasoning games, enabling agents to generate high-complexity reasoning data during self-play [6]
- It accepts any form of image as input, improving the model's ability to generalize across domains [6]
- The framework incorporates an iterative self-play policy optimization algorithm that addresses the performance bottlenecks common in traditional self-play methods [7]
Game Design
- Inspired by social-reasoning games, Vision-Zero defines rules under which agents must deduce hidden roles from subtle differences between images, fostering complex reasoning chains [12][15]
- The game requires only two images with slight differences, making data construction simple and inexpensive (see the sketch after this summary) [17]
Training Methodology
- The framework uses a dual-phase alternating training approach to avoid local equilibria and knowledge saturation, improving the model's ability to explore new reasoning paths [20]
- This method significantly outperforms single-phase training across a range of tasks [20]
Experimental Results
- Vision-Zero shows strong task generalization, outperforming state-of-the-art methods that require annotated data on multiple benchmark datasets [22]
- Models trained with Vision-Zero effectively mitigate the negative-transfer issues commonly seen in VLMs, maintaining performance across different tasks [24]
Implications
- Vision-Zero demonstrates the feasibility and potential of self-play for moving from single-task to general-task settings, breaking free from the constraints of manual annotation and knowledge ceilings [26]
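The game-design point above lends itself to a small sketch: building one social-reasoning game instance from a single image pair, where one hidden role sees the edited image and the others see the original. The role names, dataclass, and assignment logic are illustrative assumptions based on the summary, not the authors' implementation.

```python
# Minimal sketch (assumptions, not the Vision-Zero code): a game instance needs only
# two images with slight differences; hidden roles must be deduced from those
# differences, so any image pair can serve as training data without annotation.
import random
from dataclasses import dataclass

@dataclass
class GameInstance:
    players: list       # player ids
    roles: dict         # player id -> "spy" or "civilian" (hidden ground truth)
    observations: dict  # player id -> which image that player sees

def build_game(original_image: str, edited_image: str, n_players: int = 4, seed=None):
    rng = random.Random(seed)
    players = [f"player_{i}" for i in range(n_players)]
    spy = rng.choice(players)  # exactly one player observes the edited image
    roles = {p: ("spy" if p == spy else "civilian") for p in players}
    observations = {p: (edited_image if p == spy else original_image) for p in players}
    return GameInstance(players, roles, observations)

# Usage: any image pair works, so no human annotation is required.
game = build_game("scene.png", "scene_edited.png", n_players=4, seed=0)
print(game.roles)         # hidden roles, used only to compute the game reward
print(game.observations)  # what each VLM agent actually gets to see
```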
OpenAI takes IOI gold, second only to the top five human contestants! The reasoning model it entered had just won IMO gold
创业邦· 2025-08-12 03:33
Core Viewpoint
- OpenAI's reasoning model achieved a gold-medal score at the 2025 International Olympiad in Informatics (IOI), ranking first among AI participants and demonstrating significant advances in general reasoning capability [2][9][16]
Group 1: Competition Performance
- OpenAI competed in the online AI track of IOI 2025, placing just behind the top five human contestants out of 330 participants and taking first place among AI entrants [6][8]
- The model OpenAI used was not specifically trained for the IOI; it was a general reasoning model that nonetheless performed exceptionally well [8][14]
- Compared with last year's result, OpenAI's score jumped from the 49th percentile to the 98th percentile, a dramatic leap in capability [9]
Group 2: Model and Strategy
- OpenAI used the same model that won gold at the 2025 International Mathematical Olympiad (IMO), without any modifications for the IOI [14][15]
- The strategy involved sampling answers from different models and using a heuristic method to select which ones to submit (a sketch follows below), which contributed to the successful outcome [14]
Group 3: Community Reaction and Future Implications
- The achievement has generated excitement in the community, underscoring the growing strength of general reasoning abilities obtained without specialized training [16]
- There is anticipation that OpenAI will release a public version of the technology behind the gold-medal performance, pointing to further advances in AI capability [18]
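The submission strategy mentioned above can be sketched as "sample many candidates, rank them with a heuristic, submit the top few." The summary does not say which heuristic OpenAI used, so the scoring function, submission cap, and all names here are assumed stand-ins for illustration only.

```python
# Minimal sketch (assumptions, not OpenAI's pipeline): rank candidate programs
# sampled from several models by a heuristic score and keep the top ones,
# respecting a per-task submission limit.
from typing import Callable, List, Tuple

def select_submissions(
    candidates: List[str],                  # candidate programs from several models
    score: Callable[[str], float],          # heuristic score, e.g. sample-test pass rate
    max_submissions: int = 50,              # hypothetical per-task submission cap
) -> List[Tuple[float, str]]:
    ranked = sorted(((score(c), c) for c in candidates),
                    key=lambda pair: pair[0], reverse=True)
    return ranked[:max_submissions]

# Toy usage with precomputed stand-in scores (e.g. pass rates on sample tests).
toy_scores = {"prog_a": 1.0, "prog_b": 0.5, "prog_c": 0.0}
picked = select_submissions(list(toy_scores), score=toy_scores.get, max_submissions=2)
print(picked)  # -> [(1.0, 'prog_a'), (0.5, 'prog_b')]
```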