The MARSHAL Framework
How do large models generalize to multi-agent reasoning? Tsinghua proposes MARSHAL, a self-play approach based on strategy games
机器之心· 2026-01-09 04:08
Core Insights
- The MARSHAL framework, developed by Tsinghua University and other institutions, uses reinforcement learning through self-play in strategy games to significantly enhance the reasoning capabilities of large models in multi-agent systems [2][7][31]
- The framework addresses two main challenges in multi-agent systems: credit assignment in multi-round interactions and advantage estimation among heterogeneous agents [5][7]

Background and Challenges
- Existing models such as DeepSeek-R1 have shown the value of reinforcement learning with verifiable rewards (RLVR) in single-agent scenarios, but its application to complex multi-agent interactions remains largely unexplored [5]
- The two core technical challenges identified are:
  1. Credit assignment in multi-round interactions, where existing methods struggle to trace final outcomes back to the specific actions that produced them [5]
  2. Advantage estimation among heterogeneous agents, which complicates joint training and leads to performance volatility [7]

MARSHAL Method Introduction
- MARSHAL builds on the Group Relative Policy Optimization (GRPO) architecture and introduces two key algorithmic improvements, a turn-level advantage estimator and agent-specific advantage normalization, to strengthen multi-agent reasoning (both ideas are sketched after the conclusion below) [12][14]
- The framework was trained and evaluated on six strategy games, three used for training and three held out for testing, covering both competitive and cooperative scenarios [12]

Core Experiments
- MARSHAL-trained expert agents achieved up to 28.7% higher win rates on the held-out test games [13][19]
- The model showed strong generalization beyond games, with accuracy improvements of 10.0% on AIME and 7.6% on GPQA across reasoning tasks [19][20]

Reasoning Mode Analysis
- Qualitative analysis revealed that game training fostered two emergent capabilities, Role-Awareness and Intent Recognition, which are crucial for decision-making in uncertain environments [22]
- Quantitative analysis indicated that MARSHAL reduced inter-agent misalignment by 11.5%, improving communication efficiency among agents [24]

Ablation Studies
- Self-play training outperformed training against fixed opponents: models trained against fixed opponents tended to overfit and performed poorly on the test games [26]
- Ablations confirmed the necessity of the Turn-level Advantage Estimator and Agent-specific Advantage Normalization, highlighting their importance for long-sequence decisions and for handling differing reward distributions [28]

Conclusion
- Through self-play in strategy games, the MARSHAL framework successfully enhances the reasoning capabilities of large language models in multi-agent systems, indicating potential for broader applications in complex multi-agent environments [31][34]
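
To make the first challenge, credit assignment in multi-round interactions, concrete, here is a minimal sketch assuming a single verifiable end-of-game reward that is propagated back to every decision turn with a discount factor. The article does not give MARSHAL's actual turn-level estimator; the function name, the discounting scheme, and the data layout below are illustrative assumptions.

```python
# Hedged sketch: spread one verifiable end-of-game reward over the turns of a
# multi-round interaction. MARSHAL's actual Turn-level Advantage Estimator is
# not specified in this summary; the discounting scheme below is only an
# illustrative assumption.

def turn_level_credits(final_reward: float, num_turns: int, gamma: float = 0.95) -> list[float]:
    """Assign each turn a discounted share of the final outcome.

    Later turns, which sit closer to the verifiable outcome, receive more
    credit than earlier ones; with gamma = 1.0 every turn is credited equally.
    """
    return [final_reward * gamma ** (num_turns - 1 - t) for t in range(num_turns)]


if __name__ == "__main__":
    # A 4-turn rollout that ends in a win (reward 1.0): credit decays toward
    # the earliest turns instead of being attributed to the sequence as a whole.
    print(turn_level_credits(1.0, num_turns=4))  # [0.857375, 0.9025, 0.95, 1.0]
```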
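
For the second challenge, the following sketch shows what group-relative (GRPO-style) advantage estimation with agent-specific normalization could look like: per-turn rewards are normalized within each agent role and turn position rather than across the whole rollout group, so agents with very different reward scales do not distort each other's updates. The `Turn` record, the bucketing scheme, and the role names are assumptions for illustration, not MARSHAL's published implementation.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class Turn:
    """One decision turn in one self-play rollout (illustrative record, not the paper's)."""
    rollout_id: int   # which rollout in the GRPO group this turn belongs to
    agent_id: str     # which heterogeneous agent acted (e.g. "hinter" vs. "guesser")
    turn_index: int   # position within the multi-round interaction
    reward: float     # per-turn credit, e.g. produced by the turn-level sketch above


def agent_specific_advantages(turns: list[Turn]) -> dict[tuple[int, str, int], float]:
    """Group-relative advantages normalized per (agent role, turn position).

    Vanilla GRPO normalizes rewards over the whole group of sampled rollouts;
    here each reward is compared only against rewards earned by the same agent
    role at the same turn depth, which is one way to handle the differing
    reward distributions of heterogeneous agents.
    """
    buckets: dict[tuple[str, int], list[Turn]] = defaultdict(list)
    for t in turns:
        buckets[(t.agent_id, t.turn_index)].append(t)

    advantages: dict[tuple[int, str, int], float] = {}
    for (agent_id, turn_index), bucket in buckets.items():
        rewards = [t.reward for t in bucket]
        mu, sigma = mean(rewards), pstdev(rewards) or 1.0  # guard against zero variance
        for t in bucket:
            advantages[(t.rollout_id, agent_id, turn_index)] = (t.reward - mu) / sigma
    return advantages


if __name__ == "__main__":
    # Two rollouts, two roles with very different reward scales: normalizing
    # within each role keeps the large-scale role from dominating the update.
    demo = [
        Turn(0, "guesser", 0, 1.0), Turn(1, "guesser", 0, 0.0),
        Turn(0, "hinter", 0, 10.0), Turn(1, "hinter", 0, 2.0),
    ]
    for key, adv in agent_specific_advantages(demo).items():
        print(key, round(adv, 3))
```

In this toy example both roles end up with advantages of +1 and -1 despite their raw rewards differing by an order of magnitude, which is the intuition behind normalizing per agent rather than across the whole group.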