Core Insights

- The article discusses the implementation of the Think-In-Games (TiG) framework, which lets large language models play Honor of Kings while learning in real time, bridging the gap between decision-making and action [1][3][4].

Group 1: TiG Framework Overview

- TiG reframes reinforcement-learning decision-making as a language modeling task, enabling models to generate language-guided strategies and refine them through online reinforcement learning [3][4] (a minimal serialization sketch follows this section).
- The framework teaches large language models macro-level reasoning, focusing on long-term goals and team coordination rather than micro-level actions [6][9].
- The model acts more like a strategic coach than a professional player, expressing its decisions as text and selecting macro actions based on the game state [7][9].

Group 2: Training Methodology

- Training follows a multi-stage approach that combines supervised fine-tuning (SFT) and reinforcement learning (RL) to strengthen the model's capabilities [12][16] (a single-step SFT sketch appears below).
- The research team used a "relabeling algorithm" to tag each game state with the most critical macro action, providing a robust supervision signal for subsequent training [9][11] (sketched below).
- The Group Relative Policy Optimization (GRPO) algorithm maximizes the advantage of generated content while limiting divergence from the reference model [9][11] (an objective sketch closes the section).

Group 3: Experimental Results

- Combining SFT and GRPO significantly improves model performance: Qwen-2.5-32B's accuracy rises from 66.67% to 86.84% after GRPO is applied [14][15].
- Qwen-3-14B reaches 90.91% accuracy after training with SFT and GRPO [2][15].
- TiG performs competitively against traditional reinforcement learning methods while requiring substantially less data and compute [17].
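To make Group 1's "decision-making as language modeling" idea concrete, here is a minimal sketch of how a structured game state could be serialized into a prompt and a macro action parsed back out of the model's reply. The field names, the macro-action vocabulary, and the calling interface are illustrative assumptions, not the paper's actual schema.

```python
# Minimal sketch: macro decision-making cast as text generation.
# The state fields and macro-action vocabulary are illustrative
# assumptions, not the schema used by the TiG authors.

MACRO_ACTIONS = ["push_top", "push_mid", "push_bottom",
                 "take_dragon", "defend_base", "group_mid"]

def state_to_prompt(state: dict) -> str:
    """Serialize a structured game state into a natural-language prompt."""
    return (
        f"Game time: {state['time_s']}s. "
        f"Gold difference: {state['gold_diff']}. "
        f"Allied towers: {state['ally_towers']}, "
        f"enemy towers: {state['enemy_towers']}.\n"
        f"Pick ONE macro action from {MACRO_ACTIONS}, reasoning first."
    )

def parse_macro_action(completion: str) -> str | None:
    """Return the first recognized macro action mentioned in the reply."""
    for action in MACRO_ACTIONS:
        if action in completion:
            return action
    return None  # caller falls back to a default policy on parse failure
```

A calling loop would build the prompt with `state_to_prompt(state)`, generate a completion with whatever serving stack is in use, and execute the result of `parse_macro_action(completion)` in the game client.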
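The summary does not spell out the relabeling rule, so the sketch below assumes one plausible scheme: each state is tagged with the highest-priority macro action that actually occurs within a short look-ahead window of the recorded trajectory. The priority table and window size are assumptions for illustration only.

```python
# Hedged sketch of a relabeling pass over a recorded trajectory.
# Priority ordering and window length are illustrative assumptions.

PRIORITY = {"defend_base": 4, "take_dragon": 3, "push_mid": 2,
            "push_top": 1, "push_bottom": 1, "group_mid": 0}

def relabel(trajectory: list[dict], window: int = 5) -> list[tuple[dict, str]]:
    """Tag each state with the most critical action seen in the next
    `window` steps; ties resolve to the earliest occurrence."""
    labeled = []
    for i in range(len(trajectory)):
        lookahead = trajectory[i : i + window]
        best = max(lookahead, key=lambda step: PRIORITY.get(step["action"], -1))
        labeled.append((trajectory[i]["state"], best["action"]))
    return labeled
```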
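For the SFT stage, one training step of standard next-token cross-entropy on a relabeled (state, macro action) pair would look roughly like this. The Hugging Face-style `model(ids, labels=ids)` interface is an assumption of the sketch, and masking prompt tokens out of the loss is omitted for brevity.

```python
# Sketch of one SFT step: next-token cross-entropy on a prompt built
# from a relabeled state, with the macro action as the target text.
# Assumes a Hugging Face-style causal LM and tokenizer.

def sft_step(model, tokenizer, optimizer,
             state_prompt: str, target_action: str) -> float:
    text = state_prompt + "\nAnswer: " + target_action
    ids = tokenizer(text, return_tensors="pt").input_ids
    out = model(ids, labels=ids)  # causal LMs shift labels internally
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return out.loss.item()
```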
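GRPO itself samples a group of responses per prompt, normalizes their rewards within the group to obtain advantages, and penalizes KL divergence from a frozen reference model. The sketch below is a simplified per-response version (the full objective operates per token); the group size, clip range, and KL weight shown are illustrative, not the paper's settings.

```python
# Simplified GRPO loss for one prompt: group-relative advantages,
# PPO-style clipping, and a KL penalty toward the reference model.
# Hyperparameters here are illustrative, not the paper's settings.

import torch

def grpo_loss(logp_new: torch.Tensor,  # (G,) log-probs, current policy
              logp_old: torch.Tensor,  # (G,) log-probs, sampling policy
              logp_ref: torch.Tensor,  # (G,) log-probs, frozen reference
              rewards: torch.Tensor,   # (G,) scalar rewards for G responses
              clip_eps: float = 0.2,
              kl_coef: float = 0.04) -> torch.Tensor:
    # Group-relative advantage: normalize rewards within the group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # Clipped importance-sampling surrogate.
    ratio = torch.exp(logp_new - logp_old)
    surrogate = torch.minimum(
        ratio * adv,
        torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv)

    # Unbiased KL estimator keeping the policy near the reference model.
    log_r = logp_ref - logp_new
    kl = torch.exp(log_r) - log_r - 1

    return -(surrogate - kl_coef * kl).mean()
```

Maximizing the clipped surrogate while subtracting the KL term is what "maximizing the advantages of generated content while limiting divergence from the reference model" cashes out to in practice.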
Large Language Models Are Now Playing Honor of Kings