Xiaomi's latest large model result! Luo Fuli makes an appearance
量子位· 2025-10-17 04:58
Core Viewpoint
- Xiaomi's latest AI research paper, co-authored with Peking University, introduces Rollout Routing Replay (R3), a method for improving the stability and efficiency of reinforcement learning for large models [2][7][49].

Group 1: Research Background
- The collaboration between Xiaomi's AI team and Peking University has produced notable advances in AI, particularly in reinforcement learning [2][4].
- The paper targets a known weakness of the Mixture of Experts (MoE) architecture: the routing mechanism can select different experts during inference (rollout) and training, and this mismatch destabilizes reinforcement learning [8][25].

Group 2: Methodology
- The proposed R3 method stabilizes training by locking the routing distribution during inference and replaying it during training, so that both phases activate the same experts for each token [28][30] (see the illustrative sketch after this summary).
- The method additionally introduces a routing mask that caches routing decisions alongside the context, improving computational efficiency [34][35].

Group 3: Experimental Results
- Experiments on the Qwen3-30B-A3B model show that R3 consistently outperforms the compared methods across metrics, indicating an overall performance gain [40][41].
- Training stability improves markedly: R3 maintains a smoother performance curve than traditional methods [43][46].

Group 4: Authors and Contributions
- The first author, Wenhan Ma, is a researcher on Xiaomi's LLM-Core team; the two corresponding authors are Luo Fuli and Professor Sui Zhifang of Peking University, both of whom have made notable contributions to the field [51][56][61].
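The sketch below is a minimal, hypothetical illustration of the replay idea summarized in Group 2, not Xiaomi's actual R3 implementation: expert selections made during rollout (inference) are cached and then replayed during the training forward pass, so both phases route each token to the same experts. The class name `ReplayableMoELayer`, the toy expert networks, and the choice to recompute gate weights over the replayed experts are all illustrative assumptions.

```python
# Hypothetical sketch of rollout routing replay for a toy MoE layer.
# Assumptions: a simple top-k linear router, linear experts, and gate weights
# recomputed from current logits restricted to the replayed experts.
from typing import Optional, Tuple

import torch
import torch.nn as nn
import torch.nn.functional as F


class ReplayableMoELayer(nn.Module):
    """Toy MoE layer whose expert selection can be recorded and replayed."""

    def __init__(self, d_model: int, n_experts: int, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            [nn.Linear(d_model, d_model) for _ in range(n_experts)]
        )

    def forward(
        self, x: torch.Tensor, replay_indices: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # x: [tokens, d_model]
        logits = self.router(x)  # [tokens, n_experts]
        if replay_indices is None:
            # Rollout: pick top-k experts normally and remember the choice.
            top_idx = logits.topk(self.top_k, dim=-1).indices
        else:
            # Training: replay the routing decision recorded during rollout,
            # so the experts used match those that produced the sampled tokens.
            top_idx = replay_indices
        # Gate weights from current logits, restricted to the selected experts.
        gate = F.softmax(logits.gather(-1, top_idx), dim=-1)  # [tokens, top_k]
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            idx_k = top_idx[:, k]
            for e, expert in enumerate(self.experts):
                mask = idx_k == e
                if mask.any():
                    out[mask] += gate[mask, k : k + 1] * expert(x[mask])
        return out, top_idx


if __name__ == "__main__":
    torch.manual_seed(0)
    layer = ReplayableMoELayer(d_model=16, n_experts=4, top_k=2)
    tokens = torch.randn(8, 16)

    # Rollout phase: run inference and cache the routing decisions
    # (the "routing mask" cached alongside the context).
    with torch.no_grad():
        _, cached_routing = layer(tokens)

    # Training phase: replay the cached routing so inference and training
    # activate the same experts for every token.
    out, used_routing = layer(tokens, replay_indices=cached_routing)
    assert torch.equal(cached_routing, used_routing)
    print("Replayed routing matches rollout routing; output shape:", tuple(out.shape))
```

In this toy setup the cached indices play the role of the routing mask described in the paper summary: they are stored once per token during rollout and reused during training, removing the possibility that numerical differences between the inference and training stacks change which experts fire.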