Moonshot AI (月之暗面) "trains up" its strongest agent yet, setting a new SOTA on "Humanity's Last Exam"
机器之心·2025-06-21 05:06

Core Viewpoint
- Kimi-Researcher is an autonomous agent trained with end-to-end reinforcement learning. It shows significant improvements in multi-step reasoning and search, and reaches state-of-the-art performance on several benchmarks [2][4][3].

Group 1: Performance Metrics
- On "Humanity's Last Exam," Kimi-Researcher achieved a Pass@1 score of 26.9% and a Pass@4 accuracy of 40.17%, a substantial improvement over its initial score of 8.6% [3][4].
- On the xbench-DeepSearch subtask, Kimi-Researcher reached an average Pass@1 of 69%, outperforming other models equipped with search tools [4].

Group 2: Training Methodology
- The agent is trained with end-to-end reinforcement learning, so a single model learns planning, perception, and tool use without hand-written rules [14][24].
- Training relies on a reward computed from the final outcome, which keeps the preference signal consistent even in dynamic environments [24].

Group 3: Context Management and Efficiency
- A context-management mechanism lets the agent retain key information while discarding irrelevant documents, enabling more than 50 iterations within a single trajectory (sketched below) [27][30].
- Training efficiency is improved by introducing a gamma decay factor on the reward, which encourages the agent to discover shorter, more efficient exploration paths (sketched below) [25].

Group 4: Tool Utilization and Task Design
- Training tasks are designed so that specific tools are required to solve them, teaching the agent when and how to use multiple tools effectively in complex environments [21].
- Kimi-Researcher can conduct academic research, legal and policy analysis, clinical evidence review, and corporate financial analysis, demonstrating its versatility [11][8].

Group 5: Infrastructure and Scalability
- A scalable, asynchronous rollout system improves the efficiency of agent interactions and reward calculation, significantly improving operational performance (sketched below) [34][32].
- The infrastructure supports dynamic resource allocation and fault tolerance, ensuring high availability in production environments [34].
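To make the gamma decay idea in Group 3 concrete, here is a minimal sketch of how an outcome-only reward can be discounted by trajectory length so that shorter paths to the same correct answer earn more. The function name, the gamma value, and the per-step spreading scheme are illustrative assumptions, not Kimi-Researcher's actual implementation.

```python
from typing import List


def discounted_outcome_returns(outcome_reward: float,
                               num_steps: int,
                               gamma: float = 0.99) -> List[float]:
    """Spread a single end-of-trajectory reward over every step.

    The step-t return is outcome_reward * gamma**(num_steps - t), so a
    trajectory that reaches the same correct answer in fewer steps earns a
    larger per-step return, nudging the policy toward shorter, more
    efficient exploration paths.
    """
    return [outcome_reward * gamma ** (num_steps - t) for t in range(1, num_steps + 1)]


if __name__ == "__main__":
    # Two trajectories that both end with a correct answer (reward 1.0):
    short = discounted_outcome_returns(1.0, num_steps=10)
    long = discounted_outcome_returns(1.0, num_steps=50)
    # The shorter trajectory keeps more of the reward at every step.
    print(f"short path, first-step return: {short[0]:.3f}")  # ~0.914
    print(f"long  path, first-step return: {long[0]:.3f}")   # ~0.611
```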
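The context-management mechanism in Group 3 can be pictured as a pruning step: when the working context grows too large, pinned items (the plan, key findings) are kept and the remaining token budget is filled with the most relevant documents, while everything else is discarded so the agent can keep iterating past 50 steps. The data structures, relevance scores, and budgeting heuristic below are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ContextItem:
    text: str
    tokens: int
    relevance: float  # e.g. produced by the model or a retrieval scorer


def prune_context(items: List[ContextItem],
                  token_budget: int,
                  always_keep: Callable[[ContextItem], bool]) -> List[ContextItem]:
    """Keep pinned items, then fill the remaining budget with the most
    relevant documents; everything else is dropped from the context."""
    pinned = [it for it in items if always_keep(it)]
    used = sum(it.tokens for it in pinned)

    kept = list(pinned)
    for it in sorted((it for it in items if not always_keep(it)),
                     key=lambda it: it.relevance, reverse=True):
        if used + it.tokens <= token_budget:
            kept.append(it)
            used += it.tokens
    return kept


if __name__ == "__main__":
    items = [
        ContextItem("research plan", tokens=200, relevance=1.0),
        ContextItem("key finding: dataset X", tokens=300, relevance=0.9),
        ContextItem("irrelevant blog post", tokens=4000, relevance=0.1),
        ContextItem("useful paper abstract", tokens=800, relevance=0.7),
    ]
    kept = prune_context(items, token_budget=1500,
                         always_keep=lambda it: it.relevance >= 1.0)
    print([it.text for it in kept])  # the blog post is discarded
```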
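Finally, the asynchronous rollout system in Group 5 can be sketched as a pool of workers generating trajectories concurrently and pushing each finished trajectory to a results queue, so slow tool or search calls in one episode do not block reward computation for the others. The episode, reward, and queue interfaces below are placeholders assumed for this sketch, not the production infrastructure.

```python
import asyncio
import random
from typing import Any, Dict


async def run_episode(task_id: int) -> Dict[str, Any]:
    """Stand-in for one agent trajectory (tool calls, searches, reasoning)."""
    await asyncio.sleep(random.uniform(0.1, 0.5))  # simulated tool / search latency
    return {"task_id": task_id, "steps": random.randint(5, 50)}


async def rollout_worker(tasks: asyncio.Queue, results: asyncio.Queue) -> None:
    """Pull tasks, run episodes, attach a placeholder outcome reward."""
    while True:
        task_id = await tasks.get()
        traj = await run_episode(task_id)
        traj["reward"] = 1.0 if traj["steps"] < 30 else 0.5  # placeholder outcome reward
        await results.put(traj)
        tasks.task_done()


async def main(num_tasks: int = 8, num_workers: int = 4) -> None:
    tasks: asyncio.Queue = asyncio.Queue()
    results: asyncio.Queue = asyncio.Queue()
    for i in range(num_tasks):
        tasks.put_nowait(i)

    workers = [asyncio.create_task(rollout_worker(tasks, results))
               for _ in range(num_workers)]
    await tasks.join()  # wait until every rollout has finished
    for w in workers:
        w.cancel()

    while not results.empty():
        print(results.get_nowait())


if __name__ == "__main__":
    asyncio.run(main())
```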