Agent Frameworks

SEAgent: Opening a New Era of GUI Agents That Self-Evolve from Real-World Experience
机器之心· 2025-08-17 04:28
Core Viewpoint
- The development of current Computer-Using Agents (CUA) relies heavily on expensive human-annotated data, which limits their use in novel or specialized software environments. To overcome this limitation, researchers from Shanghai Jiao Tong University and The Chinese University of Hong Kong propose SEAgent, a framework that lets agents learn and evolve autonomously through interaction with their environment, without human intervention [2][4].

Group 1: SEAgent Framework
- SEAgent's core innovations are a closed-loop autonomous-evolution framework, a deeply optimized evaluation model, and an efficient "specialist-to-generalist" integration strategy [2][5].
- SEAgent's autonomous evolution arises from the collaboration of three core components, which together form a sustainable, self-driven learning loop [5].

Group 2: Core Components
- The Curriculum Generator acts as a "mentor," automatically generating progressively harder exploration tasks based on the agent's current capabilities and maintaining a "software guide" that documents new functionality discovered during exploration [9].
- The Actor-CUA, the agent itself, executes the tasks generated by the Curriculum Generator in the software environment [9].
- The World State Model serves as the "judge," evaluating the agent's performance at each step and providing the feedback signals needed for learning, thus closing the evolution loop [9][10] (a minimal sketch of this loop follows this summary).

Group 3: Evaluation Model
- A precise "judge" is fundamental to autonomous evolution. Existing open-source vision-language models struggle to evaluate long sequences of agent operations, and their accuracy degrades as the history grows. To address this, a more robust evaluation model, the World State Model, was developed [10].
- The optimized World State Model significantly narrows the performance gap with commercial models such as GPT-4o, providing reliable and stable evaluation for the SEAgent framework [10].

Group 4: Specialist-to-Generalist Strategy
- The work explores building a "generalist" model that can operate across multiple software environments, finding that training a generalist directly in multi-software settings is less effective than training specialist models in single-software environments [13].
- A three-step "specialist-to-generalist" integration strategy is proposed: innovate the evaluation paradigm, distill high-quality data, and cultivate specialists before transitioning to a generalist model [14][15].

Group 5: Experimental Results
- The final "generalist" agent achieved an overall success rate of 34.5%, surpassing a directly trained generalist (30.6%) and exceeding the combined performance of all specialist models (32.2%), demonstrating the potential of the "specialist first, then generalist" approach [18].
- Rigorous ablation experiments confirm the necessity of the algorithm design: a high-quality World State Model is essential for effective learning, and exploration-based reinforcement learning (GRPO) significantly outperforms pure imitation [20].

Group 6: Author and Research Interests
- The first author, Sun Zeyi, is a joint doctoral student at Shanghai Jiao Tong University and the Shanghai Artificial Intelligence Laboratory, with publications at CVPR, ICCV, and NeurIPS and research interests in GUI agents, multimodal learning, and reinforcement learning [20].
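To make the interaction among the three components concrete, here is a minimal Python sketch of the closed evolution loop described above. All class names, method signatures, and the reward/skill bookkeeping are hypothetical illustrations of the idea, not SEAgent's actual code.

```python
# Minimal sketch of SEAgent-style self-evolution (hypothetical interfaces;
# class and method names are illustrative, not from the paper's code).

class CurriculumGenerator:
    """'Mentor': proposes progressively harder tasks and keeps a software guide."""
    def __init__(self):
        self.software_guide = []  # notes on functionality discovered so far

    def propose_task(self, skill_level):
        # In the real system a model drafts a task conditioned on the guide;
        # here we just return a placeholder description.
        return f"explore feature at difficulty {skill_level + 1}"

    def update_guide(self, trajectory):
        self.software_guide.append(trajectory["summary"])


class ActorCUA:
    """The GUI agent itself: executes a task in the software environment."""
    def execute(self, task):
        # Would drive the GUI step by step; returns an action trajectory.
        return {"task": task, "actions": ["click", "type"], "summary": "stub run"}


class WorldStateModel:
    """'Judge': scores each step of a trajectory and yields feedback signals."""
    def evaluate(self, trajectory):
        # The real model inspects screenshots per step; stub returns fixed rewards.
        return [1.0 for _ in trajectory["actions"]]


def self_evolve(num_rounds=3):
    curriculum, actor, judge = CurriculumGenerator(), ActorCUA(), WorldStateModel()
    skill = 0
    for _ in range(num_rounds):
        task = curriculum.propose_task(skill)
        trajectory = actor.execute(task)
        rewards = judge.evaluate(trajectory)   # feedback for RL (e.g. GRPO)
        curriculum.update_guide(trajectory)
        skill += 1                             # placeholder for a policy update
    return curriculum.software_guide


if __name__ == "__main__":
    print(self_evolve())
```

The key design point the sketch tries to capture is that no human appears anywhere in the loop: tasks, execution, and evaluation are all generated by models, so the agent can keep improving in unfamiliar software.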
Tencent AI Lab Open-Sources a Reproducible Deep Research Agent That Minimizes External Dependencies
量子位· 2025-08-06 05:56
Core Insights
- The article discusses the transformative potential of deep research agents powered by large language models (LLMs) and vision-language models (VLMs) for knowledge discovery and problem solving [1].
- It highlights a limitation of existing open-source agent frameworks: reliance on paid tools, which restricts reproducibility and generality [2].

Group 1: Cognitive Kernel-Pro Framework
- Tencent AI Lab has released Cognitive Kernel-Pro, a fully open-source, multi-module, hierarchical agent framework that offers a breakthrough solution for developing and training deep research agents [4].
- Cognitive Kernel-Pro outperforms the free, open-source framework SmolAgents on the GAIA benchmark suite, and its 8B model surpasses WebDancer and WebSailor-7B on GAIA-text [5].
- The framework's technical report and code are available on GitHub, promoting community engagement and reproducibility [8].

Group 2: Core Design Features
- Modular two-layer architecture: a main agent handles task decomposition while multiple sub-agents focus on specific tasks, ensuring modular independence and scalability [11].
- A "Progress State" mechanism provides structured state management, improving efficiency on complex tasks by tracking completed steps and key information [11].
- Standardized task interfaces let the main agent and sub-agents communicate through simple text interfaces, easing collaboration and debugging [11].
- Reflection and voting mechanisms improve task-completion quality, particularly in high-variability tasks such as web browsing [11] (a minimal sketch of this two-layer design follows this summary).

Group 3: Innovative Training Methods
- Cognitive Kernel-Pro covers a complete training pipeline spanning web navigation, file processing, code generation, and reasoning, with a focus on high-quality data construction [16][17].
- Training data is strengthened with verifiable query-answer pairs and diverse synthetic queries generated from Persona Hub, improving data quality and robustness [17].
- Existing datasets are refined to match agent task formats, keeping them relevant to real-world applications [17].

Group 4: Performance Advantages
- Cognitive Kernel-Pro shows superior performance in web information retrieval, file processing, and complex reasoning, closely approaching agent frameworks that depend on paid tools [19][20].
- The framework emphasizes the inherent capabilities of LLMs and VLMs, minimizing external dependencies and achieving truly open-source status [20].
- Comparisons show that Cognitive Kernel-Pro excels in both functionality and open-source accessibility relative to existing frameworks [20][22].

Group 5: Future Directions
- The team plans to focus on distilling reflection capabilities into a unified agent base model in future work [26].
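As a rough illustration of the two-layer design, the Python sketch below shows a main agent that decomposes a task, delegates subtasks to sub-agents over plain-text interfaces, records a "Progress State," and applies simple voting. The function names, the toy decomposition, and the data structures are assumptions made for illustration; they are not Cognitive Kernel-Pro's actual API.

```python
# Minimal sketch of a main-agent / sub-agent hierarchy with a Progress State
# and voting, in the spirit of Cognitive Kernel-Pro. All names and interfaces
# here are hypothetical, not the framework's real code.

from collections import Counter


def web_subagent(subtask: str) -> str:
    # Placeholder for a browsing sub-agent; would call an LLM plus browser tools.
    return f"web-result for: {subtask}"


def file_subagent(subtask: str) -> str:
    # Placeholder for a file-processing sub-agent.
    return f"file-result for: {subtask}"


SUB_AGENTS = {"web": web_subagent, "file": file_subagent}


def main_agent(task: str, votes: int = 3) -> dict:
    """Decompose a task, delegate subtasks over a text interface,
    track progress, and keep the majority answer across repeated runs."""
    progress_state = {"task": task, "completed": [], "notes": []}

    # Toy decomposition: in the real system an LLM plans these steps.
    plan = [("web", f"search background for '{task}'"),
            ("file", f"summarize downloaded notes about '{task}'")]

    for agent_name, subtask in plan:
        result = SUB_AGENTS[agent_name](subtask)      # text in, text out
        progress_state["completed"].append(subtask)
        progress_state["notes"].append(result)

    # Reflection/voting: repeat the final answer step and keep the most
    # common output to damp high-variance behaviour (e.g. web browsing).
    candidates = [f"answer drawn from {len(progress_state['notes'])} notes"
                  for _ in range(votes)]
    progress_state["answer"] = Counter(candidates).most_common(1)[0][0]
    return progress_state


if __name__ == "__main__":
    print(main_agent("deep research question"))
```

The text-only interface between layers is what keeps the design debuggable: every hand-off between main agent and sub-agent is a readable string rather than an opaque object.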
o3-pro Clears Sokoban: Classic Nostalgic Games Become a New Benchmark for Large Models
量子位· 2025-06-16 04:50
Core Viewpoint
- Classic nostalgic games such as Sokoban and Tetris have become benchmarks for evaluating large models, and the o3-pro model recently pushed past previous performance limits in these games [1][2][6].

Group 1: Benchmark Performance
- o3-pro completed all levels of Sokoban, whereas the benchmark's previous ceiling was the sixth level [3][8].
- Compared with the previous state-of-the-art model, o3, o3-pro roughly doubled the score [3][10].
- o3-pro's moves are notably slow, taking several minutes each [17].
- In Tetris, the score is the number of blocks placed plus ten times the number of lines cleared, accumulated until the game ends [13][22] (a small scoring sketch follows this summary).

Group 2: Game Characteristics and Evaluation
- The Lmgame benchmark includes several games, such as 2048, Candy Crush, Super Mario Bros, and Phoenix Wright, each with its own evaluation criteria [18][24].
- 2048 is scored by the total value of merged tiles, Candy Crush by the total candies eliminated within a fixed number of rounds, and Super Mario Bros by the total distance moved [6][24].
- Evaluation does not treat time as a factor, focusing instead on game-specific performance metrics [22][24].

Group 3: Model Development and Support
- The project is developed by the Hao AI Lab at UCSD, which is affiliated with the machine learning systems and NLP labs [28].
- The lab has released multiple open-source projects, with FastVideo being the most starred on GitHub [32].
- The lab has received funding from Google and NVIDIA, and NVIDIA donated a DGX B200 system to support the research [34].
- The benchmark is open source, so interested parties can download it and test their own models [23].
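To make the Tetris scoring rule concrete, here is a tiny Python sketch of the formula described above (blocks placed plus ten times lines cleared, tallied until game over). The event-log format is made up for illustration and is not the Lmgame benchmark's actual data structure.

```python
# Toy illustration of the Tetris scoring rule described above:
# score = blocks placed + 10 * lines cleared, accumulated until game over.

def tetris_score(events):
    """events: list of ("place",) or ("clear", n_lines) tuples until game over."""
    placed = sum(1 for e in events if e[0] == "place")
    cleared = sum(e[1] for e in events if e[0] == "clear")
    return placed + 10 * cleared


if __name__ == "__main__":
    # 5 pieces placed, then 2 lines cleared at once -> 5 + 10 * 2 = 25
    log = [("place",)] * 5 + [("clear", 2)]
    print(tetris_score(log))  # 25
```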