机器之心
Approaching the strongest closed-source models: Tongyi Lab open-sources Mobile-Agent-v3, setting new SOTA on 10 GUI benchmarks
机器之心· 2025-09-02 03:44
Core Viewpoint
- The article highlights the launch of GUI-Owl and Mobile-Agent-v3, advanced open-source models for GUI automation, showcasing superior performance compared to existing models and emphasizing their capabilities in various environments [1][29].

Group 1: Key Achievements
- GUI-Owl has achieved state-of-the-art (SOTA) performance on both Android and desktop platforms, with the 32B model surpassing top closed-source models in multiple evaluations [21][29].
- The models are designed to operate in a cloud environment, allowing for dynamic task execution and data collection across multiple operating systems, including Android, Ubuntu, macOS, and Windows [11][29].

Group 2: Technical Innovations
- The system employs a self-evolving data production chain that minimizes human involvement in generating high-quality training data, allowing the models to iteratively optimize themselves [11][14].
- GUI-Owl's capabilities include advanced UI element grounding, long task planning, and robust reasoning, enabling it to understand and execute complex tasks effectively [16][20].

Group 3: Reinforcement Learning Framework
- A scalable reinforcement learning (RL) system has been developed to enhance the model's stability and adaptability in real-world environments, allowing it to learn continuously from its interactions [22][26].
- The Trajectory-aware Relative Policy Optimization (TRPO) algorithm addresses the challenges of sparse and delayed reward signals in GUI automation tasks, improving learning efficiency [26].

Group 4: Conclusion
- The release of GUI-Owl and Mobile-Agent-v3 represents a significant advancement in open-source GUI automation, providing a powerful tool for various applications while reducing deployment and resource costs [29].
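The trajectory-level credit assignment behind TRPO can be pictured with a toy sketch (the z-score normalization and function names here are illustrative assumptions, not the paper's exact formulation): each trajectory receives one scalar reward, normalized against the other rollouts for the same task, and that value is broadcast to every step to cope with sparse, delayed rewards.

```python
def trajectory_advantages(group_rewards):
    """Toy trajectory-level relative advantage: z-score each
    trajectory's scalar reward against the other trajectories
    sampled for the same task."""
    n = len(group_rewards)
    mean = sum(group_rewards) / n
    std = (sum((r - mean) ** 2 for r in group_rewards) / n) ** 0.5
    if std == 0:  # all rollouts scored the same: no learning signal
        return [0.0] * n
    return [(r - mean) / std for r in group_rewards]

def broadcast_to_steps(advantage, num_steps):
    # Sparse, delayed reward: every step in a trajectory shares
    # the trajectory-level advantage as its learning signal.
    return [advantage] * num_steps

# Two successful trajectories (reward 1.0) and two failures (0.0):
advs = trajectory_advantages([1.0, 1.0, 0.0, 0.0])
```

Successful trajectories get a positive advantage at every step, failed ones a negative advantage, without any per-step reward model.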
This time AI really reads web pages differently: Google Gemini unlocks a new "deep web page parsing" skill
机器之心· 2025-09-02 03:44
Core Viewpoint
- Google is returning to its core business of search by introducing the Gemini API's URL Context feature, which allows AI to "see" web content like a human [1].

Group 1: URL Context Functionality
- The URL Context feature enables the Gemini model to access and process content from URLs, including web pages, PDFs, and images, with a content limit of up to 34MB [1][5].
- Unlike traditional methods where AI reads only summaries or fragments of a webpage, URL Context performs deep, complete document parsing, understanding the entire structure and content [5][6].
- The feature supports various file formats, including PDF, PNG, JPEG, HTML, JSON, and CSV, enhancing its versatility [7].

Group 2: Comparison with RAG
- URL Context grounding is a significant advance over the traditional Retrieval-Augmented Generation (RAG) pipeline, which involves multiple complex steps such as content extraction, chunking, vectorization, and storage [11][12].
- The new method simplifies the process, allowing developers to achieve accurate results with minimal code and eliminating the need for extensive data-processing pipelines [13][14].
- URL Context can accurately extract specific data from documents, such as financial figures from a PDF, which summaries alone cannot provide [14].

Group 3: Operational Mechanism
- URL Context uses a two-step retrieval process to balance speed, cost, and access to the latest data, first attempting to retrieve content from an internal index cache [25].
- If the URL is not cached, it performs real-time scraping to obtain the content [25].
- The pricing model is straightforward, charging by the number of tokens processed from the retrieved content, which encourages developers to provide precise information sources [27].

Group 4: Limitations and Industry Trends
- URL Context has limitations: it cannot access content behind paywalls or specialized sources such as YouTube videos, and it can process at most 20 URLs per request [29].
- The emergence of URL Context reflects a trend of foundation models absorbing external capabilities, reducing the complexity previously handled by application developers [27].
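The two-step retrieval described in Group 3 follows a common cache-first pattern; a minimal sketch (the cache structure and fetch callback are illustrative assumptions, not Gemini's internals):

```python
def fetch_url_content(url, index_cache, live_fetch):
    """Two-step retrieval: serve from an internal index cache when
    possible; fall back to real-time scraping otherwise."""
    if url in index_cache:           # step 1: cached copy (fast, cheap)
        return index_cache[url], "cache"
    content = live_fetch(url)        # step 2: live fetch (fresh, slower)
    index_cache[url] = content       # populate the cache for next time
    return content, "live"

cache = {"https://example.com/a": "<html>cached copy</html>"}
hit, src1 = fetch_url_content("https://example.com/a", cache,
                              lambda u: "never called")
miss, src2 = fetch_url_content("https://example.com/b", cache,
                               lambda u: "freshly scraped copy")
```

Token-based pricing then applies to whichever copy is returned, cached or live.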
14B beats 671B! Microsoft's rStar2-Agent surpasses DeepSeek-R1 in mathematical reasoning
机器之心· 2025-09-02 01:27
Core Viewpoint
- The article discusses advances in large language models (LLMs) through rStar2-Agent, a powerful agentic reinforcement learning method developed by Microsoft Research that enhances reasoning capabilities and performance on mathematical reasoning tasks.

Group 1: Model Development and Innovations
- The rStar2-Agent model utilizes test-time scaling to enhance reasoning capabilities, allowing for longer and smarter thinking processes through the integration of advanced cognitive abilities and tool interactions [1][2].
- The model was trained with a 14-billion-parameter architecture, achieving performance comparable to or exceeding that of much larger models such as DeepSeek-R1, which has 671 billion parameters [2][25].
- The training infrastructure developed for rStar2-Agent can handle 45,000 concurrent tool calls with an average feedback execution time of just 0.3 seconds, significantly improving training efficiency [14][13].

Group 2: Training Methodology
- The team introduced a novel training scheme that begins with a non-reasoning supervised fine-tuning (SFT) phase focused on general instruction following and tool usage, which helps avoid overfitting and keeps initial responses short [21][19].
- The GRPO-RoC method was implemented to improve the efficiency of agentic reinforcement learning in the coding environment, handling environment noise better and improving the quality of training trajectories [19][18].
- The model achieved state-of-the-art mathematical reasoning performance with only 510 reinforcement learning steps, demonstrating exceptional training efficiency [23][25].

Group 3: Performance Metrics
- rStar2-Agent-14B achieved 80.6% accuracy on the AIME24 benchmark, outperforming o3-mini, DeepSeek-R1, and Claude Opus 4.0 by margins of 1.0%, 0.8%, and 3.6% respectively [26].
- The model exhibited strong generalization beyond mathematics, despite being trained primarily on mathematical tasks [27].
- rStar2-Agent-14B produced shorter average responses than larger models, indicating a more efficient reasoning process [29].
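The asymmetric filtering idea in GRPO-RoC (resample-on-correct) can be sketched as follows; the field names and the "fewest tool errors" quality criterion are assumptions based on the description above, not the paper's exact implementation. The intuition: oversample rollouts, keep failures at random to preserve diverse error modes, but keep only the cleanest successes so positive trajectories teach high-quality tool use.

```python
import random

def resample_on_correct(rollouts, keep):
    """Toy GRPO-RoC-style downsampling of an oversampled batch:
    retain `keep` successes with the fewest tool errors and `keep`
    randomly chosen failures."""
    successes = [r for r in rollouts if r["reward"] > 0]
    failures = [r for r in rollouts if r["reward"] <= 0]
    best = sorted(successes, key=lambda r: r["tool_errors"])[:keep]
    sampled_failures = random.sample(failures, min(keep, len(failures)))
    return best + sampled_failures

batch = [
    {"reward": 1, "tool_errors": 0},   # clean success: kept
    {"reward": 1, "tool_errors": 3},   # noisy success: discarded
    {"reward": 0, "tool_errors": 1},   # failures: sampled uniformly
    {"reward": 0, "tool_errors": 2},
]
kept = resample_on_correct(batch, keep=1)
```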
Self-Search Reinforcement Learning (SSRL): agentic RL's Sim2Real moment
机器之心· 2025-09-02 01:27
Core Insights
- The article discusses the development and effectiveness of SSRL (Self-Search Reinforcement Learning) in improving the training efficiency and stability of Search Agents built on large language models (LLMs) [6][28].
- SSRL demonstrates superior performance over traditional methods that rely on external search engines, achieving effective transfer from simulation to real-world applications (Sim2Real) [6][28].

Group 1
- SSRL utilizes structured prompts and format rewards to effectively elicit world knowledge from the model itself, leading to improved performance across various benchmarks and reduced hallucination [2][6].
- The research highlights the high costs and inefficiencies of current RL training methods for Search Agents, which include both full-real and semi-real search approaches [7][13].
- SSRL raises training efficiency by an estimated 5.6x while training rewards continue to increase without collapse [31][32].

Group 2
- Experiments show that models trained with SSRL outperform those relying on external engines, particularly in real-world search scenarios, indicating the importance of integrating real-world knowledge [28][31].
- The findings suggest that combining self-generated knowledge with real-world knowledge can enhance model performance, particularly through entropy-guided search strategies [34].
- Integrating SSRL with TTRL (Test-Time Reinforcement Learning) improves generalization and effectiveness, achieving up to a 67% performance increase on certain tasks [38][39].
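A toy version of the format reward that enforces SSRL-style structured outputs might look like this (the exact tag names `<search>`, `<information>`, `<answer>` are assumptions for illustration):

```python
import re

def format_reward(output):
    """Toy SSRL-style format reward: 1.0 if the rollout contains a
    self-generated search query, a retrieved-information block, and
    exactly one final answer block; 0.0 otherwise."""
    has_search = re.search(r"<search>.+?</search>", output, re.S)
    has_info = re.search(r"<information>.+?</information>", output, re.S)
    answers = re.findall(r"<answer>(.+?)</answer>", output, re.S)
    return 1.0 if has_search and has_info and len(answers) == 1 else 0.0

good = ("<search>capital of France</search>"
        "<information>Paris is the capital of France.</information>"
        "<answer>Paris</answer>")
```

During training the model fills the information block from its own parametric knowledge, so no external search engine is called.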
Back to school: your first lesson for getting started in AI
机器之心· 2025-09-01 08:46
Core Viewpoint
- The article emphasizes the importance of understanding AI and its underlying principles, suggesting that individuals should start their journey into AI by grasping fundamental concepts and practical skills.

Group 1: Understanding AI
- AI is defined through various learning methods, including supervised learning, unsupervised learning, and reinforcement learning, which allow machines to learn from data without rigid programming rules [9][11][12].
- The core idea of modern AI revolves around machine learning, particularly deep learning, which enables machines to learn from vast amounts of data and make predictions [12].

Group 2: Essential Skills for AI
- Three essential skills for entering the AI field are mathematics, programming, and practical experience. Mathematics provides the foundational understanding, while programming, particularly in Python, is crucial for implementing AI concepts [13][19].
- Key mathematical areas include linear algebra, probability and statistics, and calculus, which are vital for understanding AI algorithms and models [13].

Group 3: Practical Application and Tools
- Python is highlighted as the primary programming language for AI due to its simplicity and extensive ecosystem, including libraries like NumPy, Pandas, Scikit-learn, TensorFlow, and PyTorch [20][21].
- Engaging in hands-on projects, such as data analysis or machine learning tasks, is encouraged to solidify understanding and build a portfolio [27][46].

Group 4: Career Opportunities in AI
- Various career paths in AI include machine learning engineers, data scientists, and algorithm researchers, each focusing on different aspects of AI development and application [38][40].
- The article suggests that AI skills can enhance various fields, creating opportunities for interdisciplinary applications, such as in finance, healthcare, and the arts [41][43].

Group 5: Challenges and Future Directions
- The rapid evolution of AI technology presents challenges, including the need for continuous learning and adaptation to new developments [34][37].
- The article concludes by encouraging individuals to embrace uncertainty and find their passion within the AI landscape, highlighting the importance of human creativity and empathy in the technological realm [71][73].
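The linear algebra and calculus skills listed above come together in the most common first exercise, fitting a line by gradient descent; a standard textbook sketch in plain Python, independent of the article:

```python
def fit_line(xs, ys, lr=0.01, steps=2000):
    """Fit y = w*x + b by gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradients of MSE = (1/n) * sum((w*x + b - y)^2)
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Data generated from y = 2x + 1; gradient descent should recover it.
w, b = fit_line([0, 1, 2, 3, 4], [1, 3, 5, 7, 9])
```

The same update rule, scaled up with libraries like NumPy or PyTorch, underlies the deep learning training mentioned in Group 1.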
OpenAI researcher: Intro to AI courses are stuck 15 years in the past; undergrads' first choice should be Intro to Machine Learning
机器之心· 2025-09-01 08:46
Core Viewpoint
- The article emphasizes the importance of selecting the right introductory course in artificial intelligence (AI) for beginners, suggesting that "Introduction to Machine Learning" should be prioritized over "Introduction to AI" due to the outdated content of the latter [2][3].

Group 1: Course Recommendations
- Noam Brown, a researcher from OpenAI, advises undergraduate students interested in AI to be cautious and not to choose "Introduction to AI" as their first course [2].
- The article highlights that many universities' "Introduction to AI" courses have not evolved significantly over the past 15 years, often lacking comprehensive coverage of machine learning topics [3].
- A well-structured introductory course should ideally include topics such as linear regression, gradient descent, backpropagation, and reinforcement learning [3].

Group 2: Course Content Comparison
- "Introduction to AI" often covers traditional topics like rule-based systems and expert systems, while "Introduction to Machine Learning" focuses on modern AI technologies, including linear regression, neural networks, and deep learning [6].
- The renowned Stanford course "CS229: Machine Learning", taught by Andrew Ng, covers supervised learning, unsupervised learning, generative models, and foundational deep learning concepts [6].

Group 3: Industry Relevance
- The article notes that most breakthroughs in AI today stem from machine learning and deep learning, rather than the older topics covered in traditional AI courses [11].
- There is a growing sentiment that students should focus on practical skills like prompt engineering and programming to navigate the evolving AI landscape effectively [11].
The fast/slow thinking switching that DeepSeek and GPT-5 are exploring now has a smarter version, and it's multimodal
机器之心· 2025-09-01 06:46
Core Insights
- The article discusses the development of the R-4B multimodal large model by Tencent and the Institute of Automation, Chinese Academy of Sciences, which addresses the "overthinking" dilemma in AI models by introducing an adaptive thinking mechanism [3][5][10].

Group 1: Model Development and Performance
- R-4B utilizes an "auto-thinking" mechanism that allows the AI to switch between direct responses for simple questions and deep reasoning for complex problems, optimizing accuracy while minimizing computational costs [5][21].
- The model has set a new performance benchmark among 4B-scale multimodal models, outperforming larger models like Keye-VL-8B and Kimi-VL-A3B-Thinking-2506 in various evaluation metrics [7][24].
- R-4B achieved top rankings on the OpenCompass multimodal academic leaderboard, specifically ranking first among multimodal models under 20B in size [10][12].

Group 2: Training Methodology
- The core innovation of R-4B lies in its unique two-stage training strategy, which includes bi-mode annealing to teach the model both thinking and non-thinking capabilities [16][18].
- The model's training involves a mix of data types, where it learns to respond directly to simple queries and engage in detailed reasoning for complex tasks, laying a solid foundation for adaptive thinking [18][22].
- The Bi-mode Policy Optimization (BPO) reinforcement learning algorithm allows the model to learn when to switch thinking modes without relying on specifically designed reward functions [18][24].

Group 3: Applications and Future Prospects
- R-4B's adaptive thinking capability enhances automation efficiency in various applications, such as document content extraction and scientific research, where it can analyze complex data relationships [27][29].
- The model is designed for deployment on consumer-grade devices, making it suitable for low-power scenarios like smart homes and instant Q&A systems [12][29].
- The lightweight and intelligent design of R-4B contributes to sustainable development in AI, addressing the rising costs of computation and reasoning [33][34].
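One way to picture bi-mode training like BPO is to sample rollouts in both thinking and non-thinking modes for every query, so the reward signal itself teaches the model when deep reasoning pays off. This is a loose sketch under that assumption, not R-4B's actual algorithm; the mode labels and toy reward are invented for illustration:

```python
def bpo_batch(queries, rollout, k=1):
    """Toy bi-mode rollout collection: for each query, gather k
    rollouts in BOTH modes so the policy receives reward feedback
    for thinking and non-thinking behavior on the same input."""
    batch = []
    for q in queries:
        for mode in ("think", "no_think"):
            for _ in range(k):
                batch.append({"query": q, "mode": mode,
                              "reward": rollout(q, mode)})
    return batch

def toy_rollout(q, mode):
    # Invented reward: thinking only helps on 'hard' queries
    # (in reality thinking also costs tokens, omitted here).
    return 1.0 if (q == "hard") == (mode == "think") else 0.0

batch = bpo_batch(["easy", "hard"], toy_rollout)
```

Comparing rewards across the two modes of the same query is what lets the policy learn a mode-selection rule rather than relying on a hand-designed reward.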
NeurIPS 2025: even high-scoring papers may be rejected, just to hold the acceptance rate around 25%?
机器之心· 2025-09-01 06:46
Chasing metrics or accepting more valuable papers: top academic conferences seem to face a "to be or not to be" dilemma of their own.

NeurIPS 2025 will be held December 2-7, 2025 in San Diego, USA, with Mexico City serving as a second official venue for the first time.

In recent days, according to widespread feedback on social media in China and abroad, this year's NeurIPS Meta Reviews (summary reviews written by area chairs or senior reviewers after the anonymous reviewers submit their comments) have been rolling out.

Source: Bai Song, research scientist at MiroMind (Xiaohongshu)

Information shared by more area chairs (ACs) points to factors that bear on whether submissions will ultimately be accepted.

One area chair said that in the DB (Datasets and Benchmarks) track, even a paper scored 4-4-4-5 (average 4.25) may be rejected. According to earlier statistics, this year's NeurIPS may have received a record 30,000 submissions.

He argued against rejecting papers that received positive, consensus reviewer scores merely to pin the acceptance rate between 20% and 25%, and called on the program chairs (PCs) to raise the acceptance rate. According to a reply from a Senior PC (senior program committee member), because the venue and resources are limited and submissions exceeded ...
A "hitchhiker's guide" to research agents: helping you build a domain-specific research agent
机器之心· 2025-09-01 02:49
Core Insights
- The article presents a comprehensive guide for constructing scientific agents based on large language models (LLMs), emphasizing the integration of AI in scientific research and addressing the epistemological and methodological gaps between AI and the natural sciences [2][4].

Summary by Sections

Overview of Scientific Agents
- The guide aims to provide a structured approach to building scientific agents, detailing the levels of agent capabilities and construction strategies throughout the entire scientific research lifecycle [2][4].

Levels of Scientific Agents
- Scientific agents are categorized into three levels:
- **Agent as Assistant**: Limited to specific tasks within a domain, constructed using small models through post-training or fine-tuning, with high performance on specialized tasks but lacking comprehensive operational capabilities [8].
- **Agent as Partner**: Integrates various tools for enhanced capabilities, utilizing closed-source large models and modular design to independently perform tasks like literature consultation and hypothesis generation, though still limited in self-validation and reliability [8].
- **Agent as Avatar**: Focuses on multi-dimensional capability enhancement, featuring strong reasoning, memory, and collaboration skills, capable of providing comprehensive support across various research stages [8].

Construction Process of Scientific Agents
- The construction process involves three main components:
- **Knowledge Organization**: Structuring scientific information for effective understanding and reasoning, including unstructured sequences, structured data, instructions, and knowledge graphs [12][14].
- **Knowledge Injection**: Embedding domain-specific expertise into agents through explicit or implicit methods to enhance their problem-solving capabilities [12][14].
- **Tool Integration**: Expanding agent functionalities by incorporating external tools for specialized tasks, enabling autonomous operation and coordination of resources [12][14].

Capability Enhancement of Scientific Agents
- **Memory Enhancement**: Essential for maintaining context and executing multi-step reasoning, utilizing various memory structures to support complex tasks [19].
- **Reasoning Enhancement**: Addressing limitations of LLMs through structured reasoning chains and domain-specific optimizations to improve output reliability [19].
- **Collaboration Enhancement**: Improving interactions between multi-agent systems and human researchers to optimize research outcomes [19].

Benchmarking and Evaluation
- Benchmarks are categorized into knowledge-intensive and experiment-driven tasks, each emphasizing different aspects of the scientific research process [17][18].
- **Knowledge-Intensive Tasks**: Focus on complex, domain-specific tasks requiring deep expertise [17].
- **Experiment-Driven Tasks**: Evaluate the agent's ability to design and validate experiments autonomously [18].

Future Research Directions
- Future efforts should focus on:
- Ensuring empirical accuracy in scientific experiment designs and integrating verification tools [23].
- Designing flexible frameworks for complex task adaptation in specific research areas [23].
- Incorporating self-reflection and iterative mechanisms for continuous improvement [23].
- Optimizing interactions between agents and human researchers to enhance scientific discovery [23].
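The tool-integration component described above can be pictured as a minimal dispatch loop (the tool names, registry layout, and plan format are illustrative assumptions, not the guide's specification):

```python
def run_agent(plan, tools):
    """Toy tool-integration loop: execute a research plan by
    dispatching each step to a registered external tool and
    accumulating results as context for later steps."""
    context = []
    for tool_name, arg in plan:
        if tool_name not in tools:
            raise KeyError(f"unregistered tool: {tool_name}")
        context.append(tools[tool_name](arg))
    return context

# Hypothetical tool registry standing in for real literature-search
# and lab-automation backends:
tools = {
    "search_literature": lambda q: f"3 papers found for '{q}'",
    "run_experiment": lambda cfg: f"experiment '{cfg}' completed",
}
trace = run_agent([("search_literature", "protein folding"),
                   ("run_experiment", "baseline")], tools)
```

A real system would have the LLM generate the plan and interpret the accumulated context; the registry pattern is what lets new domain tools be added without changing the loop.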
Explainer: deconstructing LLM post-training, and the past and present of GRPO and its successors
机器之心· 2025-09-01 02:49
Core Viewpoint
- The article discusses the evolution and significance of the Group Relative Policy Optimization (GRPO) algorithm in the context of large language models and reinforcement learning, highlighting its advantages and limitations compared to previous methods like Proximal Policy Optimization (PPO) [4][38].

Summary by Sections

Development of Large Language Models
- The rapid advancement of large language models has led to the emergence of various post-training methods, with GRPO being a notable innovation that enhances reinforcement learning paradigms [3][5].

Post-Training and Reinforcement Learning
- Post-training is crucial for refining models' capabilities in specific domains, enhancing adaptability and flexibility to meet diverse application needs [12][11].
- Reinforcement learning, particularly from human feedback (RLHF), plays a vital role in the post-training phase, aiming to optimize model outputs based on user preferences [14][19].

GRPO and Its Advantages
- GRPO eliminates the need for a separate critic model, reducing memory and computational costs significantly compared to PPO, which requires dual networks [30][35].
- GRPO uses the relative performance within a group of sampled responses as the baseline for evaluating improvements, simplifying the training process [34][35].

Comparison of GRPO and PPO
- GRPO offers substantial improvements in memory requirements and training speed, making it a more efficient choice for large language model training [37].
- Despite its advantages, GRPO still faces stability issues similar to those of PPO, particularly in smaller-scale reinforcement learning tasks [39].

Recent Innovations: DAPO, GSPO, and GFPO
- DAPO introduces enhancements to GRPO, such as Clip-Higher and dynamic sampling, to address practical challenges encountered during training [41][42].
- GSPO shifts importance sampling from the token level to the sequence level, significantly improving training stability [48][49].
- GFPO allows simultaneous optimization of multiple response attributes, addressing GRPO's limitations around scalar feedback and multi-round reasoning tasks [61][63].

Conclusion
- The evolution of post-training methods, from PPO to GRPO and beyond, illustrates a clear trajectory in optimizing large language models, with GRPO serving as a pivot for further advances in the field [81][82].
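The token-level versus sequence-level importance sampling distinction between GRPO and GSPO can be made concrete with a toy calculation (simplified: clipping and advantage weighting are omitted, and the two-token example is invented):

```python
import math

# Log-probs of a 2-token response under the new and old policies:
new_logps = [math.log(0.5), math.log(0.5)]
old_logps = [math.log(0.25), math.log(1.0)]

def token_level_ratios(logp_new, logp_old):
    """GRPO/PPO style: one importance ratio per token."""
    return [math.exp(n - o) for n, o in zip(logp_new, logp_old)]

def sequence_level_ratio(logp_new, logp_old):
    """GSPO style: a single length-normalized ratio for the whole
    sequence (the geometric mean of the per-token ratios), so one
    outlier token cannot destabilize the update."""
    t = len(logp_new)
    return math.exp((sum(logp_new) - sum(logp_old)) / t)

tok = token_level_ratios(new_logps, old_logps)    # per-token ratios vary
seq = sequence_level_ratio(new_logps, old_logps)  # smoothed to one value
```

Here the per-token ratios disagree sharply (2.0 and 0.5) while the sequence-level ratio smooths them into a single value of 1.0, illustrating why sequence-level sampling yields more stable updates.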