A 7B agent trained on just 9 tasks surpasses R1! Shanghai Jiao Tong University forges a new AI-for-AI paradigm
机器之心· 2025-06-21 01:33
Core Viewpoint
- The article discusses the emergence of AI-for-AI (AI4AI) as a solution to the limitations of traditional AI development, which relies heavily on human intervention and manual tuning, slowing innovation and the path to Artificial General Intelligence (AGI) [1][6].

Group 1: AI4AI Development
- AI4AI aims to enable AI agents to autonomously design, optimize, and improve AI algorithms, significantly reducing human involvement and accelerating the iterative development cycle [1][6].
- A recent study by Shanghai Jiao Tong University and Shanghai AI Lab demonstrated that a 7-billion-parameter AI agent (ML-Agent) could surpass the 671-billion-parameter DeepSeek-R1 by adopting a new paradigm of "experiential learning" [2][9].

Group 2: Traditional Machine Learning Challenges
- Traditional machine learning workflows are time-consuming and inefficient, often requiring days to months for model design and parameter tuning, which limits the pace of AI innovation [4][5].
- Existing AI agents still depend on human-designed prompts, trapping developers in a cycle of waiting, modifying, and retrying that perpetuates the inefficiency [5][6].

Group 3: Breakthroughs in Autonomous Machine Learning
- The study introduces a learning-based paradigm for autonomous machine learning in which agents learn from execution trajectories through online reinforcement learning, enabling proactive exploration of strategies [7][9].
- ML-Agent, powered by a 7-billion-parameter model, achieved remarkable performance gains after learning from just nine machine learning tasks, demonstrating its ability to generalize across tasks [20][24].

Group 4: Training Framework and Methodologies
- The training framework rests on three core breakthroughs that drive the agent's self-evolution, including exploration-enriched fine-tuning and a step-wise reinforcement learning paradigm [11][15].
- A customized reward module unifies feedback from heterogeneous experimental results into consistent signals for reinforcement learning optimization (a minimal sketch follows this summary) [19][20].

Group 5: Performance Comparison and Results
- ML-Agent outperformed several advanced AI models on both seen and unseen machine learning tasks, demonstrating strong generalization [20][22].
- ML-Agent's performance improved consistently throughout training, surpassing all baseline methods and establishing a new paradigm for AI design [24][25].

Group 6: Community and Future Directions
- ML-Agent is part of the MASWorks open-source community, which aims to connect global researchers and foster collaboration in the multi-agent systems field [26][27].
- The community plans to host a workshop on large language models and multi-agent systems at ICML 2025, encouraging participation from scholars worldwide [28].
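The summary mentions a customized reward module but not its formula. A minimal sketch of the underlying idea, mapping heterogeneous experiment outcomes onto one bounded RL signal, might look like the following; the penalty values, constants, and tanh squashing are illustrative assumptions, not ML-Agent's actual design:

```python
import math
from typing import Optional

def unified_reward(run_ok: bool, metric: Optional[float],
                   best_so_far: Optional[float],
                   higher_is_better: bool = True) -> float:
    """Map one ML-experiment outcome to a bounded reward in [-1, 1]."""
    if not run_ok or metric is None:
        return -1.0  # crashed or metric-less runs get a fixed penalty
    if best_so_far is None:
        return 0.1   # first successful run: small positive signal
    delta = metric - best_so_far
    if not higher_is_better:
        delta = -delta
    # squash relative improvement so tasks with different metric scales
    # produce comparable reward signals
    return math.tanh(delta / (abs(best_so_far) + 1e-8))

print(unified_reward(True, 0.92, 0.90))   # small positive reward
print(unified_reward(False, None, 0.90))  # -1.0 for a failed run
```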
Ditch CUDA programming! A CMU-led team compiles LLMs into a megakernel with a few dozen lines of code, cutting inference latency by up to 6.7x
机器之心· 2025-06-21 01:33
Core Viewpoint
- The Mirage Persistent Kernel (MPK) compiler, introduced by a team led by Zhihao Jia at CMU, reduces the inference latency of large language models (LLMs) by 1.2 to 6.7 times, addressing the high manual optimization costs and end-to-end delays of CUDA-driven LLM inference [3][4][12].

Group 1: Introduction of MPK
- MPK automatically converts LLMs into optimized megakernels that execute the entire model without interruption, enhancing performance [9][10].
- Developers can compile LLMs with minimal manual effort, requiring only a few dozen lines of Python code [5][12].

Group 2: Performance Advantages
- MPK eliminates kernel launch overhead and maximizes the overlap of computation, data loading, and inter-GPU communication, resulting in significantly lower inference latency [14][18].
- MPK's performance gains grow with the number of GPUs, making it particularly efficient in multi-GPU deployments [18].

Group 3: Working Mechanism of MPK
- MPK consists of two main components: a compiler that transforms LLM computation graphs into fine-grained task graphs, and a runtime system that executes these task graphs within a single megakernel (a conceptual sketch of such a task graph follows this summary) [19][24].
- The compiler captures dependencies at a finer granularity than existing systems, enabling more aggressive pipeline optimizations [26][27].

Group 4: Future Plans
- The team aims to improve MPK's usability and performance, with ongoing work on dynamic workloads and advanced scheduling strategies [40][43].
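The summary describes the compiler/runtime split only at a high level. The following is a conceptual sketch of the fine-grained task graph idea, where a task starts as soon as the events it depends on have fired rather than at whole-kernel boundaries. It is plain Python for illustration and is not MPK's actual API; on a GPU, the ready tasks would execute concurrently inside the megakernel.

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    name: str
    deps: set = field(default_factory=set)   # events this task waits on
    signals: str = ""                        # event this task fires when done

def run_task_graph(tasks):
    """Dependency-driven execution: run whatever is ready, fire its event."""
    fired = set()                 # events already triggered
    pending = list(tasks)
    while pending:
        ready = [t for t in pending if t.deps <= fired]
        assert ready, "cycle or unsatisfiable dependency"
        for t in ready:           # on a GPU these would run concurrently
            print(f"executing {t.name}")
            if t.signals:
                fired.add(t.signals)
            pending.remove(t)

run_task_graph([
    Task("attention[layer0]", set(), "attn0_done"),
    Task("allreduce[layer0]", {"attn0_done"}, "ar0_done"),
    Task("mlp[layer0]", {"ar0_done"}, "mlp0_done"),
])
```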
It's 2025: how are enterprises spending their AI procurement budgets?
机器之心· 2025-06-20 17:04
This article is from the PRO member newsletter; follow "机器之心PRO会员" at the end of the article for more in-depth topic briefs.

a16z recently released its 2025 report on how enterprises buy AI. Based on in-depth interviews and broad surveys of enterprise executives worldwide, the report reveals the key 2025 trends in how enterprises procure, deploy, and budget for generative AI, with LLMs as its flagship technology.

Table of Contents
01. Why do enterprise AI budgets only keep growing?
Why does enterprise AI spending keep rising? How is the composition of enterprise AI budgets changing? How are the goals of enterprise AI deployment shifting?...
02. Comparison shopping: what makes an LLM worth paying for?
Why do enterprises value an LLM's "differentiation" over its "commoditization"? Why are open-source models increasingly popular? How do large and small enterprises differ in their LLM preferences?...
03. How do enterprises buy AI models the way they buy traditional software?
What factors do enterprises weigh when procuring AI models? How do external benchmarks affect AI procurement?...

① The report is part of an a16z research series; the team previously published "16 Changes to the Way Enterprises Are Building and Buying Generative AI" in February 2024, drawing on interviews and surveys with leaders of dozens of Fortune 500 and other top enterprises and more than 70 executives to arrive at 16 core findings ...
A breakthrough in open-world mobile manipulation! The first multimodal agent for indoor mobile grasping debuts, with the fine-tuned model reaching 90% zero-shot action accuracy in real environments
机器之心· 2025-06-20 11:59
In home service robotics, enabling robots to understand natural-language instructions in open environments, dynamically plan action paths, and execute manipulations precisely has long been a core challenge for both academia and industry.

Recently, a research team from the Shanghai AI Laboratory, together with the National University of Singapore, the University of Hong Kong, and other institutions, proposed the OWMM-Agent embodied agent: the first multimodal agent (VLM agent) architecture designed specifically for open-world mobile manipulation (OWMM), and the first to unify global scene understanding, robot state tracking, and multimodal action generation in a single model.

The team also synthesized agent trajectory data in a simulator to fine-tune OWMM-VLM, a multimodal large model specialized for this task; in real-environment tests it achieved 90% zero-shot single-step action prediction accuracy (a sketch of one such agent step follows this summary).

Paper: https://arxiv.org/pdf/2506.04217
GitHub: https://github.com/HHYHRHY/OWMM-Agent

1. Background: mobile grasping under open semantics

When traditional mobile grasping robots handle open-ended instructions in home scenarios, such as "clear the dining table and put the fruit back in the bowl," they usually depend on pre-built 3D scene reconstructions or semantic maps, which is time-consuming and hard to adapt to dynamic environments. The core difficulties of the OWMM task lie in:

2. OWMM-Agent: rebuilding the robot's "brain ...
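The article does not show OWMM-Agent's interface; the sketch below illustrates what a single zero-shot action-prediction step of a VLM agent of this kind could look like. The client object, prompt layout, and JSON action schema are all hypothetical, not the OWMM-Agent API.

```python
import json

def owmm_step(vlm_client, global_scene_images, current_view, instruction, state):
    """One observation -> action prediction step of a VLM-driven robot."""
    prompt = (
        "You are a mobile manipulation robot.\n"
        f"Instruction: {instruction}\n"
        f"Robot state: {json.dumps(state)}\n"
        "Given the global scene images and your current camera view, "
        'reply with JSON: {"action": "navigate|pick|place", '
        '"target": "<object or location>"}'
    )
    # the VLM sees global scene memory plus the current egocentric view
    reply = vlm_client.generate(images=global_scene_images + [current_view],
                                text=prompt)
    return json.loads(reply)  # e.g. {"action": "pick", "target": "apple"}
```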
Just announced: Huawei Pangu 5.5 arrives, with a surge in reasoning and agent capabilities
机器之心· 2025-06-20 11:59
Core Viewpoint
- Huawei's Pangu model series emphasizes practical applications across industries, focusing on intelligent upgrades and earning significant market recognition through its iterations from Pangu 1.0 to Pangu 5.0 [2][3].

Group 1: Pangu Model 5.5 Release
- Huawei officially launched Pangu 5.5 at HDC 2025, showcasing advanced natural language processing (NLP) capabilities and pioneering achievements in multimodal models [3][5].
- The upgraded Pangu 5.5 comprises five foundational models covering NLP, multimodal, prediction, scientific computing, and computer vision (CV), positioning itself as a core driver of industry digital transformation [4][46].

Group 2: NLP Models
- Pangu 5.5 features three main NLP models: Pangu Ultra MoE, Pangu Pro MoE, and Pangu Embedding, along with an efficient reasoning strategy and the DeepDiver product [7].
- Pangu Ultra MoE is a near-trillion-parameter model with 718 billion parameters, achieving domestic leadership and international competitiveness through innovative training methods [9][10].
- Pangu Pro MoE, with 72 billion parameters, ranked first domestically among models under 100 billion parameters on the SuperCLUE leaderboard, demonstrating its effectiveness in intelligent tasks [18][20].
- Pangu Embedding, a 7-billion-parameter model, excels in knowledge, coding, mathematics, and dialogue, outperforming contemporaneous models [27][32].

Group 3: Technological Innovations
- Huawei introduced adaptive fast-slow thinking in the Pangu models, matching the depth of reasoning to problem complexity and improving reasoning efficiency by up to 8 times (a toy routing sketch follows this summary) [35].
- The DeepDiver model strengthens high-level capabilities such as autonomous planning and exploration, achieving notable efficiency on complex question-answering tasks [41][44].

Group 4: Other Model Applications
- Pangu 5.5 also includes models for scientific computing, industrial prediction, and computer vision, showcasing its versatility across sectors [46].
- The scientific computing model works with the Shenzhen Meteorological Bureau to improve weather forecasting accuracy through AI integration [47].
- The CV model, with 30 billion parameters, supports diverse visual data analysis and decision-making, significantly enhancing operational capabilities in industrial scenarios [47].
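The article describes adaptive fast-slow thinking only at a high level. As a loose illustration of the routing idea, the toy sketch below sends easy queries down a cheap direct-answer path and hard ones down a long-reasoning path; the complexity heuristic, threshold, and model interfaces are invented for illustration and are not Huawei's actual mechanism.

```python
def estimate_complexity(question: str) -> float:
    """Toy proxy: longer, reasoning-heavy questions score as more complex."""
    score = min(len(question) / 200.0, 1.0)
    if any(tok in question.lower() for tok in ("prove", "integral", "why", "step")):
        score += 0.5
    return score

def answer(question: str, fast_model, slow_model, threshold: float = 0.6):
    """Route to a direct answer or a long chain-of-thought pass."""
    if estimate_complexity(question) < threshold:
        return fast_model(question)   # fast path: direct answer
    return slow_model(question)       # slow path: extended reasoning

# answer("What is 2+2?", fast, slow)                      -> fast path
# answer("Prove the integral converges, step by step", fast, slow) -> slow path
```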
The Agentic AI moment! Powered by multi-agent systems, the "one-person company" is almost here
机器之心· 2025-06-20 10:37
Core Viewpoint
- The article discusses the rapid advances in Agentic AI, emphasizing its potential to transform industries by automating complex tasks and enhancing productivity through innovative applications and tools [2][18][26].

Group 1: Agentic AI Overview
- Agentic AI marks a shift from basic AI interactions to more autonomous capabilities, allowing AI to perform tasks independently based on user instructions [5][18].
- The technology enables AI to run for extended periods, perceive its environment, and use various tools to complete complex tasks, with significant gains in problem-solving ability [3][4].

Group 2: Practical Applications
- Amazon's Q Developer lets users create applications with minimal coding, showing how easy AI-driven solutions have become to build [6][8].
- Integrating AI into software development can yield substantial time savings, as demonstrated by the migration of thousands of applications in a short period [56][59].

Group 3: Business Impact
- Companies adopting Agentic AI report higher productivity, lower costs, and faster innovation cycles, indicating a tangible impact on operational efficiency [19][24].
- The collaboration between Fosun Pharma and Amazon Web Services has significantly cut the time and cost of medical writing and translation tasks [24][26].

Group 4: Future Outlook
- The article predicts that by 2028, 15% of daily work decisions will be made autonomously by Agentic AI, marking a significant shift in how software applications are defined and used [68].
- Amazon Web Services positions Agentic AI as a potential multi-billion-dollar business, reflecting its strategic importance to the company's future growth [64][65].
Breaking recommender systems' "information silos"! USTC and Huawei propose the first generative multi-stage unified framework, surpassing SOTA across the board
机器之心· 2025-06-20 10:37
Core Viewpoint
- The article discusses UniGRF, an innovative framework that unifies retrieval and ranking in recommendation systems within a single generative model, addressing inherent problems of the traditional multi-stage recommendation paradigm [1][3][16].

Group 1: Pain Points of Traditional Recommendation Paradigms
- Traditional recommendation systems use a multi-stage approach: a recall phase quickly filters a large item pool, then a ranking phase scores and orders the candidates. While efficient, the independent training of each phase often causes information loss and performance bottlenecks [3][4].
- The separation of tasks can prematurely filter out potential interests outside the user's information bubble, producing cumulative biases and making inter-stage collaboration difficult [3][4].

Group 2: Advantages of UniGRF
- UniGRF integrates retrieval and ranking into a single generative model, allowing full information sharing and reducing information loss between tasks [7].
- The framework is model-agnostic and integrates seamlessly with mainstream autoregressive generative architectures [8].
- Maintaining one model instead of two independent ones can improve efficiency in both training and inference [9].

Group 3: Key Mechanisms of UniGRF
- A Ranking-Driven Enhancer promotes collaboration between recall and ranking by using the high-precision ranking outputs to guide the recall process [10][11].
- A Gradient-Guided Adaptive Weighter dynamically adjusts the loss weights of the two tasks according to their learning rates, keeping optimization synchronized and lifting overall performance (a hedged sketch follows this summary) [12].

Group 4: Experimental Results
- Extensive experiments on three large public recommendation datasets (MovieLens-1M, MovieLens-20M, Amazon-Books) show that UniGRF significantly outperforms state-of-the-art (SOTA) models, underscoring the advantages of its unified design [14][18].
- Improvements are especially notable in ranking performance, which directly determines the quality of recommendations presented to users [18].
- Initial tests indicate that UniGRF follows the scaling law, suggesting further gains with more model parameters [18].

Group 5: Future Directions
- UniGRF offers a novel, efficient solution for generative recommendation that overcomes the problems of the traditional multi-stage paradigm. Future work aims to extend the framework to more recommendation stages and validate its large-scale applicability in real industrial scenarios [16][17].
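The article names the Gradient-Guided Adaptive Weighter without giving its formula. The sketch below follows a common GradNorm-style recipe that fits the description, up-weighting whichever task's loss is descending more slowly; the loss-ratio definition, softmax form, and all names are assumptions, not UniGRF's published mechanism.

```python
import torch

def adaptive_weights(loss_retrieval, loss_ranking,
                     init_retrieval, init_ranking, alpha=1.0):
    """Tasks that have improved less (higher loss ratio) get more weight."""
    r1 = (loss_retrieval / init_retrieval).detach()  # retrieval learning rate
    r2 = (loss_ranking / init_ranking).detach()      # ranking learning rate
    w = torch.softmax(alpha * torch.stack([r1, r2]), dim=0)
    return 2 * w  # rescale so weights sum to the number of tasks

# Inside the training loop (l_ret0 / l_rank0 are the initial losses):
# w_ret, w_rank = adaptive_weights(l_ret, l_rank, l_ret0, l_rank0)
# total_loss = w_ret * l_ret + w_rank * l_rank
```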
Behind the viral Luo Yonghao digital humans, an AI director is quietly rewriting the livestream "script"
机器之心· 2025-06-20 10:37
Core Viewpoint
- AI live streaming has evolved from a gimmick into a viable business, with AI-generated digital hosts proving effective at engaging audiences and driving sales [2][5][24].

Group 1: AI Live Streaming Performance
- During the 618 shopping festival, an AI livestream featuring digital personas of Luo Yonghao and Zhu Xiaomu attracted over 13 million viewers and generated a GMV exceeding 55 million yuan, outperforming Luo's own livestream debut [3][5].
- The digital hosts showed a high level of interaction and humor, engaging the audience so effectively that even Luo Yonghao himself was surprised [4][5].

Group 2: Technology Behind the Digital Hosts
- The digital personas were built on Baidu's multimodal collaborative digital human technology, which combines script-driven multimodal coordination, dynamic decision-making for real-time interaction, and high-fidelity long-video generation [6][7].
- The core of the technology is script generation, covering dialogue, multimodal driving, and dynamic interaction, ensuring the digital hosts' personalities and styles are accurately represented [10][12].

Group 3: Script Generation and Interaction
- Script generation addresses three key issues: style modeling for diverse dialogue, character modeling for realistic personas, and content planning for accuracy and engagement [12].
- Multimodal driving lets the language model generate dialogue while simultaneously producing visual and audio cues, keeping speech and actions synchronized (an illustrative parsing sketch follows this summary) [13].

Group 4: Voice Synthesis and Emotional Expression
- Baidu's "text-controlled voice synthesis" ensures the generated speech carries emotional nuance and natural rhythm, making the digital hosts more relatable and engaging [16].
- The technology also handles dual-host interactions, ensuring seamless turn-taking and a natural dialogue flow between digital personas [16].

Group 5: High-Fidelity Video Generation
- High-fidelity digital human generation focuses on consistency across audio, visuals, and dialogue, which is crucial for maintaining viewer immersion during long livestreams [18][20].
- Baidu models character and product interactions independently to keep engagement accurate and responsive throughout the livestream [20].

Group 6: Future Implications
- Baidu's early investment in AI has positioned it as a leader in the field, with continuous advances in its large-model capabilities enhancing the realism and intelligence of digital hosts [22][24].
- The success of the Luo Yonghao digital livestream exemplifies the practical, commercial application of Baidu's technology, pointing to further exploration of innovative business models [24].
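The article describes multimodal driving only conceptually. One plausible realization, sketched below under assumptions, is for the language model to emit inline action tags alongside the dialogue, which a parser then splits into a clean speech track for TTS and a position-aligned gesture track; the tag format and cue names are invented for illustration and are not Baidu's actual representation.

```python
import re

script = "大家好<wave>,今天这款手机<point:item_1>只要 999 元<smile>!"

def parse_script(text):
    """Split an LLM-emitted script into speech text and timed action cues."""
    speech, actions = "", []
    for part in re.split(r"(<[^>]+>)", text):
        if part.startswith("<"):
            actions.append((len(speech), part.strip("<>")))  # cue position
        else:
            speech += part
    return speech, actions

speech, actions = parse_script(script)
print(speech)    # clean line handed to the TTS engine
print(actions)   # [(3, 'wave'), (10, 'point:item_1'), (18, 'smile')]
```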
Skywork doesn't just create things, it also fixes bugs: Skywork-SWE gives code agents a course in software engineering
机器之心· 2025-06-20 02:22
Core Viewpoint
- The article discusses the emergence of Skywork-SWE, an autonomous code agent model developed by Kunlun Wanwei to tackle the complexities of software engineering and bug fixing in modern code systems, drawing a parallel to the craftsmanship spirit of ancient Chinese artisans [2][7][40].

Group 1: Background and Challenges
- The need for Skywork-SWE arises from the growing complexity of software systems, which are integral to modern civilization yet prone to bugs from logical errors, environmental changes, and other factors [3][4].
- Bug fixing is a fundamental yet complex software engineering task, often requiring deep understanding and multi-round reasoning, much like the work of a human developer [4][6].

Group 2: Development of Skywork-SWE
- Kunlun Wanwei built Skywork-SWE as a high-performance 32-billion-parameter model, a complete system integrating data collection, validation, reasoning, and bug fixing [7][18].
- The model was trained on a large-scale, verifiable software engineering dataset constructed through a structured, automated process of three main phases and nine steps (a sketch of the fail-then-pass validation check follows this summary) [12][18].

Group 3: Dataset Characteristics
- The dataset comprises 10,169 real code issues and 8,209 multi-round interaction trajectories, making it one of the largest and highest-quality software engineering datasets available [18][20].
- Compared with existing datasets, Skywork-SWE's tasks are significantly more complex, averaging more than 2 function modifications and 74 changed lines of code per patch, reflecting real-world development challenges [20][21].

Group 4: Performance and Scaling Law
- Skywork-SWE-32B achieved 47% accuracy on the SWE-bench Verified benchmark, outperforming other models at its parameter scale and even some larger models [25][33].
- The experiments revealed a scaling law in LLM software engineering capability: performance improves as training data expands, with no sign of saturation at the current dataset scale [27][29].

Group 5: Future Implications
- The success of Skywork-SWE signals a shift toward high-quality, task-oriented data as the foundation for training software engineering agents, potentially setting a new industry standard [40][42].
- Kunlun Wanwei plans to extend the Skywork-SWE dataset to more programming languages and to enhance the model through online reinforcement learning [41][42].
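The pipeline's three phases and nine steps are not enumerated here, but its "verifiable" core can be illustrated with the standard fail-then-pass check used for SWE-style datasets: keep an instance (repo + issue + gold patch + tests) only if the tests fail before the patch and pass after it. The commands, paths, and function names below are illustrative, not Skywork-SWE's actual tooling.

```python
import subprocess

def tests_pass(repo_dir: str, test_cmd: str = "python -m pytest -q") -> bool:
    """Run the repository's test suite and report success."""
    result = subprocess.run(test_cmd.split(), cwd=repo_dir,
                            capture_output=True)
    return result.returncode == 0

def validate_instance(repo_dir: str, patch_file: str) -> bool:
    """Keep an instance only if tests fail pre-patch and pass post-patch."""
    if tests_pass(repo_dir):              # tests must fail before the fix
        return False
    subprocess.run(["git", "apply", patch_file], cwd=repo_dir, check=True)
    ok = tests_pass(repo_dir)             # and pass after the fix
    subprocess.run(["git", "apply", "-R", patch_file], cwd=repo_dir,
                   check=True)            # restore the pre-patch state
    return ok
```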
Anyone can make music! Tencent AI Lab open-sources SongGeneration, a large music generation model
机器之心· 2025-06-20 00:58
Core Viewpoint
- Tencent AI Lab has launched and open-sourced the SongGeneration music generation model, addressing common challenges in music AIGC such as sound quality, musicality, and generation speed, and achieving performance superior to existing models [1][6].

Group 1: Model Performance and Features
- SongGeneration significantly improves sound quality while maintaining generation speed, outperforming many existing open-source models across melody, accompaniment, sound quality, and structure [1][5].
- The model supports text control, multi-track synthesis, and style following, serving both C-end creators and B-end demands for stability and scalability [2][8].
- Compared with traditional rule-based approaches or small models, large-model music generation shows stronger generalization and generative potential, moving AI music creation from "assistance" to "intelligent co-creation" [5][6].

Group 2: Technical Solutions
- SongGeneration's architecture couples a music data pipeline with a generation model, using audio separation, structure analysis, and lyric recognition modules to train on a large song corpus [10][12].
- The model employs an innovative low-bitrate music codec that reconstructs high-quality music at extremely low bitrates, easing the modeling burden on the language model [19][20].
- A multi-category token parallel prediction strategy improves the harmony between vocals and accompaniment, lifting sound quality and musicality (a sketch follows this summary) [21].

Group 3: Training Paradigm and Evaluation
- SongGeneration adopts a novel three-stage training paradigm: pre-training, modular expansion training, and multi-preference alignment, optimizing language-model-based music generation [27][30].
- The evaluation framework combines objective analysis with subjective perception, benchmarking SongGeneration against commercial and open-source models across multiple key dimensions [29][31].
- In objective assessments, SongGeneration ranks first among open-source models and is competitive with commercial ones, demonstrating both technical completeness and artistic expressiveness [32][33].

Group 4: User Experience and Accessibility
- SongGeneration can be tried online on Hugging Face, and all model weights and code are open-sourced for community engagement and feedback [36].
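The summary names the multi-category token parallel prediction strategy but gives no architecture details. The sketch below shows one common way to realize parallel prediction, with one output head per track category over a shared hidden state so each step yields tokens for all tracks at once; the sizes, track names, and head design are assumptions, not SongGeneration's actual implementation.

```python
import torch
import torch.nn as nn

class ParallelTrackHead(nn.Module):
    """Predict one token distribution per track category from a shared state."""
    def __init__(self, hidden=1024, vocab=16384,
                 tracks=("vocals", "accompaniment")):
        super().__init__()
        self.heads = nn.ModuleDict(
            {t: nn.Linear(hidden, vocab) for t in tracks})

    def forward(self, h):                    # h: [batch, seq, hidden]
        # one forward pass yields logits for every track in parallel
        return {t: head(h) for t, head in self.heads.items()}

head = ParallelTrackHead()
h = torch.randn(2, 8, 1024)
logits = head(h)
print({t: v.shape for t, v in logits.items()})
# {'vocals': torch.Size([2, 8, 16384]), 'accompaniment': torch.Size([2, 8, 16384])}
```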