机器之心
Mamba Authors' Team Proposes SonicMoE: One Token-Rounding Trick Nearly Doubles MoE Training Speed
机器之心· 2025-12-19 06:38
Core Insights
- The MoE (Mixture of Experts) architecture has become the standard approach for scaling language models without a proportional increase in compute, with a clear trend toward higher expert granularity and sparsity, both of which improve model quality per unit of FLOPs [1][2]

MoE Model Trends
- Recent open-source models such as DeepSeek V3, Kimi K2, and Qwen3 MoE adopt finer-grained expert designs and higher sparsity, greatly increasing total parameter count while keeping the number of active parameters roughly constant [1][2]
- A table of recent models lists their parameter counts, expert activation ratios, and expert granularities; Mixtral 8x22B, for example, has 141 billion total parameters and a 25% expert activation ratio [2]

Hardware Efficiency Challenges
- The pursuit of extreme granularity and sparsity in MoE designs creates significant hardware efficiency problems, which motivated SonicMoE, a solution tailored for NVIDIA Hopper and Blackwell architecture GPUs [3]
- SonicMoE delivers clear performance gains, with a 43% speedup in forward propagation and up to 115% in backward propagation over existing baselines [3]

Memory and IO Bottlenecks
- In fine-grained MoE models, activation memory grows linearly with the number of active experts, increasing memory pressure during both forward and backward propagation [4]
- The lower arithmetic intensity of small, dispersed experts causes more frequent IO access, pushing model training into a memory-bound regime [4]

Efficient Algorithms
- SonicMoE computes routing gradients without caching activation values, cutting backward-propagation memory usage by 45% for fine-grained models [4]
- Its design overlaps computation with IO, effectively hiding the high IO latency associated with fine-grained MoE [4]

Token Rounding Strategy
- The token rounding method optimizes how tokens are distributed to experts, minimizing the computational waste caused by tile quantization and thus improving training efficiency without compromising model quality [4][20][26]

Performance Metrics
- SonicMoE reaches a training throughput of 213 billion tokens per day on 64 H100 GPUs, matching the efficiency of 96 H100 GPUs running ScatterMoE [6]
- Activation memory usage stays constant even as expert granularity increases, with speedups of 0.20x to 1.59x over existing baselines [9][15]

Open Source Contribution
- The team has open-sourced the relevant kernel code, giving the large-model community a robust tool for accelerating high-performance MoE training [7]
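The tile-quantization effect behind token rounding can be sketched in a few lines of Python. This is an illustrative toy, not SonicMoE's actual kernel logic: the tile size of 128 and the snap-to-nearest rounding rule are assumptions for demonstration only.

```python
import math

TILE = 128  # assumed GEMM tile height; the real kernel's tile size may differ

def tile_waste(count, tile=TILE):
    """Padded rows a tiled GEMM must process when `count` tokens are
    routed to one expert (the tile quantization effect)."""
    if count == 0:
        return 0
    return math.ceil(count / tile) * tile - count

def round_counts(counts, tile=TILE):
    """Illustrative token rounding: snap each expert's token count to the
    nearest tile multiple (dropping or adding a few tokens) so that every
    launched tile is completely full."""
    return [tile * round(c / tile) for c in counts]

counts = [130, 250, 3, 129]
print([tile_waste(c) for c in counts])  # padding paid per expert before rounding
print(round_counts(counts))             # after rounding: every count is a tile multiple
```

With the raw counts above, the experts holding 130, 3, or 129 tokens each force the GEMM to process nearly a full tile of padding; after rounding, only full tiles are launched.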
Dissecting CANN: Huawei Decides to Open the "Black Box" of Compute
机器之心· 2025-12-19 06:38
Core Viewpoint
- The article discusses Huawei's recent announcement that its Ascend CANN software will be open-sourced, aiming to lower the barriers to AI tool development and foster a new AI computing ecosystem [2][30]

Group 1: CANN Open Source and Developer Empowerment
- CANN (Compute Architecture for Neural Networks) serves as the bridge between AI training frameworks and the underlying AI chips, letting developers use the hardware's compute without needing to understand chip-level details [2][5]
- Open-sourcing CANN has drawn significant industry attention because it empowers developers to define computing capabilities and customize their AI models [2][6]
- CANN integrates seamlessly with major AI frameworks such as PyTorch, TensorFlow, MindSpore, and PaddlePaddle, increasing developer flexibility [5][6]

Group 2: Development Paths Offered by CANN
- CANN offers three development paths for different kinds of developers:
  1. For those familiar with Python, CANN integrates with the Triton ecosystem, allowing easy migration of existing code [9]
  2. For system-level programmers chasing maximum performance, Ascend C exposes low-level resource management [10]
  3. For developers who prioritize ease of use, the CATLASS operator template library simplifies building matrix multiplication operators [11][13]
- The MLAPO fused operator, part of the CATLASS library, significantly reduces computation time and improves performance in large models [15]

Group 3: Architectural Innovations
- CANN's architecture adopts layered decoupling, allowing components to evolve independently and reducing integration complexity for developers [21][22]
- The decoupling lets developers selectively upgrade specific components as needed, easing customization and integration [23][29]
- CANN has moved from a monolithic software structure to a modular one, with independent components for each functional area, improving flexibility and performance [24][26]

Group 4: Open Source Community and Growth
- The CANN open-source initiative is progressing actively, with over 27 sub-projects and more than 3,700 stars across its repositories [35]
- The community-driven approach invites developer contributions, expanding the ecosystem and compounding the technology's value through collaboration [31][32]
- CANN's repositories include a range of core libraries and tools, giving developers ready-to-use building blocks for AI application development [16][36]
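As a rough picture of what an operator template library abstracts away, the sketch below writes a tiled matrix multiply in plain Python. It is purely pedagogical and assumes nothing about CATLASS itself (whose templates are hardware-specific C++); the point is only the tiled loop nest, which such libraries parameterize per chip, and tile=2 is an arbitrary choice.

```python
def matmul_tiled(A, B, tile=2):
    """Minimal tiled matrix multiply: compute C = A @ B by walking the
    output and reduction dimensions in fixed-size tiles, the loop
    structure that operator template libraries instantiate per tile size."""
    n, k, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, tile):          # tile over output rows
        for j0 in range(0, m, tile):      # tile over output columns
            for p0 in range(0, k, tile):  # tile over the reduction dim
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, m)):
                        s = 0.0
                        for p in range(p0, min(p0 + tile, k)):
                            s += A[i][p] * B[p][j]
                        C[i][j] += s
    return C

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(matmul_tiled(A, B))
```

A production template library generates this same nest with the tile sizes, data layouts, and memory movement chosen for the target hardware, which is the complexity CATLASS hides from the developer.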
T5Gemma Updated Again: Google Is Sticking with the Encoder-Decoder Architecture
机器之心· 2025-12-19 03:42
Core Viewpoint
- Google has recently stepped up its model releases, introducing Gemini 3 Flash and, unexpectedly, T5Gemma 2, which builds on the capabilities of the Gemma 3 series [1][3]

Group 1: T5Gemma 2 Overview
- T5Gemma 2 is a new-generation encoder-decoder model, the first in the line to support multi-modal and long-context capabilities, built on the strengths of Gemma 3 [9]
- The model comes in three pre-trained scales (270M-270M, 1B-1B, and 4B-4B) and is the first high-performance encoder-decoder model in the community to support ultra-long contexts of up to 128K tokens [9][11]

Group 2: Innovations and Upgrades
- T5Gemma 2 continues T5Gemma's adaptation-training approach, converting a pre-trained decoder-only model into an encoder-decoder model, while leveraging key innovations from Gemma 3 to extend into the vision-language domain [13]
- Key architectural innovations include:
  1. Word embeddings shared between the encoder and decoder, reducing the total parameter count and packing more capability into the same memory footprint [15]
  2. Self-attention and cross-attention merged into a unified attention layer, improving parallelization efficiency and inference performance [15][16]

Group 3: Model Capabilities
- T5Gemma 2 brings significant capability upgrades:
  1. Multi-modal capability: the model understands and processes both images and text, enabling tasks such as visual question answering and multi-modal reasoning [17]
  2. Extended context: context windows of up to 128K tokens, handled via a local-global alternating attention mechanism [18]
  3. Large-scale multilingual support: operation in over 140 languages, thanks to training on larger and more diverse datasets [19]

Group 4: Performance Results
- T5Gemma 2 sets a new standard for compact encoder-decoder models, performing strongly across key capability areas and inheriting the multi-modal and long-context strengths of Gemma 3 [21]
- In benchmark tests, T5Gemma 2 outperforms both Gemma 3 and T5Gemma in multi-modal performance, long-context capability, and general capability across a range of tasks [25][29]
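The parameter saving from shared word embeddings is easy to see in a toy sketch. Everything here is illustrative: the sizes are made up and far smaller than any real configuration, and the class is a stand-in, not T5Gemma 2's implementation.

```python
class Embedding:
    """Toy lookup table standing in for a learned embedding matrix."""
    def __init__(self, vocab, dim):
        self.table = [[0.0] * dim for _ in range(vocab)]

    def __call__(self, ids):
        return [self.table[i] for i in ids]

    @property
    def n_params(self):
        return len(self.table) * len(self.table[0])

vocab, d_model = 1000, 64   # made-up sizes for illustration
shared = Embedding(vocab, d_model)
encoder_embed = shared  # encoder and decoder reference the SAME table,
decoder_embed = shared  # so its weights are stored (and counted) once

tied_params = shared.n_params
untied_params = 2 * tied_params  # separate encoder/decoder tables double the cost
print(tied_params, untied_params)
```

Because both sides alias one table, the vocabulary embedding is paid for once rather than twice, freeing the saved parameters for capability elsewhere in the same memory footprint.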
Grounded in Real Data and Physics Simulation: NUDT Open-Sources RoboBPP, a Benchmark for Embodied Online Bin Packing
机器之心· 2025-12-19 03:42
Core Insights
- The article discusses the importance of physical feasibility and embodied executability in the 3D bin packing problem (3D-BPP) for modern industrial logistics and robotic automation, highlighting the need for a unified benchmark to evaluate both algorithm performance and real-world applicability [2][31]
- RoboBPP, a comprehensive benchmarking system developed by several academic institutions, addresses these gaps by combining real industrial data, physics simulation, and embodied execution modeling [3][31]

Benchmark System Overview
- RoboBPP includes a physics-based high-fidelity simulator that replicates the industrial bin packing process with real-scale boxes and industrial robotic arms, enabling evaluation of algorithms under realistic conditions [3][12]
- The system provides multiple benchmark categories, including overall algorithm performance rankings and detailed metrics across test settings and datasets [7]

Testing Framework
- The framework consists of three progressive settings: Math Pack (pure geometric placement), Physics Pack (adding physical constraints), and Execution Pack (full embodied execution with robotic operations) [18]
- Each setting assesses algorithm adaptability and robustness under increasing levels of physical realism [17]

Evaluation Metrics
- A multidimensional evaluation system combines traditional metrics with new execution-related indicators such as Collapsed Placement and Dangerous Operation, which capture risks that arise during placement [21][22]
- The scoring system normalizes all metrics into a single comprehensive score, enabling systematic comparison of algorithms [21]

Experimental Results
- The team ran extensive experiments across the three test settings and three datasets, ranking algorithms by overall score and analyzing performance across different industrial scenarios [24][25]
- Algorithms that prioritize compact, efficient space utilization tend to achieve higher occupancy rates, while those that emphasize stability and physical feasibility exhibit lower collapse rates [28][33]

Dataset Diversity
- The real industrial datasets in RoboBPP capture diversity in item sizes, shapes, and arrival sequences, which is critical for evaluating the embodied executability of algorithms [15]
- Three representative task scenarios are covered: a Repetitive Dataset (consistent item sizes), a Diverse Dataset (varied item sizes), and a Wood Board Dataset (irregular shapes) [15]

Conclusion
- RoboBPP is the first comprehensive benchmark for robotic online 3D bin packing that combines real industrial data, physics simulation, and embodied execution assessment, providing a reliable and realistic evaluation framework for future research and industrial applications [31]
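A minimal sketch of how such metric normalization and aggregation typically works. The metric values, the equal weighting, and the min-max scheme here are all assumptions for illustration, not RoboBPP's published formula.

```python
def normalize(values):
    """Min-max normalize one metric across algorithms to [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [1.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical results for three algorithms: occupancy is
# higher-is-better, collapse rate is lower-is-better.
occupancy = [0.72, 0.65, 0.80]
collapse  = [0.10, 0.02, 0.15]

occ_n = normalize(occupancy)
col_n = [1.0 - x for x in normalize(collapse)]  # invert penalty metrics

weights = (0.5, 0.5)  # assumed equal weighting
scores = [weights[0] * o + weights[1] * c for o, c in zip(occ_n, col_n)]
print(scores)
```

Normalizing before aggregating puts volume-style metrics (occupancy) and risk-style metrics (collapse rate) on the same scale, which is what makes a single comprehensive score comparable across algorithms.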
Amazon's AGI Head Departs; Reinforcement Learning Heavyweight Pieter Abbeel Takes Over
机器之心· 2025-12-19 00:21
Core Viewpoint
- Rohit Prasad, Senior Vice President and Chief Scientist of Amazon's AGI team, has announced his departure, marking a significant leadership change in Amazon's AI efforts [1][3][4]

Group 1: Leadership Changes
- Prasad joined Amazon in 2013 and played a central role in developing Alexa and leading the Nova foundation-model project [3][4]
- Following his exit, Amazon will centralize AI research under its cloud computing division (AWS), with Peter DeSantis appointed to lead a new organization reporting directly to CEO Andy Jassy [5][6]

Group 2: AI Development Focus
- Amazon aims to strengthen its AI products to compete with OpenAI, Google, and Anthropic, having launched its own foundation-model series, Nova, and developed custom Trainium AI chips to rival Nvidia [5]
- The new department led by DeSantis will oversee core model development, support for the in-house chip effort, and exploration of quantum computing technologies [10][12]

Group 3: New Appointments
- Pieter Abbeel, a leading AI researcher and co-founder of Covariant, will take over the foundation-model research team, focusing on advancing Amazon's AI research [12][17]
- Abbeel's extensive background in AI and robotics positions him well to drive innovation and collaboration across Amazon's AI initiatives [12][15]

Group 4: Employment Perspectives
- AWS CEO Matt Garman expressed confidence that AI will create more jobs than it displaces, emphasizing the need to nurture new talent for high-value roles [19][20]
- Garman argued that junior developers, who are often the most fluent with AI tools, will play a crucial role in the evolving tech landscape, countering the notion that AI will eliminate entry-level positions [20]
OpenAI's Most Capable Coding Model, GPT-5.2-Codex, Goes Live
机器之心· 2025-12-19 00:21
Core Insights
- OpenAI has released GPT-5.2-Codex, its most advanced coding model, designed for complex software engineering tasks with stronger instruction-following and long-context understanding [1][3]

Group 1: Model Enhancements
- GPT-5.2-Codex improves on GPT-5.2 in instruction adherence and long-context comprehension, excelling particularly at large code changes such as refactoring and migration [3]
- The model shows significant gains in token efficiency on coding tasks, especially at medium and high reasoning levels, and has become a primary tool for the Codex team itself [3]
- Its cybersecurity capabilities have also improved, with GPT-5.2-Codex outperforming all previous OpenAI models in this area [6][7]

Group 2: Performance Metrics
- GPT-5.2-Codex achieved state-of-the-art results on the SWE-Bench Pro and Terminal-Bench 2.0 benchmarks, which assess AI agents in real terminal environments [8][10]
- The model can work efficiently across large codebases and maintain context over long sessions, enabling reliable completion of complex tasks such as large-scale refactoring and feature building [8]

Group 3: Security Applications
- A security researcher used GPT-5.1-Codex-Max and the Codex CLI to discover a vulnerability in React, demonstrating the model family's value in real-world vulnerability research [6][21]
- The process involved using Codex to set up a local testing environment and analyze potential attack surfaces, leading to previously unknown vulnerabilities [22][25]

Group 4: Deployment and Access
- GPT-5.2-Codex is available now to paid ChatGPT users and will reach API users in the coming weeks; OpenAI is also piloting access for invited users and professionals focused on defensive cybersecurity [7]
- OpenAI plans to hold each new model to high cybersecurity capability standards, emphasizing responsible deployment alongside stronger security measures [18][25]
Selling "Productivity" Rather Than "Tools": How BaiRong Cloud Uses "Silicon-Based Employees" to Break the AI Adoption Deadlock
机器之心· 2025-12-18 10:15
Core Viewpoint
- The article discusses AI's transition from a mere tool to a productive workforce, termed "silicon-based employees," arguing for a new business model in which AI vendors are paid for measurable business outcomes rather than software sales [1][3][4]

Group 1: AI Implementation Challenges
- Despite early successful deployments of AI agents in various enterprises, many remain stuck at the demo stage and never integrate into core business processes, exposing a significant gap between AI vendors and enterprise clients [1][2]
- The fundamental issue is misaligned incentives: vendors focus on selling tools, while clients want tangible productivity gains [2][3]

Group 2: New Business Model
- The key to overcoming these challenges is restructuring the commercial contract between AI vendors and enterprise clients, shifting from tool provision to delivering measurable business results [3][12]
- Under the "Result as a Service" (RaaS) model, clients pay according to how well AI agents achieve specific business outcomes [5][7]

Group 3: Technological Advancements
- BaiRong Cloud has made significant progress in AI technology, focusing on practical applications across industries including finance, marketing, and customer service [9][10]
- The company has developed proactive AI models that guide user interactions and eliminate errors, ensuring compliance and strengthening decision-making capabilities [14][15]

Group 4: Real-World Applications
- BaiRong Cloud's "silicon-based employees" have been deployed across multiple sectors, delivering substantial business results such as growth in active customers and management of large-scale marketing campaigns [17][18]
- These AI agents have produced significant cost reductions and efficiency gains across operational processes, including customer service and recruitment [19][20]

Group 5: Future Directions
- The company plans to keep evolving its AI capabilities, focusing on long-horizon task execution and deeper collaboration between human and AI workers [22][24]
- As the technology matures, "silicon-based employees" are expected to move beyond repetitive tasks to become genuine decision-making participants within organizations [24][25]
Speaking Up on the Same Day as Physical Intelligence: 深度机智 Unveils Its "In-Context Data Collection" Trump Card — Is the Generality Ceiling of Embodied AI About to Break?
机器之心· 2025-12-18 10:15
Published via 机器之心

Embodied intelligence's road to generality is blocked by a "data desert." When models rack up high scores in simulators yet repeatedly "crash" in complex real-world scenes, the industry has begun to ask: does the data we feed robots actually capture the essence of human manipulation? Recently, 深度机智 took another important step toward solving the generality problem of embodied AI, grounding the foundations of physical intelligence in real in-context data exemplified by the first-person human view.

The "Data Dilemma" of Embodied Intelligence: The Gulf Between Mechanical Imitation and Logical Understanding

Breakthroughs in embodied-AI generality have always been constrained by the extreme scarcity of physical-world interaction data. Synthetic data and offline teleoperation provide initial nourishment, but low collection efficiency, homogeneous scenes, and weak task realism make models prone to overfitting: robots often merely memorize specific trajectories rather than learning manipulation logic that generalizes.

This industry pain point is being addressed by the In-Context Data Collection paradigm that 深度机智 has long advocated, centered on first-person human experience. The paradigm holds that data should not be isolated action clips, but logical streams carrying rich environmental context and causal relationships.

As the first high-tech company incubated by the Beijing Zhongguancun Academy and the Zhongguancun Institute of Artificial Intelligence (hereafter "the two Zhongguancun institutions"), 深度机智 has, since its founding preparations at the end of last year and with the two institutions' support, been deeply engaged in work centered on first-person human data toward physical ...
SIGGRAPH Asia 2025 | Create and Render High-Quality 3D Digital Humans with Nothing but a Smartphone
机器之心· 2025-12-18 10:15
Core Insights
- The article discusses advances in 3D digital human reconstruction and rendering, focusing on HRM²Avatar, a system developed by Taobao Technology's Meta team that creates high-fidelity, real-time 3D digital humans from nothing more than a smartphone capture [4][5][6]

Group 1: Technology Overview
- HRM²Avatar is designed for high-fidelity, real-time 3D digital human reconstruction and rendering, using a two-stage capture procedure and combining an explicit clothing mesh representation with Gaussian-based dynamic detail modeling [12][36]
- The system reconstructs body shape, clothing structure, and detailed appearance from ordinary smartphone footage, balancing visual realism, cross-pose consistency, and real-time mobile rendering [6][12]

Group 2: Methodology
- Capture proceeds in a static phase and a dynamic phase: users hold a fixed pose for the static scan and perform natural movements for the dynamic scan, giving the system the signals it needs for reconstruction and dynamic modeling [18][28]
- A hybrid representation attaches Gaussian points to the clothing mesh, providing controllable parameters for pose-dependent deformation and lighting modeling [40][46]

Group 3: Performance Evaluation
- On mobile devices, HRM²Avatar sustains stable real-time performance with roughly 530,000 Gaussian points: 2K resolution at 120 FPS on the iPhone 15 Pro Max and 2K at 90 FPS on Apple Vision Pro [87][89]
- Comparative evaluations show HRM²Avatar outperforming existing methods in static reconstruction quality and in appearance consistency under pose variation, as reflected in higher PSNR and SSIM scores [76][80]
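For reference, PSNR, one of the two reconstruction-quality metrics cited, can be computed as below. The "images" here are toy flattened pixel lists invented for illustration, not data from the paper.

```python
import math

def psnr(img_a, img_b, max_val=255.0):
    """Peak signal-to-noise ratio between two equally sized images:
    10 * log10(MAX^2 / MSE). Higher means a closer reconstruction."""
    mse = sum((a - b) ** 2 for a, b in zip(img_a, img_b)) / len(img_a)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * math.log10(max_val ** 2 / mse)

# Toy flattened 8-bit "images" (hypothetical pixel values).
ref    = [120, 130, 140, 150]
render = [121, 129, 141, 149]
print(round(psnr(ref, render), 2))
```

SSIM, the companion metric, additionally models local structure (luminance, contrast, covariance) rather than raw per-pixel error, which is why papers usually report both.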
Group 4: Future Directions
- The article notes the ongoing need for optimization, particularly for complex clothing structures and extreme lighting conditions, while positioning HRM²Avatar as a significant milestone in bringing high-quality digital humans to ordinary users [90]
No More Gacha Pulls: Hands-On with ByteDance's Newly Released Video Model Seedance 1.5 pro
机器之心· 2025-12-18 09:08
Core Viewpoint
- The article covers the launch of Seedance 1.5 pro, the latest video generation model from Volcano Engine, highlighting its advanced audio-visual synchronization and multi-language support, which significantly improve video generation quality and user experience [2][5][46]

Group 1: Model Features and Capabilities
- Seedance 1.5 pro achieves native audio-visual synchronization across a range of sound types, with a synchronization rate that leads globally [5]
- The model follows complex instructions and supports multiple languages and dialects, improving narrative coherence and emotional expression in generated videos [5][13]
- In video capability assessments, Seedance 1.5 pro outperforms competitors on multiple metrics, including prompt alignment, aesthetic quality, and audio generation [6][53]

Group 2: User Experience and Applications
- Seedance 1.5 pro is available to enterprise users via API and to individual users through the Jimeng web app and the Doubao App [8]
- The model adheres closely to user prompts, often producing a usable result on the first attempt and reducing the need for repeated generations [43]
- It suits daily content creation, lightweight commercial advertising, and AI short-film production, indicating broad versatility across creative contexts [44]

Group 3: Technical Innovations
- Key innovations include a unified multi-modal generation framework and a comprehensive audio-visual data pipeline, which together raise the quality of generated content [49][50]
- Training combines supervised fine-tuning with reinforcement learning from human feedback, yielding significant gains in motion quality and audio fidelity [52]
- The inference stage has been optimized for more than tenfold acceleration in generation while preserving model performance [52]

Group 4: Industry Impact and Future Outlook
- The advances in Seedance 1.5 pro reflect the rapid evolution of video generation technology from academic research into practical tools for everyday users [58]
- Its capabilities are expected to narrow the gap between AI-generated content and professional video production, broadening its applicability in creative industries [59]
- The industry anticipates further progress in video generation models, with more sophisticated outputs expected in the coming years [60][61]