Mistral Open-Sources Again! Code Model Devstral 2 and a Native CLI Released, but Large Companies Are Restricted from Commercial Use
机器之心· 2025-12-10 05:10
Core Insights
- Mistral AI has launched its next-generation code model series, Devstral 2, which includes two models: Devstral 2 (123B parameters) and Devstral Small 2 (24B parameters) [1][2]
- The rapid release of these models, following the Mistral 3 series, indicates strong momentum in the European AI landscape [4]
- Mistral AI's expansion in Europe and the return of Turing Award winner Yann LeCun to Europe for entrepreneurship suggest a promising future for AI in the region [5]

Model Highlights
- Devstral 2 is a state-of-the-art (SOTA) programming model with 123 billion parameters and a 256K context window; its score of 72.2% on SWE-bench Verified establishes it as one of the best open-weight models [9]
- Devstral Small 2, with 24 billion parameters, scored 68.0% on SWE-bench Verified, demonstrating competitive performance while being lightweight enough for local deployment on consumer-grade hardware [10][11]
- Devstral 2 outperforms DeepSeek V3.2 with a win rate of 42.8% against a loss rate of 28.6%, although it still trails the closed-source model Claude Sonnet 4.5 [15]

Licensing and Usage
- Devstral 2 is released under a modified MIT license, which includes a revenue-cap clause that restricts companies with a global consolidated monthly revenue exceeding $20 million from exercising rights under this license [18][21]
- Companies exceeding this revenue threshold must contact Mistral AI for commercial licensing or use its paid API services [22]

Mistral Vibe CLI
- Mistral Vibe CLI is an open-source command-line coding assistant powered by Devstral, allowing users to explore, modify, and execute changes across entire codebases using natural language [24][25]
- The CLI offers features such as file operations, code search, version control, and command execution, enhancing developer productivity [26][29]
- Mistral Vibe CLI integrates with existing development environments and is available as an extension for IDEs such as Zed [30][31]

Deployment Recommendations
- Devstral 2 is optimized for data-center GPUs, requiring at least four H100-level GPUs for deployment, while Devstral Small 2 is designed for single-GPU operation and can run on a variety of NVIDIA systems [33][34]
- Devstral Small 2 can also operate on consumer-grade GPUs and on CPU-only configurations without dedicated GPUs, making it accessible to a wider range of users [35]
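The revenue-cap clause in the Licensing and Usage section reduces to a simple threshold test. The sketch below is an illustrative reading of the summarized clause only, not legal guidance; the constant and function name are invented for the example:

```python
# Illustrative reading of the summarized clause: companies above a
# $20M global consolidated monthly revenue cap cannot exercise rights
# under the modified MIT license. Not legal guidance.
MONTHLY_REVENUE_CAP_USD = 20_000_000

def may_use_open_license(monthly_revenue_usd: float) -> bool:
    """True if a company may use Devstral 2 under the open license."""
    return monthly_revenue_usd <= MONTHLY_REVENUE_CAP_USD
```

Companies above the cap would instead negotiate a commercial license with Mistral AI or use its paid API.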
Hands-On Test | Zhipu's AutoGLM Goes Open Source: The "Android Moment" for AI Phones Officially Arrives
机器之心· 2025-12-10 05:10
Core Viewpoint
- The article discusses the launch of Open-AutoGLM, an open-source AI assistant framework that enables users to automate tasks on their smartphones using natural language commands, marking a significant advancement in AI technology and user interaction [6][10][42]

Group 1: Introduction to AutoGLM
- AutoGLM is a project developed by Zhipu AI, aiming to create an intelligent agent that can not only "speak" but also "act" on smartphones, representing a milestone in AI's ability to use tools [12]
- The framework consists of a Phone Agent and a 9B model, AutoGLM-Phone-9B, which allows for complex task automation through voice and touch commands [6][19]

Group 2: Technical Implementation
- The Phone Agent relies on three core technologies: ADB (Android Debug Bridge) for device control, a vision-language model (VLM) for understanding screen content, and intelligent planning for task execution [17][18][19]
- AutoGLM's ability to analyze UI layouts and perform actions like a human is a key feature that distinguishes it from traditional automation scripts [12][31]

Group 3: Practical Applications
- The article provides examples of AutoGLM successfully executing tasks such as sending messages and updating applications, demonstrating its robust performance and adaptability [22][28][30]
- AutoGLM can handle multi-step operations and interact with various applications, showcasing its versatility as an AI assistant [33]

Group 4: Open Source and Privacy
- The open-source nature of Open-AutoGLM allows developers and users to run the AI model locally, ensuring data privacy and transparency [36][39]
- This approach contrasts with existing AI assistants that often rely on cloud processing, which raises concerns about data security [37][38]

Group 5: Industry Impact
- The launch of Open-AutoGLM is seen as a potential turning point in the AI assistant market, democratizing access to advanced automation tools and reducing reliance on proprietary platforms [39][42]
- The article suggests that this development could lead to a new era of human-computer interaction, where AI assistants become integral to everyday tasks [42]
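The perceive-plan-act loop behind a phone agent of this kind can be sketched in a few lines. This is a toy mock, not the Open-AutoGLM API: the observation format, element ids, and planner logic are invented for illustration; a real agent would capture the screen over ADB and drive taps with `adb shell input tap`:

```python
import subprocess

def observe():
    # Mocked screen state; a real agent would capture it via ADB, e.g.
    # subprocess.run(["adb", "exec-out", "screencap", "-p"], capture_output=True)
    # and parse the pixels with the VLM (AutoGLM-Phone-9B).
    return {"screen": "chat_list",
            "elements": [{"id": "contact_alice", "center": (120, 340)}]}

def plan(goal, obs):
    # Stand-in for the VLM planner: map (goal, screen state) to the next action.
    if obs["screen"] == "chat_list" and "Alice" in goal:
        x, y = obs["elements"][0]["center"]
        return {"type": "tap", "x": x, "y": y}
    return {"type": "done"}

def act(action, dry_run=True):
    # With dry_run=False this would drive a connected Android device over ADB.
    if action["type"] == "tap" and not dry_run:
        subprocess.run(["adb", "shell", "input", "tap",
                        str(action["x"]), str(action["y"])])
    return action

step = act(plan("open the chat with Alice", observe()))
```

The loop repeats observe-plan-act until the planner emits a terminal action, which is what lets the agent recover from unexpected screens instead of replaying a fixed script.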
Percept-WAM: An Autonomous-Driving Brain That Truly "Understands the World", a Unified Model from Perception to Action
机器之心· 2025-12-10 02:09
Core Viewpoint
- The article discusses the limitations of current large visual language models (VLMs) in autonomous driving, emphasizing the need for enhanced spatial perception and geometric understanding to support robust decision-making in real-world scenarios [2][3]

Group 1: Model Introduction
- A new model named Percept-WAM (Perception-Enhanced World-Awareness-Action Model) has been proposed, aiming to integrate perception, world awareness, and vehicle action into a cohesive framework for autonomous driving [3][4]
- Percept-WAM is designed to create a complete link from perception to decision-making, addressing the shortcomings of existing models that struggle with real-world complexities [3][4]

Group 2: Model Architecture
- The architecture of Percept-WAM incorporates a general reasoning VLM backbone while introducing World-PV and World-BEV tokens to unify 2D/3D perception representations [5]
- The model employs a grid-conditioned prediction mechanism and IoU-aware confidence outputs to enhance the accuracy and efficiency of its outputs, along with a lightweight action decoding head for efficient trajectory prediction [5][6]

Group 3: Training Tasks
- Percept-WAM is trained using multi-view streaming video, LiDAR point clouds (optional), and text queries, optimizing various tasks such as 2D detection, instance segmentation, semantic segmentation, and 3D detection [6][9]
- The model's training approach allows for joint optimization across multiple tasks, enhancing overall performance through shared geometric and semantic information [23]

Group 4: Performance Evaluation
- In public benchmarks, Percept-WAM demonstrates competitive performance in PV perspective perception, BEV perspective perception, and end-to-end trajectory planning compared to existing models [21][30]
- In the PV perspective, Percept-WAM achieves 49.9 mAP in 2D detection, surpassing specialized models like Mask R-CNN [22][24]
- In the BEV perspective, the model achieves 58.9 mAP in 3D detection, outperforming traditional BEV detection methods [27][28]

Group 5: Confidence Prediction
- The introduction of IoU-based confidence prediction significantly improves the alignment between predicted confidence scores and actual localization quality, enhancing the reliability of dense detection [25]

Group 6: Decision-Making Integration
- Percept-WAM integrates World-Action tokens for action and trajectory prediction, allowing a seamless transition from world modeling to decision output, thus aligning perception and planning in a unified representation space [16][17]
- The model employs a query-based trajectory prediction method that leverages multiple feature groups, enhancing the efficiency and accuracy of trajectory planning [19]

Group 7: Future Implications
- Percept-WAM represents a forward-looking evolution in autonomous driving, emphasizing the importance of a unified model that can perceive, understand, and act within the world, moving beyond traditional models that merely process language [41]
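The IoU-aware confidence idea can be made concrete: instead of supervising the confidence head with a binary objectness target, supervise it with the IoU between the predicted and ground-truth boxes, so a high score implies a well-localized box. A minimal sketch (box format and function names are illustrative, not the paper's code):

```python
def iou(a, b):
    # Boxes as (x1, y1, x2, y2) in the same coordinate frame.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def confidence_target(pred_box, gt_box):
    # IoU-aware supervision: the confidence head regresses localization
    # quality rather than a 0/1 objectness label.
    return iou(pred_box, gt_box)

conf = confidence_target((0, 0, 2, 2), (1, 1, 3, 3))  # partially overlapping boxes
```

Training against this target is what aligns predicted confidence with actual localization quality in dense detection.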
Large Models Now "Have a Heart": Echo-N1, the First Emotional Large Model, with 32B Outperforming 200B
机器之心· 2025-12-10 02:09
Core Insights
- The article discusses the breakthrough of Team Echo in developing the first emotional large model, Echo-N1, which successfully applies reinforcement learning (RL) to the subjective domain of emotions, overcoming the limitations of traditional models [3][10]

Group 1: Emotional Model Challenges
- Traditional large language models (LLMs) struggle with emotional understanding, often providing generic responses that lack depth [2]
- Existing models face three main issues: inability to quantify emotions, reward hacking leading to superficial responses, and evaluation distortion where models cannot distinguish human-like expressions from AI-generated ones [7][8]

Group 2: Innovations in Emotional Training
- Team Echo introduced a new training method that incorporates a "heart" into RL, resulting in Echo-N1 achieving a success rate of 46.7% in emotional tasks, significantly outperforming other models [10]
- The team proposed an "Empathy Psychophysical Model" (EPM) that quantifies empathy, transforming it into a calculable physical process [19][22]

Group 3: Generative Reward Model
- Echo-N1 utilizes a generative reward model that requires the model to generate a logical emotional reasoning path before producing responses, enhancing the accuracy of emotional feedback [14][15]
- The model incorporates human-likeness rewards and empathy rewards to ensure responses are context-aware and resonate with users' emotional needs [16]

Group 4: Evaluation and Performance
- The evaluation of AI empathy has shifted from static scoring to dynamic interaction assessments, with EPM providing a scientific measure for empathy and healing [18][19]
- In rigorous testing, the base model Qwen3-32B failed with a 0% success rate, while Echo-N1 excelled, demonstrating the necessity of specialized training for genuine empathetic capabilities [26][30]

Group 5: Future Implications
- The emergence of Echo-N1 indicates that AI's emotional intelligence can be quantified and optimized, paving the way for more emotionally aware AI companions [37][39]
- This research opens new possibilities for applying RL in subjective and unquantifiable areas, potentially transforming AI interactions into more meaningful experiences [38]
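A reward design of this shape can be sketched as a gated weighted sum: no reward is granted unless the response comes with an explicit emotional reasoning path, which works against the reward hacking described above. The gating rule, weights, and function name are illustrative assumptions, not the Echo-N1 training code:

```python
def echo_reward(reasoning_path: str, empathy_score: float,
                human_likeness: float,
                w_empathy: float = 0.6, w_human: float = 0.4) -> float:
    # Gate: no emotional reasoning trace, no reward. This discourages
    # superficial replies that skip the reasoning step.
    if not reasoning_path.strip():
        return 0.0
    # Weighted blend of the empathy reward and the human-likeness reward.
    return w_empathy * empathy_score + w_human * human_likeness
```

In an RL loop the policy would be optimized against this scalar, so the gate and the two reward terms jointly shape what the model learns to say.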
Light-X Is Here! The World's First 4D Video Generation Framework with Dual "Camera × Lighting" Control, Turning Monocular Video Cinematic in Seconds
机器之心· 2025-12-09 08:41
Core Insights
- The article introduces Light-X, the world's first 4D video generation framework that allows dual control of camera movement and lighting in single-view videos, enabling users to re-direct shots and adjust lighting conditions post-capture [2][32]
- Light-X addresses the challenge of simultaneously controlling both camera trajectory and lighting, which has not been effectively solved in previous research [7][32]

Research Background
- The visual experience of the real world is composed of geometry, motion, and lighting, while single-view videos are merely 2D projections of this complex four-dimensional space [5]
- The ability to control camera position and lighting conditions after filming can significantly enhance applications in film production, virtual shooting, and AR/VR content generation [5]

Methodology
- Light-X's core approach decouples camera control from lighting control, then integrates them within a diffusion model to achieve dual controllability in single-view videos [10][32]
- The framework constructs two branches from the input video: one for dynamic point clouds (camera control) and another for re-lighting point clouds (lighting control), successfully decoupling these factors during modeling [11]

Data Construction
- Light-X requires paired geometric-alignment, multi-lighting, multi-view training data, which is scarce in the real world; to address this, Light-Syn was developed to automatically synthesize training data from single-view videos [15][32]
- The data pipeline incorporates various video sources to ensure the model learns realistic motion structures and adapts to diverse lighting styles [19]

Experimental Results
- Light-X was evaluated on two core tasks, joint control of camera and lighting, and video re-lighting, outperforming existing methods in image quality, video smoothness, and user preference [25][32]
- In the joint control task, Light-X achieved an FID score of 101.06, significantly better than previous methods, demonstrating superior image quality and user satisfaction [27]

Ablation Studies
- Ablation studies indicate that multi-source data is crucial for enhancing novel-view quality, motion stability, and lighting diversity, while fine-grained lighting cues and global lighting control improve consistency and stability [30][32]

Conclusion
- Light-X represents a significant advancement in video generation technology, enabling simultaneous control of camera movement and lighting, with extensive experimental validation showing its superiority over existing methods [32]
Horizon Robotics Unveils Its BPU "Riemann" Architecture for the First Time, Rebuilding AI Computing with Mathematical Manifolds
机器之心· 2025-12-09 08:41
Core Viewpoint
- The article discusses the evolution and advancements of Horizon Robotics under the leadership of founder Yu Kai, highlighting the transition from digital intelligence to physical intelligence and the development of new AI architectures and algorithms aimed at enhancing autonomous driving capabilities and robotics [1][2]

Group 1: Company Evolution and Milestones
- In December 2012, a significant auction for a deep learning team took place, where Yu Kai represented Baidu, competing against Google, Microsoft, and DeepMind, marking a pivotal moment in AI history [1]
- Horizon Robotics was officially registered on July 14, 2015, coinciding with NASA's New Horizons mission, symbolizing the company's commitment to reaching new heights in AI computing [2]

Group 2: Technological Advancements
- At the 2025 Horizon Technology Ecosystem Conference, Horizon Robotics unveiled its full-scene intelligent driving (HSD) production capabilities and introduced the new BPU "Riemann" architecture, aiming to build a foundational "Wintel" ecosystem for physical AI [4]
- The BPU architecture has evolved significantly, with a tenfold increase in key operator performance and a tenfold increase in high-precision operator support, targeting L4/L5 level autonomous driving [7]
- The new Riemann architecture aims to simplify complex real-world structures, enhancing efficiency and performance in AI applications [7]

Group 3: Compiler Innovations
- Horizon Robotics introduced the fourth-generation compiler "OpenExplorer 4.0," which incorporates AI-driven optimization strategies, including reinforcement learning and Monte Carlo tree search, to overcome traditional compiler limitations [8][9]
- The new compiler has reduced compilation time from hours to minutes and improved model performance by 20%, optimizing end-to-end latency in HSD applications [12]

Group 4: Business Model Transformation
- Horizon Robotics launched the HSD Together model, transitioning from a traditional chip-selling approach to providing a comprehensive algorithm service, allowing partners to leverage a validated intelligent driving system [13][14]
- This model aims to significantly reduce costs and time for partners, enabling them to focus on integration and differentiation while lowering operational expenses by up to 90% [14]

Group 5: Market Accessibility
- Horizon Robotics aims to democratize advanced driver-assistance systems, targeting the 100,000 RMB market segment, which constitutes a significant portion of the Chinese automotive market [16]
- The company demonstrated that a single Journey 6M chip can effectively handle complex urban driving scenarios, emphasizing cost-effectiveness and adaptability for both electric and traditional fuel vehicles [18][19]

Group 6: Robotics Ecosystem Development
- Yu Kai emphasized the importance of mastering autonomous driving as a foundation for future robotics, defining intelligent driving models as the starting point for physical AI [21]
- Horizon Robotics introduced open-source initiatives for its robotics business, including HoloMotion and HoloBrain, aimed at enhancing motion and operational intelligence in robots [20][25]
- HoloMotion has been made available on GitHub, with institutions like Stanford and Tsinghua already utilizing it, indicating strong interest in developing embodied intelligence [27]
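Search-based compilation of the kind attributed to OpenExplorer 4.0 can be illustrated with a toy cost model: the compiler searches over pass orderings for the schedule with the lowest predicted latency. Exhaustive search stands in here for the reinforcement-learning/MCTS strategies the article describes; the pass names and cost table are invented:

```python
from itertools import permutations

# Invented pairwise cost model: latency contributed by running pass `a`
# immediately before pass `b` (order matters, so scheduling is a search).
COST = {("fuse", "tile"): 1, ("tile", "fuse"): 5,
        ("tile", "vectorize"): 1, ("vectorize", "tile"): 5,
        ("fuse", "vectorize"): 3, ("vectorize", "fuse"): 5}

def latency(order):
    # Sum the cost of each adjacent pair in the schedule.
    return sum(COST.get(pair, 0) for pair in zip(order, order[1:]))

def best_schedule(passes):
    # Brute force over all orderings; a production compiler would use
    # MCTS or a learned policy to prune this combinatorial space.
    return min(permutations(passes), key=latency)

best = best_schedule(["fuse", "tile", "vectorize"])
```

The search space grows factorially with the number of passes, which is why learned search strategies pay off over hand-written heuristics at scale.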
Google's TPU Goes on a Tear: Production Up 120%, 4x the Performance per Dollar; Can Nvidia Hold Its Ground?
机器之心· 2025-12-09 08:41
Core Viewpoint
- Google's TPU is set to disrupt Nvidia's dominance in the AI chip market, with significant production increases and cost advantages for inference tasks [2][4][79]

Group 1: TPU Production and Market Strategy
- Morgan Stanley predicts that Google's TPU production will surge to 5 million units by 2027 and 7 million by 2028, a substantial increase from previous estimates of 3 million and 3.2 million units, representing upward adjustments of 67% and 120% respectively [2]
- Google aims to sell TPUs to third-party data centers, complementing its Google Cloud Platform (GCP) business, while still utilizing most TPUs for its own AI training and cloud services [2][3]

Group 2: Comparison with Nvidia's GPU
- Nvidia has historically dominated the AI chip market, controlling over 80% of it by 2023, but faces challenges as the market shifts from training to inference, where Google's TPU offers superior efficiency and cost advantages [8][12]
- By 2030, inference is expected to consume 75% of AI computing resources, creating a market worth $255 billion, growing at a CAGR of 19.2% [8][52]

Group 3: Cost and Efficiency Advantages of TPU
- Google's TPU is designed for inference, providing a cost per hour of $1.38 compared to Nvidia's H100 at over $2.50, making TPU 45% cheaper [20]
- TPU's performance in inference tasks is four times better per dollar spent compared to Nvidia's offerings, and it consumes 60-65% less power [20][22]

Group 4: Industry Trends and Client Migration
- Major AI companies are transitioning from Nvidia GPUs to Google's TPUs to reduce costs significantly; for instance, Midjourney reported a 65% reduction in costs after switching to TPU [34]
- Anthropic has committed to a deal for up to 1 million TPUs, highlighting the growing trend of companies seeking cost-effective solutions for AI workloads [35]

Group 5: Future Implications for Nvidia
- Nvidia's profit margins, currently between 70-80%, may face pressure as Google captures even a small portion of the inference workload, potentially leading to over $6 billion in annual profit loss for Nvidia [22][59]
- The shift toward TPUs indicates a broader trend of companies diversifying their AI infrastructure and reducing reliance on Nvidia's products [67]
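The "45% cheaper" claim follows directly from the quoted hourly rates ($1.38 for TPU versus $2.50 for an H100); a two-line check:

```python
# Hourly rates as quoted in the article (USD per hour).
tpu_hourly, h100_hourly = 1.38, 2.50

# Fractional savings from switching to TPU at these rates.
savings = 1 - tpu_hourly / h100_hourly  # 0.448, i.e. about 45% cheaper
```

Note the H100 figure is a lower bound ("over $2.50"), so the actual savings at these list prices would be at least this large.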
No Remote Control, and Thrown into the Wilderness: It's Time for Embodied AI to Be "Weaned"
机器之心· 2025-12-09 03:17
Core Viewpoint
- The article discusses the challenges faced by humanoid robots in real-world scenarios, emphasizing that their capabilities have been overestimated and that significant advancements are still required for practical applications [11][61]

Group 1: Robot Performance in Real-World Scenarios
- Humanoid robots struggle with tasks in outdoor environments, often failing to perform basic functions without remote control [9][11]
- The ATEC 2025 competition highlighted the limitations of robots in navigating complex terrains and performing tasks autonomously, with many relying on remote operation [30][32]
- Successful completion of tasks by some teams demonstrated that traditional methods combined with advanced technology can yield better results than relying solely on large models [26][50]

Group 2: Technical Challenges
- Robots face significant difficulties in perception and decision-making, particularly in varying light conditions that affect their sensors [14][21]
- The complexity of physical interactions, such as grasping objects with different textures and colors, poses a challenge for robots due to their lack of tactile feedback [23][56]
- The integration of various computational units (CPU, GPU, NPU) in a compact and efficient manner remains a significant hurdle for robotic systems [52][56]

Group 3: Future Directions and Industry Insights
- Experts believe that for robots to be integrated into human environments, they must develop capabilities in mobility, manipulation, and environmental modification [61][66]
- The article suggests that failures in robotic tasks are essential for progress, as they reveal weaknesses that need to be addressed for future advancements [65][66]
- The future of artificial general intelligence (AGI) is expected to involve a deeper integration of machine intelligence with the physical world, moving beyond data recognition to environmental interaction and action execution [66]
Snapchat Proposes Canvas-to-Image: A Single Canvas Integrating Identity, Pose, and Layout
机器之心· 2025-12-09 03:17
Core Viewpoint
- Canvas-to-Image is a new framework for compositional image generation that integrates various control signals into a single canvas, simplifying the image generation process by allowing users to provide multiple types of control information simultaneously [2][9][31]

Group 1: Traditional Control Limitations
- Traditional image generation methods utilize independent input paths for identity references, pose sketches, and layout boxes, leading to a fragmented and lengthy process [7][8]
- Users are unable to overlay multiple control signals in the same area of an image, which restricts the complexity of scene construction [8][9]

Group 2: Canvas-to-Image Methodology
- The Canvas-to-Image framework consolidates all control signals onto a single canvas, allowing the model to interpret and execute them within the same pixel space [9][10]
- The multi-task canvas serves as both the user interface and the model's input, enabling the integration of heterogeneous visual symbols and their spatial relationships [14]

Group 3: Training and Inference Process
- During training, the model learns from cross-frame image sets, which introduce significant variations in pose, lighting, and expression, preventing it from relying on simple copy mechanisms [15]
- In the inference phase, users can flexibly combine multiple control modalities on the same canvas, allowing for complex scene generation without switching between different modules [16]

Group 4: Experimental Results
- Canvas-to-Image can simultaneously handle identity, pose, and layout-box controls, outperforming baseline methods that often fail under similar conditions [18]
- The model maintains spatial and semantic relationships between characters and objects, generating scenes with natural interactions and coherence even under complex control settings [20][21]

Group 5: Conclusion
- The core value of Canvas-to-Image lies in its ability to visualize multi-modal generation controls, making complex scene construction intuitive through direct manipulation on the canvas [31]
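The single-canvas idea can be sketched with a toy grid: each control signal (identity crop, pose sketch, layout box) is drawn into one shared pixel space, and later signals may overlay earlier ones in the same region, which is exactly what separate-input-path designs cannot express. The marks and coordinates are invented for illustration:

```python
def compose_canvas(height, width, controls):
    # Blank canvas; "." marks empty cells.
    canvas = [["." for _ in range(width)] for _ in range(height)]
    for ctrl in controls:  # drawn in order, so later controls overlay earlier ones
        top, left, h, w = ctrl["box"]
        for dy in range(h):
            for dx in range(w):
                canvas[top + dy][left + dx] = ctrl["mark"]
    return canvas

controls = [
    {"mark": "I", "box": (0, 0, 2, 2)},  # identity reference crop
    {"mark": "P", "box": (1, 1, 2, 2)},  # pose sketch overlapping the identity region
]
canvas = compose_canvas(4, 4, controls)
```

In the real framework the canvas holds actual image content and visual symbols rather than labels, and the model learns to interpret overlapping signals jointly rather than last-writer-wins.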
Are Full Images and Slices Not Equivalent? LLaVA-UHD-v3 Reveals the Difference and Introduces an Efficient Full-Image Modeling Scheme
机器之心· 2025-12-09 03:17
Core Insights
- The article discusses advancements in multimodal large models (MLLMs) and the introduction of LLaVA-UHD v3, which addresses the challenge of efficiently processing high-resolution images while maintaining global understanding capabilities [2][3][10]

Group 1: Introduction of LLaVA-UHD v3
- LLaVA-UHD v3 introduces a new progressive visual compression framework (PVC) consisting of two core components: Refined Patch Embedding (RPE) and Windowed Token Compression (WTC) [4][10]
- The PVC framework significantly reduces the number of visual tokens while preserving global semantic consistency, enhancing the efficiency of native high-resolution visual encoding [4][10]

Group 2: Comparison of Encoding Methods
- The research team conducted a fair comparison between slice-based encoding (SBE) and global native-resolution encoding (GNE) using the same model architecture, training data, and evaluation protocols [5]
- GNE demonstrated a notable advantage in spatial perception and localization tasks, with an average improvement of approximately 11.0% over SBE [6]
- In general vision-language understanding tasks, GNE outperformed SBE by about 2.1%, indicating that GNE is more suitable for tasks requiring spatial awareness and high-resolution understanding [7]

Group 3: Efficiency and Performance of LLaVA-UHD v3
- The PVC architecture allows for a significant reduction in computational load while maintaining model capabilities, achieving a 2.4× acceleration over MoonViT and running 1.9× faster than Qwen2.5-ViT [16]
- LLaVA-UHD v3 was trained on approximately 20 million image-text pairs, significantly fewer than competitors like Qwen2-VL (700 million) and MiniCPM-V2.6 (460 million), yet it remains highly competitive across various vision-language benchmarks [17]
- The model achieved a visual token compression rate of 64×, surpassing competitors, while still performing comparably or better on tasks requiring fine-grained visual information [17]

Group 4: Future Directions
- The article emphasizes the need for further exploration of visual-encoding pre-training strategies suitable for multimodal tasks and the gradual introduction of linear-complexity operators to replace traditional quadratic-complexity attention mechanisms [20]
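The 64× compression rate quoted above follows from window-wise token compression: merging each 8×8 window of the visual token grid into a single token reduces the count by a factor of 64. The pure-Python toy below uses average pooling as the merge operation purely for illustration (the actual WTC module is learned); it shows the token-count arithmetic:

```python
def window_compress(tokens, h, w, win=8):
    # tokens: h*w feature vectors in row-major order; each window of
    # win*win tokens is merged (here, averaged) into one output token.
    assert h % win == 0 and w % win == 0
    dim = len(tokens[0])
    out = []
    for by in range(0, h, win):
        for bx in range(0, w, win):
            acc = [0.0] * dim
            for dy in range(win):
                for dx in range(win):
                    vec = tokens[(by + dy) * w + (bx + dx)]
                    for i in range(dim):
                        acc[i] += vec[i]
            out.append([a / (win * win) for a in acc])
    return out

grid = [[1.0, 2.0] for _ in range(16 * 16)]  # toy 16x16 grid of 2-d tokens
compressed = window_compress(grid, 16, 16)   # 256 tokens -> 4 tokens (64x)
```

Since attention cost grows quadratically in token count, a 64× reduction in visual tokens is what makes native high-resolution encoding tractable for the language model.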