机器之心
Percept-WAM: An Autonomous Driving Brain That Truly "Sees the World", a Unified Model from Perception to Action
机器之心· 2025-12-10 02:09
Over the past few years, a saying has circulated in the autonomous driving community: "large models can talk, but they can't drive." On one hand, large vision-language models (VLMs) have advanced rapidly in text understanding and logical reasoning; on the other hand, once they are put on real roads and asked to handle long-tail scenarios, distant targets, and complex multi-agent interactions, these "clever brains" often make elementary mistakes: they see poorly, localize inaccurately, and respond unreliably. The deeper reason is that the spatial perception and geometric understanding of existing VLMs lag far behind their semantic "expressive power."

To make large models truly "see the world," many existing approaches add "perception QA" questions to training, such as "Is there a car ahead on the left?" or "How far apart are those two vehicles?" But this kind of supervision stays largely at the level of semantic labels and coarse relative relationships; it does not teach the model the strong 2D/3D perception needed for control and decision-making, such as precise, stable detection boxes, segmentation results, and BEV perception. In other words, many of today's VLA models can "answer questions about the world" but cannot "actually see the world clearly." Such weakly perceiving large models clearly cannot meet the demands that autonomous driving, and embodied intelligence more broadly, place on spatial understanding.

Recently, a research team from 引望智能 (Yinwang Intelligent) and Fudan University jointly proposed a new-generation large model for autonomous driving, Percept-WAM (Percept ...
Large Models Now "Have a Heart": Echo-N1, the First Emotional Large Model, with 32B Beating 200B
机器之心· 2025-12-10 02:09
Core Insights
- The article discusses the breakthrough of Team Echo in developing the first emotional large model, Echo-N1, which successfully applies reinforcement learning (RL) to the subjective domain of emotions, overcoming the limitations of traditional models [3][10].

Group 1: Emotional Model Challenges
- Traditional large language models (LLMs) struggle with emotional understanding, often providing generic responses that lack depth [2].
- Existing models face three main issues: inability to quantify emotions, reward hacking leading to superficial responses, and evaluation distortion where models cannot distinguish human-like expressions from AI-generated ones [7][8].

Group 2: Innovations in Emotional Training
- Team Echo introduced a new training method that incorporates a "heart" into RL, resulting in Echo-N1 achieving a success rate of 46.7% in emotional tasks, significantly outperforming other models [10].
- The team proposed an "Empathy Psychophysical Model" (EPM) that quantifies empathy, transforming it into a calculable physical process [19][22].

Group 3: Generative Reward Model
- Echo-N1 utilizes a generative reward model that requires the model to generate a logical emotional reasoning path before producing responses, enhancing the accuracy of emotional feedback [14][15].
- The model incorporates human-like rewards and empathy rewards to ensure responses are context-aware and resonate with users' emotional needs [16].

Group 4: Evaluation and Performance
- The evaluation of AI empathy has shifted from static scoring to dynamic interaction assessments, with EPM providing a scientific measure for empathy and healing [18][19].
- In rigorous testing, the base model Qwen3-32B failed with a 0% success rate, while Echo-N1 excelled, demonstrating the necessity of specialized training for genuine empathetic capabilities [26][30].

Group 5: Future Implications
- The emergence of Echo-N1 indicates that AI's emotional intelligence can be quantified and optimized, paving the way for more emotionally aware AI companions [37][39].
- This research opens new possibilities for applying RL in subjective and hard-to-quantify areas, potentially transforming AI interactions into more meaningful experiences [38].
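The summary does not give Echo-N1's actual reward formulation. Purely as an illustration of how a human-likeness reward and an empathy reward might be blended into a single RL signal, here is a minimal Python sketch; the weights, scoring heuristics, and function names are assumptions, not Team Echo's design.

```python
# Minimal sketch of a combined reward for emotion-focused RL training.
# The weighting scheme and the placeholder scorers are illustrative assumptions;
# the article does not disclose Echo-N1's actual reward formulation.
from dataclasses import dataclass


@dataclass
class RewardWeights:
    empathy: float = 0.6      # how well the reply addresses the user's emotional need
    human_like: float = 0.4   # penalizes formulaic, "AI-sounding" phrasing


def empathy_score(context: str, reply: str) -> float:
    """Placeholder scorer in [0, 1]; in practice this would be a trained generative
    reward model that first writes out an emotional-reasoning path."""
    cues = ("sounds", "feel", "that must", "i hear you")
    return min(1.0, sum(c in reply.lower() for c in cues) / 2)


def human_likeness_score(reply: str) -> float:
    """Placeholder scorer in [0, 1]; docks points for stock phrases often flagged as generic."""
    generic = ("as an ai", "i understand your concern", "stay positive")
    return 1.0 - min(1.0, sum(g in reply.lower() for g in generic) / 2)


def combined_reward(context: str, reply: str, w: RewardWeights = RewardWeights()) -> float:
    return w.empathy * empathy_score(context, reply) + w.human_like * human_likeness_score(reply)


if __name__ == "__main__":
    ctx = "I failed my exam again and I feel like giving up."
    print(combined_reward(ctx, "That must feel crushing. Do you want to talk about what happened?"))
    print(combined_reward(ctx, "As an AI, I understand your concern. Stay positive!"))
```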
Light-X Is Here! The World's First 4D Video Generation Framework with Dual "Camera × Lighting" Control Turns Monocular Video Cinematic
机器之心· 2025-12-09 08:41
Core Insights
- The article introduces Light-X, the world's first 4D video generation framework that allows for dual control of camera movement and lighting in single-view videos, enabling users to re-direct and adjust lighting conditions post-capture [2][32]
- Light-X addresses the challenge of simultaneously controlling both camera trajectory and lighting, which has not been effectively solved in previous research [7][32]

Research Background
- The visual experience in the real world is composed of geometry, motion, and lighting, while single-view videos are merely 2D projections of this complex four-dimensional space [5]
- The ability to control camera position and lighting conditions after filming can significantly enhance applications in film production, virtual shooting, and AR/VR content generation [5]

Methodology
- Light-X's core approach involves decoupling camera control from lighting control, then integrating them within a diffusion model to achieve dual controllability in single-view videos [10][32]
- The framework constructs two branches from the input video: one for dynamic point clouds (camera control) and another for re-lighting point clouds (lighting control), successfully decoupling these factors during modeling [11]

Data Construction
- Light-X requires paired geometric alignment, multi-lighting, and multi-view training data, which is scarce in the real world. To address this, Light-Syn was developed to automatically synthesize training data from single-view videos [15][32]
- The data pipeline incorporates various video sources to ensure the model learns realistic motion structures and adapts to diverse lighting styles [19]

Experimental Results
- Light-X was evaluated on two core tasks: joint control of camera and lighting, and video re-lighting, outperforming existing methods in image quality, video smoothness, and user preference [25][32]
- In the joint control task, Light-X achieved a FID score of 101.06, significantly better than previous methods, demonstrating superior image quality and user satisfaction [27]

Ablation Studies
- Ablation studies indicate that multi-source data is crucial for enhancing new view quality, motion stability, and lighting diversity, while fine-grained lighting cues and global lighting control improve consistency and stability [30][32]

Conclusion
- Light-X represents a significant advancement in video generation technology, enabling simultaneous control of camera movement and lighting, with extensive experimental validation showing its superiority over existing methods [32]
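The summary does not detail how the two branches are fed to the diffusion model. As a hedged sketch of one common way to realize such dual conditioning, the toy PyTorch module below concatenates a camera-trajectory render and a relighting render with the noisy latent along the channel dimension; the shapes and the single-convolution "denoiser" are placeholders, not Light-X's architecture.

```python
# Toy sketch of dual conditioning: renders from a camera-warped point cloud and
# from a relit point cloud are concatenated channel-wise with the noisy latent
# before entering a video denoiser. Shapes and the denoiser are illustrative
# assumptions, not Light-X's actual design.
import torch
import torch.nn as nn


class ToyVideoDenoiser(nn.Module):
    def __init__(self, latent_ch: int = 4, cond_ch: int = 3):
        super().__init__()
        # input = noisy latent + camera-condition render + lighting-condition render
        self.net = nn.Conv3d(latent_ch + 2 * cond_ch, latent_ch, kernel_size=3, padding=1)

    def forward(self, noisy_latent, camera_render, lighting_render):
        x = torch.cat([noisy_latent, camera_render, lighting_render], dim=1)  # (B, C, T, H, W)
        return self.net(x)  # stand-in for the predicted noise


B, T, H, W = 1, 8, 32, 32
noisy = torch.randn(B, 4, T, H, W)
cam_cond = torch.randn(B, 3, T, H, W)    # frames rendered along the target camera trajectory
light_cond = torch.randn(B, 3, T, H, W)  # frames rendered under the target lighting
out = ToyVideoDenoiser()(noisy, cam_cond, light_cond)
print(out.shape)  # torch.Size([1, 4, 8, 32, 32])
```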
Horizon Robotics Unveils Its BPU "Riemann" Architecture for the First Time, Rebuilding AI Computing with Mathematical Manifolds
机器之心· 2025-12-09 08:41
Core Viewpoint
- The article discusses the evolution and advancements of Horizon Robotics under the leadership of founder Yu Kai, highlighting the transition from digital intelligence to physical intelligence and the development of new AI architectures and algorithms aimed at enhancing autonomous driving capabilities and robotics [1][2].

Group 1: Company Evolution and Milestones
- In December 2012, a significant auction for a deep learning team took place, where Yu Kai represented Baidu, competing against Google, Microsoft, and DeepMind, marking a pivotal moment in AI history [1].
- Horizon Robotics was officially registered on July 14, 2015, coinciding with NASA's New Horizons mission, symbolizing the company's commitment to reaching new heights in AI computing [2].

Group 2: Technological Advancements
- At the 2025 Horizon Technology Ecosystem Conference, Horizon Robotics unveiled its full-scene intelligent driving (HSD) production capabilities and introduced the new BPU "Riemann" architecture, aiming to build a foundational "Wintel"-like ecosystem for physical AI [4].
- The BPU architecture has evolved significantly, with a tenfold increase in key operator performance and tenfold broader support for high-precision operators, targeting L4/L5 level autonomous driving [7].
- The new Riemann architecture aims to simplify complex real-world structures, enhancing efficiency and performance in AI applications [7].

Group 3: Compiler Innovations
- Horizon Robotics introduced the fourth-generation compiler "OpenExplorer 4.0," which incorporates AI-driven optimization strategies, including reinforcement learning and Monte Carlo tree search, to overcome traditional compiler limitations [8][9].
- The new compiler has reduced compilation time from hours to minutes and improved model performance by 20%, optimizing end-to-end latency in HSD applications [12].

Group 4: Business Model Transformation
- Horizon Robotics launched the HSD Together model, transitioning from a traditional chip-selling approach to providing a comprehensive algorithm service, allowing partners to leverage a validated intelligent driving system [13][14].
- This model aims to significantly reduce costs and time for partners, enabling them to focus on integration and differentiation while lowering operational expenses by up to 90% [14].

Group 5: Market Accessibility
- Horizon Robotics aims to democratize advanced driving assistance systems, targeting the 100,000 RMB market segment, which constitutes a significant portion of the Chinese automotive market [16].
- The company demonstrated that a single Journey 6M chip can effectively handle complex urban driving scenarios, emphasizing cost-effectiveness and adaptability for both electric and traditional fuel vehicles [18][19].

Group 6: Robotics Ecosystem Development
- Yu Kai emphasized the importance of mastering autonomous driving as a foundation for future robotics, defining intelligent driving models as the starting point for physical AI [21].
- Horizon Robotics introduced open-source initiatives for its robotics business, including HoloMotion and HoloBrain, aimed at enhancing motion and operational intelligence in robots [20][25].
- HoloMotion has been made available on GitHub, with institutions like Stanford and Tsinghua already utilizing it, indicating a strong interest in developing embodied intelligence [27].
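Beyond naming "reinforcement learning and Monte Carlo tree search," the article gives no compiler details. As a much simpler stand-in for that kind of schedule-space exploration, the sketch below runs random search over tile sizes against a toy cost model; the search space and cost model are invented for illustration and have nothing to do with OpenExplorer 4.0's internals.

```python
# Toy stand-in for learned schedule search in a compiler: random search over
# tile sizes against a made-up cost model. A real system (per the article, RL
# plus Monte Carlo tree search) would explore a far richer decision space.
import random


def toy_cost_model(tile_m: int, tile_n: int, matrix=(1024, 1024)) -> float:
    """Pretend latency: penalize tiles that overflow a fake 64 KB on-chip buffer
    and tiles so small that per-tile overhead dominates."""
    m, n = matrix
    bytes_per_tile = tile_m * tile_n * 4          # fp32 elements
    overflow_penalty = 10.0 if bytes_per_tile > 64 * 1024 else 0.0
    num_tiles = (m // tile_m) * (n // tile_n)
    return num_tiles * 0.01 + overflow_penalty


def random_search(num_trials: int = 200, seed: int = 0):
    rng = random.Random(seed)
    choices = [16, 32, 64, 128, 256]
    best = None
    for _ in range(num_trials):
        cand = (rng.choice(choices), rng.choice(choices))
        cost = toy_cost_model(*cand)
        if best is None or cost < best[1]:
            best = (cand, cost)
    return best


# Prints the best (tile_m, tile_n) found and its toy cost: one of the tilings
# that just fits the fake 64 KB buffer while keeping the tile count low.
print(random_search())
```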
Google's TPU Is Going All Out: Output Up 120%, 4x the Performance; Can Nvidia Still Hold Its Seat?
机器之心· 2025-12-09 08:41
Core Viewpoint
- Google's TPU is set to disrupt Nvidia's dominance in the AI chip market, with significant production increases and cost advantages for inference tasks [2][4][79].

Group 1: TPU Production and Market Strategy
- Morgan Stanley predicts that Google's TPU production will surge to 5 million units by 2027 and 7 million by 2028, a substantial increase from previous estimates of 3 million and 3.2 million units, representing a 67% and 120% upward adjustment respectively [2].
- Google aims to sell TPUs to third-party data centers, complementing its Google Cloud Platform (GCP) business, while still utilizing most TPUs for its own AI training and cloud services [2][3].

Group 2: Comparison with Nvidia's GPU
- Nvidia has historically dominated the AI chip market, controlling over 80% of it by 2023, but faces challenges as the market shifts from training to inference, where Google's TPU offers superior efficiency and cost advantages [8][12].
- By 2030, inference is expected to consume 75% of AI computing resources, creating a market worth $255 billion, growing at a CAGR of 19.2% [8][52].

Group 3: Cost and Efficiency Advantages of TPU
- Google's TPU is designed for inference, providing a cost per hour of $1.38 compared to Nvidia's H100 at over $2.50, making TPU 45% cheaper [20].
- TPU's performance in inference tasks is four times better per dollar spent compared to Nvidia's offerings, and it consumes 60-65% less power [20][22].

Group 4: Industry Trends and Client Migration
- Major AI companies are transitioning from Nvidia GPUs to Google's TPUs to reduce costs significantly; for instance, Midjourney reported a 65% reduction in costs after switching to TPU [34].
- Anthropic has committed to a deal for up to 1 million TPUs, highlighting the growing trend of companies seeking cost-effective solutions for AI workloads [35].

Group 5: Future Implications for Nvidia
- Nvidia's profit margins, currently between 70-80%, may face pressure as Google captures even a small portion of the inference workload, potentially leading to over $6 billion in annual profit loss for Nvidia [22][59].
- The shift towards TPUs indicates a broader trend where companies are diversifying their AI infrastructure, reducing reliance on Nvidia's products [67].
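A quick arithmetic check of the quoted hourly prices confirms the "45% cheaper" figure; the perf-per-dollar helper below only shows how such a ratio would be computed, with made-up throughput numbers rather than the article's underlying benchmarks.

```python
# Quick arithmetic check of the cost figures quoted above. The hourly prices
# come from the article; the throughput numbers further down are placeholders
# used only to show how a perf-per-dollar ratio is formed.
tpu_hourly = 1.38   # USD per chip-hour, as quoted
gpu_hourly = 2.50   # USD per chip-hour, the quoted H100 lower bound

savings = 1 - tpu_hourly / gpu_hourly
print(f"hourly cost advantage: {savings:.0%}")  # ~45%, consistent with the article


def perf_per_dollar(tokens_per_second: float, hourly_cost: float) -> float:
    """Tokens generated per dollar: (tokens/s * 3600 s) / (USD per hour)."""
    return tokens_per_second * 3600 / hourly_cost


# Example with invented throughputs, just to show the units work out;
# these are NOT measured numbers.
ratio = perf_per_dollar(900.0, tpu_hourly) / perf_per_dollar(500.0, gpu_hourly)
print(f"illustrative perf-per-dollar ratio: {ratio:.2f}x")
```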
No Remote Control, Dropped into the Wilderness: It's Time for Embodied AI to Be "Weaned"
机器之心· 2025-12-09 03:17
Core Viewpoint
- The article discusses the challenges faced by humanoid robots in real-world scenarios, emphasizing that their capabilities have been overestimated and that significant advancements are still required for practical applications [11][61].

Group 1: Robot Performance in Real-World Scenarios
- Humanoid robots struggle with tasks in outdoor environments, often failing to perform basic functions without remote control [9][11].
- The ATEC 2025 competition highlighted the limitations of robots in navigating complex terrains and performing tasks autonomously, with many relying on remote operation [30][32].
- Successful completion of tasks by some teams demonstrated that traditional methods combined with advanced technology can yield better results than relying solely on large models [26][50].

Group 2: Technical Challenges
- Robots face significant difficulties in perception and decision-making, particularly in varying light conditions that affect their sensors [14][21].
- The complexity of physical interactions, such as grasping objects with different textures and colors, poses a challenge for robots due to their lack of tactile feedback [23][56].
- The integration of various computational units (CPU, GPU, NPU) in a compact and efficient manner remains a significant hurdle for robotic systems [52][56].

Group 3: Future Directions and Industry Insights
- Experts believe that for robots to be integrated into human environments, they must develop capabilities in mobility, manipulation, and environmental modification [61][66].
- The article suggests that failures in robotic tasks are essential for progress, as they reveal weaknesses that need to be addressed for future advancements [65][66].
- The future of artificial general intelligence (AGI) is expected to involve a deeper integration of machine intelligence with the physical world, moving beyond data recognition to environmental interaction and action execution [66].
Snapchat Proposes Canvas-to-Image: Integrating ID, Pose, and Layout on a Single Canvas
机器之心· 2025-12-09 03:17
Core Viewpoint
- Canvas-to-Image is a new framework for compositional image generation that integrates various control signals into a single canvas, simplifying the image generation process by allowing users to provide multiple types of control information simultaneously [2][9][31]

Group 1: Traditional Control Limitations
- Traditional image generation methods utilize independent input paths for identity reference, pose sketches, and layout boxes, leading to a fragmented and lengthy process [7][8]
- Users are unable to overlay multiple control signals in the same area of an image, which restricts the complexity of scene construction [8][9]

Group 2: Canvas-to-Image Methodology
- The Canvas-to-Image framework consolidates all control signals onto a single canvas, allowing the model to interpret and execute them within the same pixel space [9][10]
- The multi-task canvas serves as both the user interface and the model's input, enabling the integration of heterogeneous visual symbols and their spatial relationships [14]

Group 3: Training and Inference Process
- During training, the model learns from cross-frame image sets, which introduces significant variations in pose, lighting, and expression, preventing it from relying on simple copy mechanisms [15]
- In the inference phase, users can flexibly combine multiple control modalities on the same canvas, allowing for complex scene generation without switching between different modules [16]

Group 4: Experimental Results
- Canvas-to-Image can simultaneously handle identity, pose, and layout box controls, outperforming baseline methods that often fail under similar conditions [18]
- The model maintains spatial and semantic relationships between characters and objects, generating scenes with natural interactions and coherence even under complex control settings [20][21]

Group 5: Conclusion
- The core value of Canvas-to-Image lies in its ability to visualize multi-modal generation controls, making complex scene construction intuitive through direct manipulation on the canvas [31]
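The exact canvas format is defined by the paper, but the basic idea of drawing every control onto one image can be sketched with Pillow: paste identity crops into their target boxes, outline the layout boxes, and draw a pose cue on top. The coordinates, colors, and stand-in face crop below are illustrative assumptions, not Canvas-to-Image's real canvas specification.

```python
# Minimal sketch of assembling a single control canvas: layout boxes, an
# identity crop pasted at its target location, and a crude pose cue, all in
# one image that a canvas-conditioned model could take as input.
from PIL import Image, ImageDraw

W, H = 768, 512
canvas = Image.new("RGB", (W, H), "white")
draw = ImageDraw.Draw(canvas)

# 1) Layout boxes: where each subject should appear in the generated image.
boxes = {"person_a": (60, 120, 300, 480), "person_b": (420, 140, 680, 480)}
for name, (x0, y0, x1, y1) in boxes.items():
    draw.rectangle((x0, y0, x1, y1), outline="red", width=3)
    draw.text((x0 + 4, y0 + 4), name, fill="red")

# 2) Identity reference: paste a face crop inside its box (gray square stands
#    in for a real ID crop here).
face_a = Image.new("RGB", (96, 96), "lightgray")
canvas.paste(face_a, (boxes["person_a"][0] + 8, boxes["person_a"][1] + 8))

# 3) Pose cue: a crude stick figure inside person_b's box.
x0, y0, x1, y1 = boxes["person_b"]
cx = (x0 + x1) // 2
draw.ellipse((cx - 20, y0 + 20, cx + 20, y0 + 60), outline="blue", width=4)  # head
draw.line((cx, y0 + 60, cx, y1 - 80), fill="blue", width=4)                  # torso

canvas.save("control_canvas.png")  # this composite image is what the model would condition on
```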
Are Full Images and Slices Really Equivalent? LLaVA-UHD-v3 Reveals the Difference and Introduces an Efficient Full-Image Modeling Approach
机器之心· 2025-12-09 03:17
Core Insights
- The article discusses the advancements in multimodal large models (MLLMs) and the introduction of LLaVA-UHD v3, which addresses the challenge of efficiently processing high-resolution images while maintaining global understanding capabilities [2][3][10].

Group 1: Introduction of LLaVA-UHD v3
- LLaVA-UHD v3 introduces a new progressive visual compression framework (PVC) that consists of two core components: Refined Patch Embedding (RPE) and Windowed Token Compression (WTC) [4][10].
- The PVC framework significantly reduces the number of visual tokens while preserving global semantic consistency, enhancing the efficiency of native high-resolution visual encoding [4][10].

Group 2: Comparison of Encoding Methods
- The research team conducted a fair comparison between slice-based encoding (SBE) and global native-resolution encoding (GNE) using the same model architecture, training data, and evaluation protocols [5].
- GNE demonstrated a notable advantage in spatial perception and localization tasks, with an average improvement of approximately 11.0% over SBE [6].
- In general visual-language understanding tasks, GNE outperformed SBE by about 2.1%, indicating that GNE is more suitable for tasks requiring spatial awareness and high-resolution understanding [7].

Group 3: Efficiency and Performance of LLaVA-UHD v3
- The PVC architecture allows for a significant reduction in computational load while maintaining model capabilities, achieving a 2.4× acceleration compared to MoonViT and 1.9× faster encoding than Qwen2.5-ViT [16].
- LLaVA-UHD v3 was trained on approximately 20 million image-text pairs, significantly fewer than competitors like Qwen2-VL (700 million) and MiniCPM-V2.6 (460 million), yet it remains highly competitive across various visual-language benchmarks [17].
- The model achieved a visual token compression rate of 64×, surpassing competitors, while still performing comparably or better in tasks requiring fine-grained visual information [17].

Group 4: Future Directions
- The article emphasizes the need for further exploration of visual encoding pre-training strategies suitable for multimodal tasks and the gradual introduction of linear-complexity operators to replace traditional quadratic-complexity attention mechanisms [20].
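The paper's WTC module is learned, but the arithmetic behind the 64× token reduction is easy to see with a plain average pool over non-overlapping 8×8 windows of the patch grid. The sketch below is only that illustration, not LLaVA-UHD v3's implementation.

```python
# Illustrative window-based visual token compression: average-pool ViT tokens
# over non-overlapping spatial windows, so an 8x8 window yields a 64x reduction
# in token count. The real WTC module is learned, not a plain average pool.
import torch


def window_compress(tokens: torch.Tensor, grid_h: int, grid_w: int, win: int = 8) -> torch.Tensor:
    """tokens: (B, N, C) with N = grid_h * grid_w; returns (B, N // win**2, C)."""
    b, n, c = tokens.shape
    assert n == grid_h * grid_w and grid_h % win == 0 and grid_w % win == 0
    x = tokens.view(b, grid_h, grid_w, c).permute(0, 3, 1, 2)   # (B, C, H, W)
    x = torch.nn.functional.avg_pool2d(x, kernel_size=win)      # (B, C, H/win, W/win)
    return x.flatten(2).transpose(1, 2)                         # (B, N/win^2, C)


vis_tokens = torch.randn(1, 64 * 64, 1024)   # a 64x64 patch grid from a high-resolution image
compressed = window_compress(vis_tokens, 64, 64, win=8)
print(vis_tokens.shape[1], "->", compressed.shape[1])  # 4096 -> 64, i.e. 64x fewer tokens
```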
Newly Listed Moore Threads (摩尔线程) Is About to Unveil Its Next-Generation GPU Architecture
机器之心· 2025-12-09 03:17
Core Viewpoint
- The MUSA Developer Conference (MDC 2025) will be held on December 19-20, 2025, in Beijing, focusing on the development of full-function GPUs and aiming to explore breakthroughs in domestic computing power and the creation of a new autonomous computing ecosystem [2][4].

Group 1: Conference Overview
- MDC 2025 is the first domestic conference dedicated to full-function GPUs, emphasizing the themes of creation, connection, and convergence [2].
- The conference aims to gather global developers, technology leaders, and industry pioneers to discuss technological self-reliance and industrial upgrades [2].
- The event will showcase the MUSA technology system and its full-stack capabilities, promoting the integration of GPU technology across various industries [2][4].

Group 2: Main Forum Highlights
- The main forum will focus on intelligent computing as a core engine for digital transformation across industries, with a keynote by Zhang Jianzhong, founder and CEO of Moore Threads, detailing the full-stack development strategy centered around MUSA [4].
- A new-generation GPU architecture will be unveiled, along with a comprehensive layout of product systems, core technologies, and industry solutions [4].
- The forum will also share practical applications and ecosystem progress in AI computing, graphics rendering, and scientific computing [4].

Group 3: Technical Sessions
- Over 20 technical sub-forums will be held, covering key areas such as intelligent computing, graphics computing, AI infrastructure, and developer tools [6].
- A "Moore Academy" will be established to empower developers through systematic technical sharing, resource integration, and talent cultivation [6].

Group 4: Interactive Experience
- A 1,000-square-meter immersive "MUSA Carnival" will be created, featuring diverse thematic exhibition areas that cover cutting-edge technologies and popular application scenarios [8].
- The carnival will include interactive live demonstrations, allowing attendees to experience innovations in AI, digital twins, and more [8][11].

Group 5: Company Vision
- Moore Threads aims to provide accelerated computing infrastructure and one-stop solutions to support digital transformation across various industries [26].
- The company aspires to become a leading GPU enterprise with international competitiveness, focusing on the integration of AI and digital twin technologies [26].
Will ICLR 2026 Be Okay? 50 of 300 Submissions Contain Hallucinations, and Even Citations to example.com Pass Review
机器之心· 2025-12-08 10:11
Reported by 机器之心 (Machine Heart). Editors: Du Wei, Panda

ICLR's troubles are not over yet. For ICLR 2026, it has recently been one problem after another. First, a third party's systematic analysis of review comments found that 21% of them were entirely AI-generated; then an OpenReview data exposure de-anonymized reviewers, affecting more than 10,000 ICLR 2026 submissions.

Now yet another embarrassment in the ICLR 2026 review process has come to light. The AI-generated-content detection platform GPTZero scanned 300 submitted papers and found that 50 of them contained at least one clearly hallucinated citation.

Some of these hallucinated citations are absurd to an almost unbelievable degree, as if the submitters never checked them at all. In one example shared on X by GPTZero CTO and co-founder Alex Cui, the citation link provided by the submitters is the default placeholder link example.com. In another example, the author list is just a string of capital letters.

More worrying still, these submissions with hallucinated content had already been peer-reviewed by three to five domain experts, the vast majority of whom failed to spot the fake citations. This means that, absent other outside intervention, these submissions could be accepted by ICLR. Some of the submissions ...
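GPTZero's detection pipeline is not described in the article, but the two failure modes it called out (a citation URL pointing at example.com, and an author list that is just a run of capital letters) are easy to screen for mechanically. The sketch below is a crude illustrative filter under those assumptions, not GPTZero's method.

```python
# Crude sketch of flagging obviously bogus bibliography entries, in the spirit
# of the failure modes described above: a citation URL that points at
# example.com, or an author field that is just single capital letters.
# This is only illustrative; GPTZero's actual pipeline is not public here.
import re

SUSPECT_URL = re.compile(r"https?://(www\.)?example\.(com|org|net)\b", re.IGNORECASE)
SUSPECT_AUTHORS = re.compile(r"^[A-Z](\s*,?\s*[A-Z]){3,}$")  # e.g. "A B C D E"


def flag_reference(authors: str, url: str = "") -> list[str]:
    issues = []
    if url and SUSPECT_URL.search(url):
        issues.append("placeholder URL")
    if SUSPECT_AUTHORS.match(authors.strip()):
        issues.append("author list is a string of single capital letters")
    return issues


refs = [
    {"authors": "J. Smith, L. Wei", "url": "https://arxiv.org/abs/2401.00001"},
    {"authors": "A B C D E", "url": "https://example.com"},
]
for r in refs:
    print(r["authors"], "->", flag_reference(r["authors"], r["url"]) or "looks plausible")
```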