Pushing the Boundaries of Embodied Task Planning: ZTE's EmbodiedBrain Model Sets New SOTA Across Embodied-Brain Benchmarks and Teaches the Embodied Brain "Complex Planning"
机器之心· 2025-12-03 08:30
Core Insights
- The article discusses the development of the EmbodiedBrain model by the ZTE NebulaBrain Team, which aims to address the limitations of current large language models (LLMs) in embodied tasks, focusing on robust spatial perception, efficient task planning, and adaptive execution in real-world environments [2][4].

Group 1: Model Architecture
- EmbodiedBrain utilizes a modular encoder-decoder architecture based on Qwen2.5-VL, achieving an integrated loop of perception, reasoning, and action [5].
- The model processes varied multimodal inputs, including images, video sequences, and complex language instructions, and generates structured outputs for direct control of and interaction with embodied environments [8][10].
- Key components include a vision transformer for image processing, a lightweight MLP for visual-language integration, and a decoder that enhances temporal understanding of dynamic scenes [9][10].

Group 2: Data and Training
- The model features a structured data architecture designed for embodied intelligence, ensuring alignment between high-level task goals and low-level execution steps [12].
- Training data spans four core categories: general multimodal instruction data, spatial reasoning data, task planning data, and video understanding data, with quality enforced through multi-stage filtering [14][15].
- Training includes a two-stage rejection sampling method to enhance the model's perception and reasoning capabilities, followed by a multi-task reinforcement learning approach called Step-GRPO to improve long-horizon task handling [20][21].

Group 3: Evaluation System
- EmbodiedBrain establishes a comprehensive evaluation system covering general multimodal capabilities, spatial perception, and end-to-end simulation planning, addressing the limitations of traditional offline assessments [26][27].
- The model demonstrates superior performance on various benchmarks, including MM-IFEval and MMStar, indicating enhanced multimodal capabilities relative to competitors [28][29].
- In spatial reasoning and task planning evaluations, EmbodiedBrain achieves significant improvements, showcasing its ability to perform complex tasks effectively [30][31].

Group 4: Case Studies and Future Outlook
- The model successfully executes tasks involving spatial reasoning and end-to-end execution, demonstrating its capability to generate coherent action sequences from complex instructions [37][41].
- ZTE plans to open-source the EmbodiedBrain model and its training data, aiming to foster collaboration in embodied intelligence and to address existing challenges in data accessibility and evaluation standards [42][43].
- Future work will focus on multi-agent collaboration and on adaptability across real-world robotic platforms, pushing the boundaries of embodied intelligence applications [43].
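The two-stage rejection sampling mentioned under Group 2 can be sketched generically: candidates are repeatedly re-scored, and only those that clear a quality bar survive each stage. The scorer, threshold, and toy data below are invented for illustration; the article does not describe the team's actual scoring models.

```python
def rejection_sample(candidates, scorer, threshold, n_stages=2):
    """Multi-stage rejection sampling sketch: each stage re-scores the
    survivors and discards low-quality candidates. `scorer` and `threshold`
    stand in for the model-based quality judges the article alludes to."""
    kept = list(candidates)
    for stage in range(n_stages):
        kept = [c for c in kept if scorer(c, stage) >= threshold]
    return kept

# Toy candidates: (plan_text, quality) pairs; the scorer just reads quality.
pool = [("plan A", 0.9), ("plan B", 0.4), ("plan C", 0.7), ("plan D", 0.2)]
survivors = rejection_sample(pool, scorer=lambda c, s: c[1], threshold=0.6)
print(survivors)  # only the high-scoring plans survive both stages
```

In the real pipeline the second stage would typically apply a stricter or different judge than the first; the sketch reuses one scorer for brevity.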
Foreigners Stunned: Asked Plainly in English, DeepSeek Still Insists on Thinking in Chinese
机器之心· 2025-12-03 08:30
Core Insights
- DeepSeek has launched two new models, DeepSeek-V3.2 and DeepSeek-V3.2-Speciale, which show significant improvements in reasoning capabilities, with the former comparable to GPT-5 and the latter performing similarly to Gemini-3.0-Pro [1][4].
- There is a notable phenomenon where DeepSeek switches to Chinese during reasoning even when queries are made in English, leading to discussions about the efficiency of Chinese in processing information [4][6].

Group 1: Model Performance
- The new models exhibit enhanced reasoning speed, attracting interest from overseas researchers [1].
- The comment section reflects a consensus that Chinese characters have a higher information density, requiring fewer characters to express the same meaning compared to English [4][6].

Group 2: Cross-Lingual Reasoning
- Research indicates that using non-English languages for reasoning can lead to better performance and reduced token consumption, as shown in the paper "EfficientXLang" [7][8].
- The study found that reasoning in non-English languages can reduce token counts by 20-40% without sacrificing accuracy, with DeepSeek R1 showing reductions from 14.1% (Russian) to 29.9% (Spanish) [11].

Group 3: Language Efficiency
- Although Chinese saves reasoning-token costs relative to English, it is not the most efficient language; Polish ranks highest in long-context tasks [12][14].
- Model performance varies significantly with the language of the instructions, and English is not the top performer in long-context tasks [14][18].

Group 4: Training Data Influence
- The prevalence of Chinese training data in domestic models explains these models' tendency to think in Chinese [20][21].
- The phenomenon of models like OpenAI's o1-pro occasionally using Chinese during reasoning raises questions about the influence of training-data composition [24][25].
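The token-reduction figures quoted from "EfficientXLang" are simple percentage savings against an English baseline. A minimal sketch, with hypothetical per-language trace lengths chosen so the article's 14.1% and 29.9% figures fall out by construction:

```python
def token_reduction(base_tokens, other_tokens):
    """Percent fewer tokens than the baseline (positive = savings)."""
    return 100.0 * (base_tokens - other_tokens) / base_tokens

# Hypothetical reasoning-trace lengths for one problem; only the percentages
# come from the article, the raw counts are invented for illustration.
english = 1000
russian = 859   # the low end reported for DeepSeek R1
spanish = 701   # the high end reported for DeepSeek R1

print(round(token_reduction(english, russian), 1))  # 14.1
print(round(token_reduction(english, spanish), 1))  # 29.9
```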
China's Young AI Talent Is Now So Competitive That Even the Industry Is Astonished
机器之心· 2025-12-03 04:01
Core Viewpoint
- The article discusses a recent advertising algorithm competition organized by Tencent, highlighting the innovative approaches taken by participants to tackle the challenges of recommendation systems, particularly in addressing the "cold start" problem and utilizing generative methods for better user engagement [10][11][15].

Group 1: Competition Overview
- The competition lasted over five months, attracting more than 8,000 participants and 2,800 teams, making it a highly competitive technical marathon [22].
- The prize pool was 3.6 million yuan, with the champion team eligible for a 2 million yuan reward [11].
- Participants were provided with desensitized multimodal historical behavior data, including text, visual, and collaborative behaviors, on which to make predictions [17][21].

Group 2: Technical Challenges and Innovations
- The competition focused on generative advertising recommendation, a direction that has emerged only in the last few years, requiring participants to explore and innovate given the lack of existing reference material [21].
- Many teams attempted to integrate various modalities and address issues such as data noise and missing values, reflecting real-world complexity [21][28].
- Participants showcased innovative solutions, including different generative frameworks and methods for aligning multimodal embeddings, demonstrating a strong grasp of both academic and practical applications [31].

Group 3: Talent Development and Future Prospects
- Tencent's Vice President, Jiang Jie, noted a significant improvement in students' understanding of large models and their ability to produce solutions closely aligned with industry needs [29].
- Outstanding participants will be included in Tencent's "Qingyun Plan," which aims to nurture top talent by providing access to resources and mentorship [35].
- The competition highlighted the importance of collaborative learning and the potential for young talent to contribute significantly to the AI field, indicating a promising future for China's AI development [35].
Why Does Fitting a Robot with Expensive Tactile Sensors Actually Make It Dumber?
机器之心· 2025-12-03 04:01
This work is a collaboration between the University of Illinois Urbana-Champaign (UIUC), Harvard University, Columbia University, and MIT.

Our solution: Compositional Policies

Why does feature concatenation fail in robot perception and decision-making?

Imagine fishing for your keys inside a pitch-black backpack. Your eyes are useless; you rely entirely on your fingertips. For you this is effortless, but in robotics it is a genuinely hard problem.

The harsh truth: the mainstream multi-sensor fusion approach in robot learning today, feature concatenation, fails completely on this kind of task. Our experiments show that when you add tactile data to a robot in an attempt to make it smarter, its grasp success rate plummets from 35% to 5%. Why? Because the traditional approach treats the occasional but critical tactile signal as "noise" and simply filters it out.

Limitations of current methods: current multimodal robot-learning approaches typically use feature concatenation, extracting embeddings from all sensors, concatenating them into one large vector, and feeding it into a single neural-network policy.

Paper title: Multi-Modal Manipulatio ...
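The failure mode above can be caricatured in a few lines. In the sketch below, both the weights and the gating rule are invented for illustration; the paper's actual policies are learned neural networks, not hand-set heuristics. The point is structural: a single policy over the concatenated vector averages a rare tactile spike away, while a compositional scheme routes the decision to the modality that is actually informative.

```python
def concat_policy(vision, touch):
    """Single policy over the concatenated feature vector.

    With a fixed averaged weighting, a rare-but-critical touch spike is
    diluted by the (useless, in-the-dark) vision features."""
    features = vision + touch  # feature concatenation
    return sum(features) / len(features)

def compositional_policy(vision, touch):
    """Per-modality experts combined by a confidence gate.

    When the touch channel fires strongly, the gate hands the decision to
    the touch expert instead of averaging the spike away as noise."""
    touch_expert, vision_expert = max(touch), max(vision)
    touch_confidence = 1.0 if max(touch) > 0.5 else 0.0
    return touch_confidence * touch_expert + (1 - touch_confidence) * vision_expert

vision = [0.0, 0.1, 0.0]  # dark backpack: vision is uninformative
touch = [0.0, 0.9, 0.0]   # fingertip contact: sparse but decisive

print(concat_policy(vision, touch))         # the spike is washed out (~0.17)
print(compositional_policy(vision, touch))  # the touch expert decides (0.9)
```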
Inspired by the Brain's Hippocampus-Cortex Mechanism, Red Bear AI Rebuilt the "Memory System" from Scratch
机器之心· 2025-12-03 04:01
Core Insights
- The article emphasizes that memory is becoming a critical breakthrough in the evolution of AI, transitioning from "instant answer tools" to "personalized super assistants" [1][4].
- A new machine learning paradigm called "Nested Learning" has been proposed, allowing large language models to learn new skills without forgetting old ones, marking significant progress towards AI that mimics human memory [3][4].

Group 1: Shifts in AI Landscape
- The focus of large models is shifting from size and speed to memory capabilities and understanding user needs, indicating a new competitive landscape in AI [4][5].
- Current large models struggle with long-term memory due to inherent limitations in their architecture, leading to issues like forgetting critical user information during interactions [6][7].

Group 2: Memory Mechanisms
- Existing models typically have context windows of 8k-32k tokens, which can lead to early information being "pushed out" during long conversations, causing loss of context [6].
- The lack of a shared memory mechanism among multiple agents results in "memory islands," where users must repeatedly provide information, diminishing the user experience [7].

Group 3: Innovations in Memory
- Companies like Google, OpenAI, and Anthropic are focusing on enhancing memory capabilities in AI models, responding to industry demand for long-term, stable, and evolving memory systems [7][10].
- Red Bear AI has developed "Memory Bear," a product that addresses the memory limitations of traditional models by implementing a human-like memory architecture [10][11].

Group 4: Memory Bear's Architecture
- "Memory Bear" utilizes a hierarchical, dynamic memory structure inspired by the human brain's hippocampus and cortex, allowing for efficient memory management [11][13].
- The system distinguishes between explicit memory (easily codified information) and implicit memory (subjective understanding), enhancing its ability to recall and utilize user-specific data [15][16].

Group 5: Practical Applications and Impact
- "Memory Bear" has shown significant improvements in various applications, such as AI customer service, where it creates dynamic memory maps for users, enhancing interaction quality and reducing the need for repetitive information sharing [20][21].
- In marketing, "Memory Bear" tracks user behavior to create personalized marketing strategies, moving beyond traditional recommendation systems [22].
- The technology has also improved knowledge-acquisition efficiency in organizations and personalized education experiences, demonstrating its versatility across sectors [23][24].

Group 6: Industry Consensus and Future Directions
- The industry consensus is that memory capabilities are essential for advancing AI technology and applications, with increasing investment in and exploration of human-like memory systems [24].
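The hippocampus/cortex analogy maps naturally onto a two-tier store: new facts land in a fast, volatile tier, and repeated recall consolidates them into a durable tier. A minimal sketch, assuming invented class names and thresholds; "Memory Bear"'s actual architecture is not published in this article.

```python
class TwoTierMemory:
    """Toy hippocampus/cortex memory: repeated recall consolidates a fact
    from the fast tier into the durable tier."""
    CONSOLIDATE_AFTER = 2  # recalls needed before a memory becomes long-term

    def __init__(self):
        self.fast = {}     # "hippocampus": recent, volatile
        self.slow = {}     # "cortex": consolidated, durable
        self.recalls = {}  # per-key recall counter

    def store(self, key, value):
        self.fast[key] = value

    def recall(self, key):
        if key in self.slow:
            return self.slow[key]
        if key in self.fast:
            self.recalls[key] = self.recalls.get(key, 0) + 1
            if self.recalls[key] >= self.CONSOLIDATE_AFTER:
                self.slow[key] = self.fast.pop(key)  # consolidation step
            return self.slow.get(key, self.fast.get(key))
        return None

mem = TwoTierMemory()
mem.store("favorite_drink", "oat latte")
mem.recall("favorite_drink")         # first recall: still in the fast tier
mem.recall("favorite_drink")         # second recall: consolidated
print("favorite_drink" in mem.slow)  # True
```

A production system would add forgetting (eviction from the fast tier) and the explicit/implicit distinction the article describes; this sketch only shows the consolidation pathway.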
Just In: "Europe's DeepSeek" Releases the Mistral 3 Model Series, with the Entire Line Returning to Apache 2.0
机器之心· 2025-12-03 00:06
Core Viewpoint
- Mistral AI has launched the Mistral 3 series of open models, positioned as high-performance, cost-effective alternatives in the AI model landscape, particularly in response to competition from DeepSeek [2][4][28].

Model Details
- The Mistral 3 series includes multiple models: Mistral 3 (14B, 8B, 3B) with base, instruction-tuned, and reasoning versions [5][19].
- Mistral Large 3, a state-of-the-art open model, has a total parameter count of 675 billion with 41 billion active parameters, and was trained on 3000 NVIDIA H200 GPUs [5][7].

Performance and Benchmarking
- Mistral Large 3 ranks second among open-source non-reasoning models on the LMArena leaderboard, making it one of the best-performing open models available [14].
- The model demonstrates strong performance on general prompt tasks and excels at image understanding and multilingual dialogue [7][14].

Collaboration and Optimization
- Mistral has partnered with vLLM and Red Hat to improve accessibility and efficiency for developers using Mistral Large 3, with optimized checkpoints for better performance [17][18].
- The collaboration with NVIDIA focuses on advanced optimization techniques, ensuring that Mistral models leverage high-bandwidth memory for demanding workloads [17][18].

Cost-Effectiveness
- Mistral claims its models offer the best cost-performance ratio among open-source models, with instruction models performing comparably to or better than competitors while generating tokens at significantly lower cost [22][28].

Availability and Customization
- Mistral 3 models are available on platforms including Mistral AI Studio, Amazon Bedrock, and Azure Foundry, among others [25].
- The company also offers custom model-training services for organizations seeking AI tailored to specific tasks or environments [27].
Sentence-Level Attribution Plus Generative Citation: C²-Cite Reshapes LLM Trustworthiness
机器之心· 2025-12-03 00:06
In today's era of rapid AI development, large language models have worked their way into every corner of work and life. Yet making AI-generated content trustworthy and traceable remains a central concern for both academia and industry. Imagine asking ChatGPT a question and getting not just an answer but, as in an academic paper, a cited source for every sentence: that is the core problem "attributed large language models" set out to solve.

C²-Cite, an attributed LLM proposed by the BUPT Baijia AI team together with Xiaomi's large-model team, pioneers context-aware citation generation. It lets the model annotate precise information sources automatically while generating content, and it keeps the generated text semantically aligned with the cited external knowledge, so that every statement is grounded in a source and coordinated with its references, addressing the trustworthiness of LLM output at the root. The work has been accepted at WSDM 2026, a top international conference.

Targeting key flaws in existing attribution models, C²-Cite introduces a "context-aware" mechanism that turns citation markers from passive placeholders into special tokens carrying contextual semantics, significantly improving both citation quality and answer accuracy.

Paper title: C²-Cite: Contextual-Aware Citation Generation for Attributed Large Languag ...
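To make "every statement grounded in a source" concrete, here is a toy stand-in for sentence-level attribution: each generated sentence is tagged with the reference whose word overlap with it is highest. C²-Cite learns this alignment with context-aware citation tokens inside the model; the overlap heuristic and all data below are purely illustrative.

```python
def cite(sentences, references):
    """Tag each sentence with the index of its best word-overlap reference.

    A crude proxy for semantic alignment: real attribution models score
    alignment with learned representations, not set intersection."""
    tagged = []
    for sent in sentences:
        words = set(sent.lower().split())
        best = max(range(len(references)),
                   key=lambda i: len(words & set(references[i].lower().split())))
        tagged.append(f"{sent} [{best + 1}]")
    return tagged

refs = [
    "The Eiffel Tower is located in Paris.",
    "Mount Fuji is the highest mountain in Japan.",
]
out = cite(["The Eiffel Tower stands in Paris.",
            "Mount Fuji is in Japan."], refs)
print(out)  # each sentence carries the marker of its best-matching source
```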
Now Altman Is Worried: OpenAI Urgently Declares a "Code Red"
机器之心· 2025-12-02 09:18
Core Viewpoint
- OpenAI has declared a "Code Red" status to address competitive pressures, particularly from Google, as it seeks to enhance ChatGPT and maintain its market position [1][6][9].

Group 1: Competitive Landscape
- Google has rapidly regained its footing with the Gemini chatbot, increasing its monthly active users from 450 million in July to 650 million in October, posing a significant threat to OpenAI [9].
- Other competitors such as Anthropic and xAI are also advancing in various technological directions, intensifying the competitive environment [4][10].

Group 2: OpenAI's Current Challenges
- ChatGPT's growth rate has shown signs of slowing, as indicated by CFO Sarah Friar, raising concerns about sustaining high valuations amid significant cash burn [8].
- The company is seeking approximately $100 billion in new financing to support its extensive cash consumption and ongoing technology development [8].

Group 3: Strategic Priorities
- OpenAI is shifting resources toward core projects, delaying the development of non-essential products, including advertising initiatives [5][6].
- A new reasoning model is set to be released that is claimed to outperform Google's Gemini 3, aimed at enhancing ChatGPT's capabilities [12].

Group 4: User Engagement and Product Development
- OpenAI plans to allow highly customizable interactions for its 800 million weekly active users, striving to position ChatGPT as a true "personal assistant" [13].
- The company aims to tune model behavior to reduce instances of the AI refusing to answer benign questions and to improve its standing in public rankings [13].
Welcoming the "Everything Can Be RAG" Era: A New Survey Maps the Vast Unexplored Space of 50-Plus Multimodal Combinations
机器之心· 2025-12-02 09:18
Core Insights
- The article discusses the emergence of Multimodal Retrieval-Augmented Generation (MM-RAG) as a new field, highlighting its potential applications and the current state of research, which is still in its infancy [2][5][17].
- A comprehensive survey by researchers from Huazhong University of Science and Technology, Fudan University, China Telecom, and the University of Illinois at Chicago covers nearly all possible combinations of input and output modalities for MM-RAG [4][17].

Summary by Sections

Overview of MM-RAG
- MM-RAG is an evolution of traditional Retrieval-Augmented Generation (RAG) that incorporates multiple modalities such as text, images, audio, video, code, tables, knowledge graphs, and 3D objects [2][4].
- Current research primarily focuses on a limited set of modality combinations, leaving many potential applications unexplored [2][5].

Potential Combinations
- The authors identify a vast space of potential input-output modality combinations: of 54 proposed combinations, only 18 have existing research [5][6].
- Notably, combinations like "text + video as input, generating video as output" remain largely untapped [5].

Classification Framework
- A new classification framework for MM-RAG is established, systematically organizing existing research and clearly presenting the core technical components of different MM-RAG systems [6][15].
- This framework serves as a reference for future research and development in the field [6][15].

MM-RAG Workflow
- The MM-RAG workflow is divided into four key stages:
  1. Pre-retrieval: organizing data and preparing queries [11]
  2. Retrieval: efficiently finding relevant information in a multimodal knowledge base [12]
  3. Augmentation: integrating the retrieved multimodal information into the large model [13]
  4. Generation: producing high-quality multimodal outputs based on the input and the augmented information [14][15]

Practical Guidance
- The survey provides a one-stop guide for building MM-RAG systems, covering training, evaluation, and application strategies [17][18].
- It discusses training methods to maximize retrieval and generation capabilities, summarizes existing evaluation metrics, and explores potential applications across various fields [18].
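The four-stage workflow described above can be sketched as plain functions wired into a pipeline. Retrieval here is a keyword match over a toy multimodal store, and the "LLM" is a stub; real systems use embedding search across modalities and an actual multimodal model. All names and data are illustrative.

```python
def pre_retrieval(query):
    """Stage 1: normalize and prepare the query."""
    return query.lower().strip()

def retrieve(query, knowledge_base):
    """Stage 2: find relevant items in the multimodal knowledge base."""
    return [item for item in knowledge_base
            if any(word in item["text"].lower() for word in query.split())]

def augment(query, retrieved):
    """Stage 3: fold the retrieved context into the model prompt."""
    context = " | ".join(item["text"] for item in retrieved)
    return f"Context: {context}\nQuestion: {query}"

def generate(prompt):
    """Stage 4: stand-in for the multimodal LLM call."""
    return f"[answer grounded in] {prompt.splitlines()[0]}"

kb = [
    {"modality": "image+text", "text": "Diagram of a transformer encoder"},
    {"modality": "video+text", "text": "Lecture clip on retrieval augmentation"},
]
q = pre_retrieval("  Transformer encoder  ")
hits = retrieve(q, kb)
print(generate(augment(q, hits)))
```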
AAAI 2026 Oral: Mininglamp Technology Pioneers "Information-Bottleneck Dynamic Compression" for Sparse Data, Achieving SOTA in Both Accuracy and Speed
机器之心· 2025-12-02 06:47
Core Insights
- The article discusses the challenges of "Efficient AI," particularly as transformer models become larger and more general while remaining computationally heavy for edge devices like robots [1][2].
- A paper titled "CompTrack," accepted for oral presentation at AAAI 2026, asks whether models need to process all of their input data and shows how compression can significantly reduce computational cost while maintaining or even improving performance [2][14].

Redundancy Challenges
- Current AI models face a "Dual-Redundancy" challenge:
  1. Spatial redundancy: irrelevant background points and blank areas are processed, wasting computation and degrading accuracy [3][5]
  2. Informational redundancy: even within the relevant foreground target, redundant, low-value information is prevalent and leads to inefficiency [5][7]

CompTrack Framework
- CompTrack proposes an end-to-end framework that addresses both types of redundancy simultaneously [7].
- The framework includes:
  1. A Spatial Foreground Predictor (SFP) that filters out low-information background noise using information-entropy theory [8]
  2. An Information Bottleneck-guided Dynamic Token Compression (IB-DTC) module that dynamically compresses informational redundancy in the foreground [10][11]

Efficiency and Performance
- The IB-DTC module matters for Efficient AI because it:
  1. Is based on the Information Bottleneck principle, retaining only the information valuable for prediction [11]
  2. Uses online Singular Value Decomposition (SVD) to set the compression rate dynamically from the intrinsic rank of the input [12]
  3. Allows end-to-end training by using the SVD as a guide to the optimal compression rate [12]

Application and Results
- CompTrack has been applied to challenging 3D point-cloud tracking tasks, demonstrating that systematic compression of informational redundancy is highly effective [14].
- The framework not only improves efficiency but also sets a precedent for addressing informational redundancy in other fields, including sensor fusion in robotics and multimodal processing in vision-language models [14][15].
- CompTrack runs in real time at 80 FPS on an RTX 3090, surpassing state-of-the-art methods while reducing computational load to 0.94G FLOPs [15].
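The idea behind dynamic token compression can be illustrated without the learning machinery: rank tokens by an information score and keep only enough of them to cover most of the total score mass. CompTrack derives the kept count from an online SVD of the token features; the coverage rule and all values below are a simplified, invented proxy, not the paper's method.

```python
def compress(tokens, scores, coverage=0.9):
    """Keep the highest-scoring tokens until `coverage` of the total score
    mass is retained; the kept count thus adapts to how concentrated the
    information is (a stand-in for the intrinsic-rank criterion)."""
    order = sorted(range(len(tokens)), key=lambda i: -scores[i])
    total, kept_mass, kept = sum(scores), 0.0, []
    for i in order:
        kept.append(tokens[i])
        kept_mass += scores[i]
        if kept_mass >= coverage * total:
            break
    return kept

tokens = ["fg_0", "fg_1", "fg_2", "bg_0"]  # foreground/background point tokens
scores = [5.0, 3.0, 1.0, 1.0]              # hypothetical information scores
print(compress(tokens, scores))            # 3 of 4 tokens cover 9/10 of the mass
```

When the score mass is concentrated in a few tokens, the same rule keeps far fewer of them, which is exactly the input-dependent behavior a fixed compression ratio cannot provide.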