New SOTA in End-to-End Autonomous Driving | KnowVal: An Intelligent Driving System That Understands Law and Ethics and Holds Values
机器之心· 2026-01-14 07:18
Core Viewpoint
- The article discusses the development of KnowVal, an advanced autonomous driving system that integrates perception and knowledge retrieval to enhance visual-language reasoning capabilities, aiming for higher levels of automated driving [4][21].

Group 1: System Overview
- KnowVal is a novel autonomous driving system that combines perception modules with knowledge retrieval modules to achieve visual-language reasoning [4].
- The system constructs a comprehensive driving knowledge graph covering traffic regulations, defensive driving principles, and ethical considerations, supported by an efficient retrieval mechanism based on large language models [4][15].
- KnowVal integrates a world model and a value model within its planner to ensure value-aligned decision-making [4][11].

Group 2: Technical Framework
- The system employs an open 3D perception and knowledge retrieval framework, extending the traditional visual-language paradigm to support basic visual-language reasoning [7][9].
- It combines autonomous-driving-specific perception with open-world 3D perception to extract both common and rare instance features, ensuring effective feature transfer throughout the system [9].
- Knowledge graph retrieval converts perception outputs into natural language queries to access relevant knowledge entries, ranked by relevance [10][15].

Group 3: Value Model and Trajectory Planning
- KnowVal's trajectory planning is built on a world prediction model and a value model, iteratively generating candidate trajectories and evaluating them against retrieved knowledge for value assessment (a loop of this kind is sketched below) [11][19].
- A large-scale driving value preference dataset of 160,000 trajectory-knowledge pairs, annotated with value scores ranging from -1 to 1, was created to train the value model [19].

Group 4: Experimental Results
- Tested against the baselines GenAD, HENet++, and SimLingo, KnowVal achieved the lowest collision rate on the nuScenes dataset and the highest driving score and success rate on the Bench2Drive benchmark [21].
- The results indicate that KnowVal outperforms existing end-to-end and visual-language-action models, demonstrating its effectiveness in real-world driving scenarios [21][22].

Group 5: Qualitative Analysis
- Qualitative examples illustrate KnowVal's adherence to legal and ethical driving behavior, such as slowing down in wet conditions and obeying lane-change regulations in tunnels [23][25].
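The planner loop in Group 3 can be made concrete with a short sketch. This is our own minimal illustration under stated assumptions: `planner`, `world_model`, `value_model`, and `retrieve_knowledge` are hypothetical interfaces, not the paper's actual API.

```python
# Minimal sketch of value-aligned trajectory selection as summarized above.
# All interfaces (planner, world_model, value_model, retrieve_knowledge)
# are hypothetical placeholders, not KnowVal's actual implementation.

def plan_with_values(scene, planner, world_model, value_model,
                     retrieve_knowledge, n_candidates=8, n_iters=3):
    knowledge = retrieve_knowledge(scene)        # relevance-ranked entries
    best_traj, best_score = None, float("-inf")
    for _ in range(n_iters):
        for traj in planner.propose(scene, n=n_candidates):
            outcome = world_model.predict(scene, traj)     # imagined rollout
            score = value_model.score(outcome, knowledge)  # value in [-1, 1]
            if score > best_score:
                best_traj, best_score = traj, score
        planner.refine_around(best_traj)         # next iteration samples nearby
    return best_traj
```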
Sebastian Raschka's 2026 Predictions: Transformers Still Rule, but Diffusion Models Are Quietly Rising
机器之心· 2026-01-14 07:18
Core Insights
- The article discusses the evolving landscape of large language models (LLMs) as of 2026, highlighting a shift from sole reliance on the Transformer architecture toward efficiency and hybrid architectures [1][4][5].

Group 1: Transformer Architecture and Efficiency
- The Transformer architecture is expected to remain the foundation of the AI ecosystem for at least the next few years, supported by mature toolchains and optimization strategies [4].
- Recent developments point toward hybrid architectures and efficiency improvements rather than a complete overhaul of existing models [5].
- The industry is increasingly focused on mixed architectures and efficiency, as demonstrated by models like DeepSeek V3 and R1, which use mixture of experts (MoE) and multi-head latent attention (MLA) to cut inference costs while keeping parameter counts large [7].

Group 2: Linear and Sparse Attention Mechanisms
- The standard Transformer attention mechanism has O(N^2) complexity, so computational cost grows quadratically with context length [9].
- New models like Qwen3-Next and Kimi Linear adopt hybrid strategies that combine efficient linear layers with full attention layers to balance long-range dependencies against inference speed (see the sketch below) [14].

Group 3: Diffusion Language Models
- Diffusion language models (DLMs) are gaining attention for generating tokens quickly and cheaply through parallel generation, in contrast to the serial generation of autoregressive models [12].
- Despite these advantages, DLMs struggle to integrate tool calls within response chains because all tokens are generated simultaneously [15].
- Research indicates that DLMs may outperform autoregressive models when high-quality data is scarce, as they can benefit from multiple training epochs without overfitting [24][25].

Group 4: Data Scarcity and Learning Efficiency
- The "Crossover" concept suggests that while autoregressive models learn faster with ample data, DLMs excel when data is limited, achieving significant benchmark accuracy from relatively small datasets [27].
- DLMs show that more training epochs do not necessarily degrade downstream task performance, a potential advantage in an era of data scarcity [28].
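The complexity gap that these hybrid designs exploit can be shown in a few lines of numpy. A toy, non-causal sketch that is our own illustration rather than any specific model's code; the feature map `phi` is a placeholder for whatever map a real linear-attention model uses.

```python
# Sketch: why linear attention is O(N) in sequence length while standard
# attention is O(N^2). Non-causal, for illustration only.
import numpy as np

N, d = 1024, 64
Q, K, V = (np.random.randn(N, d) for _ in range(3))

# Full attention: materializes an N x N score matrix -> O(N^2 d) time.
scores = Q @ K.T / np.sqrt(d)
attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
attn /= attn.sum(axis=-1, keepdims=True)
out_full = attn @ V

def phi(x):
    # Toy positive feature map; real models use learned or kernel maps.
    return np.maximum(x, 0.0) + 1e-6

# Linear attention: compute phi(K)^T V first, a d x d matrix, so the cost
# is O(N d^2) and never touches an N x N matrix.
kv = phi(K).T @ V                  # (d, d)
z = phi(K).sum(axis=0)             # (d,) normalizer
out_linear = (phi(Q) @ kv) / (phi(Q) @ z)[:, None]
```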
After Unitree, Zivariable (自变量) Is the Only Company Backed by Three Major Tech Firms: An Embodied Model Is Not DeepSeek Stuffed into a Robot
机器之心· 2026-01-14 07:18
Core Viewpoint
- The article discusses the evolution of embodied intelligence, arguing that the next battleground will be the robot's "brain," which is crucial for autonomous operation in the physical world [1][4].

Group 1: Investment and Development
- Zivariable has recently raised $1 billion in funding from ByteDance and Sequoia, indicating strong investor interest in its approach to robotic intelligence [1].
- Zivariable focuses on developing a foundational model for physical intelligence that operates independently of existing AI models, aiming for a paradigm shift in how robots interact with the physical world [7][12].

Group 2: Challenges in Embodied Intelligence
- The complexity of physical tasks requires a brain backed by a physical-world foundational model, which is distinct from merely applying existing AI models [1][4].
- Current AI models struggle with subtle physical differences that only become apparent through real-world interaction, underscoring the need for a model that can process long action sequences and understand causality over time [6][7].

Group 3: Model Development Approach
- Zivariable advocates an end-to-end architecture that supports a holistic understanding of physical interactions, in contrast to modular approaches that often lose critical details [9][10].
- The company emphasizes a general-purpose model that learns the common structures of the physical world, much as language models have [11].

Group 4: Unique Characteristics of Zivariable
- Zivariable is committed to in-house research, particularly on foundational models, believing the next phase of competition in embodied intelligence will center on the ability to build data loops and evolve models [15][16].
- The company has developed two core models, WALL-A and WALL-OSS, which integrate multiple aspects of embodied intelligence and have been deployed in real-world scenarios [16][13].

Group 5: The Path Forward
- Building a physical-world foundational model is likened to retracing the developmental path of human infants, since it involves learning complex physical interactions that are not easily articulated [22].
- Zivariable's journey is described as long and challenging but ultimately rewarding, as it aims to redefine what robots can do in the physical world [23].
Just 10 Days? The Code for Anthropic's Latest Agent Cowork Was Written Entirely by Claude
机器之心· 2026-01-14 05:37
Editors: 冷猫, +0

Anthropic recently released Cowork, a brand-new agent tool that promises to let ordinary users handle non-technical tasks as easily as developers handle code with Claude Code.

Even more striking, Cowork was built in just a week and a half.

According to the official introduction, Cowork's capabilities include, but are not limited to: automatically organizing the Downloads folder, generating spreadsheets from screenshots, drafting reports from scattered notes, and even connecting to existing tools such as Google Calendar to generate documents or presentations directly.

According to Claude Code creator Boris Cherny, all of Cowork's code was written by Claude Code. That makes it the best possible advertisement for Claude Code: while other AI companies are still building ecosystems through acquisitions, Anthropic has started letting AI beget AI.

Cowork is a simplified version of Claude Code designed for ordinary users. It is currently a research preview, open only to Claude Max subscribers on the macOS desktop. Users simply grant access to specific folders, and the AI can then read, edit, or create files autonomously via natural-language instructions. It can make plans and execute tasks in parallel, updating progress in real time and inviting the user to step in with guidance.

...
AAAI 2026 | AP2O-Coder Gives Large Models an "Error Book" to Drill Problems by Type as Efficiently as Humans Do
机器之心· 2026-01-14 05:37
Core Insights
- The article discusses the Adaptive Progressive Preference Optimization (AP2O) method and its framework, AP2O-Coder, aimed at improving code generation and error correction in large language models (LLMs) [3][5][6].

Group 1: Existing Challenges and AP2O-Coder Design
- Current offline preference optimization methods face three main challenges: lack of error-type awareness, insufficient training focus, and weak dynamic adaptation [5][12].
- AP2O-Coder addresses these challenges with a systematic learning process modeled on human error-correction strategies, combining error analysis with targeted optimization [6][8].

Group 2: AP2O-Coder Framework and Mechanism
- The framework consists of four key steps, sketched after this summary: code generation evaluation, error diagnosis analysis, progressive preference optimization, and adaptive error replay [10][11][14].
- Code generation evaluation builds an initial training dataset by generating candidate answers for programming tasks and labeling them as pass or fail [10].
- Error diagnosis analysis uses programming-language-specific tools to identify and categorize errors, creating a structured "error book" for targeted optimization [11].
- Progressive preference optimization corrects errors in a structured order, prioritizing error types according to model size [13].
- Adaptive error replay periodically evaluates model performance and shifts the training data distribution toward current weaknesses [14].

Group 3: Experimental Validation and Results
- The research team validated the method on six mainstream LLMs, achieving performance gains of 2.8% to 3.4% on the EvalPlus benchmark, even for large models [16][18].
- AP2O-Coder significantly reduced error occurrence rates and improved generalization across models [22][29].
- It also improved sample efficiency, needing only 4% to 60% of the preference data required by traditional methods to reach peak performance [25].

Group 4: Adaptability of General LLMs
- AP2O-Coder works not only for code-specific LLMs but also for adapting general LLMs to coding tasks, as shown by significant gains on models like Qwen3 and Llama3 [28].
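A hedged sketch of the "error book" loop described above. `generate`, `run_tests`, `classify_error`, and `preference_update` are illustrative placeholders we introduce for the sketch, not the paper's implementation.

```python
# Minimal sketch of one AP2O-style round: diagnose errors by type, then
# optimize type by type. All helper functions are hypothetical placeholders.
from collections import Counter

def ap2o_round(model, tasks, n_samples=4):
    error_book = Counter()          # error type -> frequency
    pairs = []                      # (error_type, prompt, chosen, rejected)
    for task in tasks:
        passed, failed = [], []
        for code in generate(model, task.prompt, n=n_samples):
            ok, err = run_tests(code, task.tests)
            (passed if ok else failed).append((code, err))
        for code, err in failed:
            etype = classify_error(err)   # e.g. "SyntaxError", "WrongAnswer"
            error_book[etype] += 1
            if passed:                    # pair a passing sample with this failure
                pairs.append((etype, task.prompt, passed[0][0], code))
    # Progressive preference optimization: most frequent error types first.
    for etype, _ in error_book.most_common():
        subset = [(p, c, r) for t, p, c, r in pairs if t == etype]
        if subset:
            preference_update(model, subset)  # e.g. a DPO-style step
    return error_book               # reused to reweight the next round (replay)
```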
5 Million Views: 1X Puts a Real "World Model" into Its Robot NEO
机器之心· 2026-01-14 01:39
Core Viewpoint
- The article discusses advances in the home humanoid robot NEO, particularly its new brain, the 1X World Model, which lets NEO learn and perform tasks more autonomously by understanding the physical world through video training [3][4][11].

Group 1: Technological Advancements
- NEO has evolved from executing pre-programmed actions to "imagining" tasks: it generates a video of successful task completion in its mind before executing it [4][6].
- The 1X World Model (1XWM) uses video pre-training to let NEO generalize to new objects, movements, and tasks without extensive task-specific training data [11][21].
- The model is built on a 14-billion-parameter generative video model, adapted to NEO's physical characteristics through a multi-stage training process [16][18].

Group 2: Training and Evaluation
- Training uses 900 hours of first-person human video to align the model with human-like manipulation behavior, followed by fine-tuning on 70 hours of robot data [18][19].
- Evaluations show that 1XWM can perform tasks it has never encountered, with generated videos closely matching real-world execution [24][30].
- High-quality captions and first-person data are important for video generation quality and task success rates; more detailed descriptions improve the model's performance [39][40].

Group 3: Practical Applications
- NEO has been tested on tasks requiring complex interaction and coordination, demonstrating its ability to adapt and learn from video pre-training [28][30].
- Success rates are stable on both in-distribution and out-of-distribution tasks, although some fine-manipulation tasks remain challenging [30][32].
- The article suggests that generated-video quality correlates with task success, so iterative generation and selection can improve performance (see the sketch below) [32][39].
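The generate-then-select idea hinted at in Group 3 can be sketched in a few lines. This is our own illustration with hypothetical interfaces (`world_model.generate_video`, `score_video`, `robot.execute`), not 1X's actual code.

```python
# Sketch of best-of-n plan selection by scoring imagined videos, as the
# article hints. All interfaces here are hypothetical placeholders.
def select_and_execute(task, candidate_plans, world_model, score_video, robot):
    scored = []
    for plan in candidate_plans:
        video = world_model.generate_video(task, plan)   # "imagine" the outcome
        scored.append((score_video(video, task), plan))  # proxy for success
    best_score, best_plan = max(scored, key=lambda sp: sp[0])
    return robot.execute(best_plan)
```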
Out of the "Black Box": Liu Yong's Team at Renmin University Publishes a New Survey on the Theory and Mechanisms of Large Language Models
机器之心· 2026-01-14 01:39
Core Insights
- The article discusses the rapid growth of large language models (LLMs) and the paradigm shift in artificial intelligence, highlighting the paradox between their practical success and our theoretical understanding [2][5][6].
- A unified lifecycle-based taxonomy is proposed, organizing LLM theoretical research into six stages: Data Preparation, Model Preparation, Training, Alignment, Inference, and Evaluation [2][7][10].

Group 1: Lifecycle Stages
- **Data Preparation Stage**: optimizing data utilization, quantifying how data features affect model capabilities, and analyzing data mixing strategies, deduplication, and the relationship between memorization and performance [11][18].
- **Model Preparation Stage**: theoretically evaluating architectural capabilities, understanding the limits of Transformer structures, and designing new architectures from an optimization perspective [11][21].
- **Training Stage**: investigating how simple learning objectives give rise to complex emergent capabilities, analyzing the nature of Scaling Laws and the benefits of pre-training [11][24].

Group 2: Advanced Theoretical Insights
- **Alignment Stage**: the mathematical feasibility of robust alignment, the dynamics of Reinforcement Learning from Human Feedback (RLHF), and the challenges of achieving "Superalignment" [11][27].
- **Inference Stage**: how frozen-weight models simulate learning at test time, including analyses of prompt engineering and in-context learning mechanisms [11][30].
- **Evaluation Stage**: theoretically defining and measuring complex human values, the validity of benchmark tests, and the reliability of LLM-as-a-Judge [11][33].

Group 3: Challenges and Future Directions
- The article identifies frontier challenges such as the mathematical boundaries of safety guarantees, the implications of synthetic data, and the risks of data contamination [11][18][24].
- It emphasizes the need for a structured roadmap to move LLM research from engineering heuristics to a rigorous scientific discipline, addressing the remaining theoretical gaps [2][35].
A Vision Model That Grasps Semantics and Recovers Detail: NTU and SenseTime Propose the Prism Hypothesis
机器之心· 2026-01-13 10:04
Background: why are "understanding semantics" and "recovering detail" so hard to get at the same time?

The authors, from Nanyang Technological University (MMLab) and SenseTime Research, propose the Prism Hypothesis and Unified Autoencoding (UAE), attempting to use a unified "frequency spectrum" perspective to genuinely resolve the representational conflict between semantic encoders and pixel encoders.

In vision foundation models, we routinely rely on two kinds of capability at once: semantic understanding and faithful pixel-level reconstruction.

The practical problem is that many systems are forced to bolt two sets of representations together, one for semantics and one for pixels: training efficiency drops, the representations interfere with each other, and it is hard to obtain a single unified latent space that is strong on both semantics and detail.

The paper traces this tension to a more fundamental question: how should the world's information be represented so that semantics are shared while each modality keeps its own fine-grained detail?

Core insight: the Prism Hypothesis

Paper title: The Prism Hypothesis: Harmonizing Semantic and Pixel Representations via Unified Autoencoding
Code repository: https://github.com/Weich ...
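The frequency-spectrum intuition (low frequencies carry coarse, semantic-scale structure; high frequencies carry fine detail) can be shown with a small FFT experiment. This is our own toy demonstration of the intuition only, not the paper's UAE method.

```python
# Toy illustration of the frequency-spectrum view: split an image into a
# low-frequency band (coarse structure) and a high-frequency residual
# (edges, textures). Our own demo, not the paper's implementation.
import numpy as np

def frequency_split(img, cutoff=0.1):
    """Split a grayscale image (H, W) into low- and high-frequency parts."""
    f = np.fft.fftshift(np.fft.fft2(img))
    h, w = img.shape
    yy, xx = np.ogrid[:h, :w]
    # Circular low-pass mask around the spectrum center.
    dist = np.hypot(yy - h / 2, xx - w / 2)
    mask = dist <= cutoff * min(h, w)
    low = np.fft.ifft2(np.fft.ifftshift(f * mask)).real
    high = img - low                 # residual: fine detail
    return low, high

img = np.random.rand(64, 64)         # stand-in for a real image
low, high = frequency_split(img)
assert np.allclose(low + high, img)  # the two bands reconstruct the input
```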
Meet at AAAI 2026 | Shanghai AI Laboratory 北极星 X 星启 Meetup (Registration Open)
机器之心· 2026-01-13 10:04
From January 20 to 27, 2026, the 40th AAAI Conference on Artificial Intelligence (AAAI 2026) will be held in Singapore. During the conference, the Shanghai Artificial Intelligence Laboratory (Shanghai AI Laboratory) will host the "北极星 X 星启 Meetup & 云帆 AI Talent Meetup." Lab experts in relevant fields will be on site for in-depth exchanges and discussions with peers from around the world.

AAAI paper authors; professors and postdocs in AI, the natural sciences, engineering, and other interdisciplinary fields; and innovation practitioners from academia and industry are all warmly invited to explore frontier technologies together. To date, the 北极星 meetup series has been held in China, the United States, Singapore, Canada, and elsewhere, connecting thousands of AI talents with global opportunities.

1 Registration

The meetup is invitation-only. Please scan the QR code below or click "Read the original" at the end of this post to submit your registration. Approved applicants will receive an invitation email. Seats are limited, so please register promptly.

Deadline: January 19, 12:00 p.m.

Inquiries: luochen@pjlab.org.cn

2 Event Information

Highlights:

Time: January 22, 17:30-20:30 (Singapore time)

Venue: central Singapore

Agenda at a glance

Top academic talks: Shanghai AI Laboratory scientists share innovations and give keynote presentations on frontier topics to spark research inspiration.

Lab fast track: with the Shanghai AI Laboratory team ...
No Cloud, No Rented GPUs: How to Elegantly Fine-Tune Qwen-VL-30B Locally?
机器之心· 2026-01-13 04:08
Core Viewpoint
- The article discusses the challenges of and solutions for deploying a 30B-parameter multimodal AI model locally, emphasizing the need for a powerful yet compact computing platform that balances memory capacity against processing power [1][12][51].

Model Selection
- A 30B-parameter model is identified as the sweet spot for understanding complex data: it outperforms smaller models while remaining more manageable than larger ones [2][3].
- The "30B parameters" label is deceptive, since high-resolution image processing significantly increases memory requirements beyond the weights themselves [4][6].

Hardware Requirements
- Substantial memory is essential: 24GB of VRAM is insufficient for fine-tuning a 30B model without sacrificing performance (see the back-of-the-envelope math below) [10][12].
- The Lenovo ThinkStation PGX is introduced as a compact solution with 128GB of unified memory, allowing efficient processing without the constraints of traditional discrete-GPU setups [19][21].

Performance and Efficiency
- The ThinkStation PGX's architecture shares memory between CPU and GPU, letting developers run large models without running out of memory [25][26].
- The article details a successful fine-tuning run in which training loss fell from 4.03 to 1.06, demonstrating the system's effectiveness [34].

Advantages of the Lenovo ThinkStation PGX
- The PGX is positioned as the only desktop solution that comfortably runs 30B multimodal models, a distinctive advantage in the market [38].
- Its design incorporates advanced cooling to manage high power consumption, ensuring stable performance during extended workloads [41].

Market Position and Pricing
- The ThinkStation PGX is priced at 31,999 yuan for the 1TB version and 36,999 yuan for the 4TB version, a cost-effective alternative to high-end GPUs or cloud instances [51][52].
- For developers facing memory constraints, the PGX is presented as a worthwhile investment that avoids the typical configuration headaches [52][53].
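Why 24GB falls short while 128GB of unified memory is comfortable follows from rough memory arithmetic. A back-of-the-envelope sketch using common rules of thumb (our own estimates, not figures from the article):

```python
# Rough memory math for fine-tuning a 30B-parameter model.
# Rule-of-thumb estimates only; activations come on top of all figures.
params = 30e9

weights_bf16 = params * 2    # 2 bytes per parameter
adam_states  = params * 8    # fp32 first and second moments, 4 + 4 bytes
grads_bf16   = params * 2

full_ft_gb = (weights_bf16 + adam_states + grads_bf16) / 1e9
print(f"Full fine-tune, before activations: ~{full_ft_gb:.0f} GB")  # ~360 GB

# LoRA-style fine-tune: frozen bf16 weights plus a small adapter and its
# optimizer state (a ~2 GB guess here); still far beyond 24 GB of VRAM,
# but well within 128 GB of unified memory.
lora_gb = weights_bf16 / 1e9 + 2
print(f"LoRA fine-tune, before activations: ~{lora_gb:.0f} GB")      # ~62 GB
```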