Chain-of-Thought Reasoning
Baidu X-Driver: A VLA with Closed-Loop Evaluation
自动驾驶之心· 2025-12-28 03:30
Author | AIming  Editor | 自动驾驶之心  Original link: https://zhuanlan.zhihu.com/p/1907444302092698547

In the earlier VLA 01 and 02 installments of this series, neither EMMA nor OpenEMMA was validated in a closed-loop setting. This is actually critical, because open-loop and closed-loop evaluation are not the same thing at all, and open-loop metrics are not reliable; 志琦's article discussed this problem early on. Recently, X-Driver: Explainable Autonomous Driving with Vision-Language Models, from Harbin Institute of Technology and Baidu, finally reports closed-loop evaluation metrics. Because closed-loop evaluation requires actually controlling the vehicle, such metrics are a more reasonable way to measure the performance of an end-to-end approach. Today we continue studying, to see how closed-loop evaluation is done.

X-Driver Motivation: Current MLLM-based frameworks are hard to evaluate in closed loop; in real-world driving scenarios they suffer from hallucinations and lack stable trajectory outputs, and existing approaches still achieve low success rates in closed-loop evaluation. Therefore, how to ...
Bosch's Latest: A 41-Page Survey of Trajectory Planning for Autonomous Driving
自动驾驶之心· 2025-12-05 00:03
Core Insights
- The article discusses the advancements and applications of foundation models (FMs) in trajectory planning for autonomous driving, highlighting their potential to enhance understanding and decision-making in complex driving scenarios [4][5][11].

Background Overview
- Foundation models are large-scale models that learn representations from vast amounts of data, applicable to various downstream tasks, including language and vision [4].
- The study emphasizes the importance of FMs in the autonomous driving sector, particularly in trajectory planning, which is deemed the core task of driving [8][11].

Research Contributions
- A classification system for methods utilizing FMs in autonomous driving trajectory planning is proposed, analyzing 37 existing methods to provide a structured understanding of the field [11][12].
- The research evaluates these methods in terms of code and data openness, offering practical references for reproducibility and reusability [12].

Methodological Insights
- The article categorizes methods into two main types: FMs customized for trajectory planning and FMs that guide trajectory planning [16][19].
- Customized FMs leverage pre-trained models, adapting them for specific driving tasks, while guiding FMs enhance existing trajectory planning models through knowledge transfer [19][20].

Application of Foundation Models
- FMs can enhance trajectory planning capabilities through various approaches, including fine-tuning existing models, utilizing chain-of-thought reasoning, and enabling language and action interactions [9][19].
- The study identifies 22 methods focused on customizing FMs for trajectory planning, detailing their functionalities and the importance of prompt design in model performance [20][32].

Challenges and Future Directions
- The article outlines key challenges in deploying FMs in autonomous driving, such as reasoning costs, model size, and the need for suitable datasets for fine-tuning [5][12].
- Future research directions include addressing the efficiency, robustness, and transferability of models from simulation to real-world applications [12][14].

Comparative Analysis
- The study contrasts its findings with existing literature, noting that while previous reviews cover various aspects of autonomous driving, this research specifically focuses on the application of FMs in trajectory planning [13][14].

Data and Model Design
- The article discusses the importance of data curation for training FMs, emphasizing the need for structured datasets that pair sensor data with trajectories [24][28].
- It also highlights different model design strategies, including the use of existing vision-language models and the combination of visual encoders with large language models [27][29].

Language and Action Interaction
- The research explores models that incorporate language interaction capabilities, detailing how these models utilize visual question-answering datasets to enhance driving performance [38][39].
- It emphasizes the significance of training datasets and evaluation metrics in assessing the effectiveness of language interaction in trajectory planning [39][41].
Beyond ORION! CoT4AD: An Explicit Chain-of-Thought Reasoning VLA Model (Latest from Peking University)
自动驾驶之心· 2025-12-02 00:03
Core Insights
- The article introduces CoT4AD, a new Vision-Language-Action (VLA) framework designed to enhance logical and causal reasoning capabilities in autonomous driving scenarios, addressing limitations in existing VLA models [1][3][10].

Background Review
- Autonomous driving is a key research area in AI and robotics, promising improvements in traffic safety and efficiency, and playing a crucial role in smart city and intelligent transportation system development [2].
- Traditional modular architectures in autonomous driving face challenges such as error accumulation and limited generalization, leading to the emergence of end-to-end paradigms that utilize unified learning frameworks [2][3].

CoT4AD Framework
- CoT4AD integrates chain-of-thought reasoning into end-to-end autonomous driving, allowing for explicit or implicit reasoning through a series of downstream tasks tailored for driving scenarios [3][10].
- The framework combines perception, language reasoning, future prediction, and trajectory planning, enabling the generation of explicit reasoning steps [6][10].

Experimental Results
- CoT4AD was evaluated on the nuScenes and Bench2Drive datasets, achieving state-of-the-art performance in both open-loop and closed-loop assessments and outperforming existing LLM-based and end-to-end methods [10][19].
- On the nuScenes dataset, CoT4AD achieved L2 distance errors of 0.12 m, 0.24 m, and 0.53 m at 1 s, 2 s, and 3 s respectively, with an average collision rate of 0.10% [17][18].

Contributions of CoT4AD
- The model's design allows for robust multi-task processing and future trajectory prediction, leveraging a diffusion model integrated with chain-of-thought reasoning [10][12].
- CoT4AD demonstrates superior performance in complex driving scenarios, enhancing decision-making consistency and reliability across diverse environments [19][23].

Ablation Studies
- The effectiveness of components such as perception tokenizers and the chain-of-thought design was validated through ablation studies, showing significant performance improvements when these elements were included [26][28].
- The model's ability to predict future scenarios was found to be crucial, with optimal performance achieved when predicting four future scenarios [29].

Conclusion
- CoT4AD represents a significant advancement in autonomous driving technology, demonstrating enhanced reasoning capabilities and superior performance compared to existing methods, while also highlighting areas for future research to improve computational efficiency [30][32].
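The open-loop L2 numbers quoted above are average Euclidean distances between predicted and ground-truth trajectory waypoints at fixed time horizons. A minimal sketch of how such a metric is commonly computed is below; the function name, the 0.5 s waypoint spacing, and the toy data are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def l2_displacement_error(pred, gt, horizons=(2, 4, 6)):
    """Average L2 error at selected waypoint indices.

    pred, gt: arrays of shape (N, T, 2) holding (x, y) waypoints,
    assumed sampled every 0.5 s, so index 2 ~ 1 s, 4 ~ 2 s, 6 ~ 3 s.
    Returns {horizon index: mean L2 distance in metres}.
    """
    pred, gt = np.asarray(pred, float), np.asarray(gt, float)
    dists = np.linalg.norm(pred - gt, axis=-1)        # (N, T)
    return {h: float(dists[:, h - 1].mean()) for h in horizons}

# toy example: one trajectory with a constant 0.1 m lateral offset
pred = np.zeros((1, 6, 2)); pred[..., 0] = 0.1
gt = np.zeros((1, 6, 2))
errs = l2_displacement_error(pred, gt)
# the same 0.1 m offset shows up at every horizon here
```

Benchmarks differ in whether they average errors up to a horizon or report the error at the horizon only, so published numbers are not always directly comparable.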
Latest from Peking University! MobileVLA-R1: Beyond Robotic Arms, Where Do Mobile Robots' VLA Capabilities Stand?
具身智能之心· 2025-11-30 03:03
Core Insights
- The article discusses the introduction of MobileVLA-R1, a new framework for quadruped robots that bridges the gap between high-level semantic reasoning and low-level action control, addressing the challenges of stability and interpretability in existing methods [1][2][21].

Group 1: Need for Reconstruction of the VLA Framework
- Current quadruped robots face two main challenges: a semantic-control gap leading to instability in command execution, and a lack of traceable reasoning that complicates error diagnosis [2].
- MobileVLA-R1's breakthrough lies in decoupling reasoning from action execution, allowing robots to "think clearly" before "acting accurately," enhancing both interpretability and control robustness [2][23].

Group 2: Implementation of MobileVLA-R1
- MobileVLA-R1 employs a structured CoT dataset, a two-stage training paradigm, and multi-modal perception fusion to achieve coherent reasoning, stable control, and strong generalization [4][6].
- The structured CoT dataset includes 18K episode-level samples, 78K step-level samples, and 38K navigation-specific samples, filling the gap in reasoning supervision from instruction to action [4][5].

Group 3: Performance Evaluation
- In navigation tasks, MobileVLA-R1 achieved success rates of 68.3% and 71.5% on the R2R-CE and RxR-CE datasets respectively, outperforming existing methods by an average of 5% [10].
- For quadruped control tasks, it achieved an average success rate of 73% across six locomotion and manipulation tasks, significantly surpassing baseline models [12][13].

Group 4: Real-World Deployment
- MobileVLA-R1 was tested on the Unitree Go2 quadruped robot in various environments, demonstrating robust adaptation to complex scenarios with a success rate of 86%-91% for complex instructions [14][18].
- The integration of depth and point-cloud encoders improved navigation success rates by 5.8%, highlighting the importance of 3D spatial information for scene understanding [19][20].

Group 5: Key Conclusions and Future Directions
- MobileVLA-R1 innovatively integrates chain-of-thought reasoning with reinforcement learning, addressing the industry's dilemma of choosing between interpretability and execution stability [21][23].
- Future directions include expanding the action space for more precise tasks, reducing reasoning latency through model optimization, and enhancing self-supervised learning to decrease reliance on labeled data [23].
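The structured CoT supervision described above pairs each instruction with intermediate reasoning before the executed action. A minimal sketch of what a step-level sample record could look like; all field names and values are hypothetical, since the article does not specify the dataset schema:

```python
from dataclasses import dataclass

@dataclass
class CoTStep:
    """One step-level sample: instruction -> reasoning -> action."""
    instruction: str   # natural-language command
    observation: str   # reference to the visual observation (e.g. a frame id)
    reasoning: str     # intermediate chain-of-thought supervision
    action: str        # low-level action label the controller executes

sample = CoTStep(
    instruction="Go to the red door",
    observation="frame_0042",
    reasoning="The red door is ahead on the left; move forward, then turn left.",
    action="forward",
)
```

The point of such a record is that the reasoning field is itself a training target, so errors can be traced to a specific reasoning step rather than only to the final action.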
The Better AI Reasons, the Easier It Is to Fool? "Chain-of-Thought Hijacking" Attacks Succeed More Than 90% of the Time
机器之心· 2025-11-03 08:45
Core Insights
- The article discusses a new attack method called Chain-of-Thought Hijacking, which exploits the reasoning capabilities of AI models to bypass their safety mechanisms [1][2][5].

Group 1: Attack Mechanism
- Chain-of-Thought Hijacking inserts a lengthy harmless reasoning sequence before a harmful request, effectively diluting the model's refusal signals and allowing harmful instructions to slip through [2][5].
- The attack has shown high success rates on various models, including Gemini 2.5 Pro (99%), GPT o4 mini (94%), Grok 3 mini (100%), and Claude 4 Sonnet (94%) [2][11].

Group 2: Experimental Setup
- The research utilized the HarmBench benchmark to evaluate the effectiveness of the attack against several reasoning models, comparing it to baseline methods such as Mousetrap, H-CoT, and AutoRAN [11][15].
- The team implemented an automated process using a supporting LLM to generate candidate reasoning prefaces and integrate harmful content, optimizing the prompts without accessing the model's internal parameters [6][7].

Group 3: Findings and Implications
- The results indicate that while chain-of-thought reasoning can enhance model accuracy, it also introduces new security vulnerabilities, challenging the assumption that more reasoning leads to greater robustness [26].
- The study suggests that existing defenses are limited and may need to embed security within the reasoning process itself, such as monitoring refusal activations across layers or ensuring attention to potentially harmful text spans [26].
Can AI Go on a "Pilgrimage"? VIR-Bench, a New Evaluation Benchmark for Multimodal Large Models
机器之心· 2025-10-15 04:08
Core Insights
- The article discusses the development of a new multimodal large-model evaluation benchmark called VIR-Bench, aimed at assessing AI's ability to understand travel videos in terms of geographical locations and temporal sequences [4][20].
- The research emphasizes the importance of reconstructing travel itineraries from videos, which requires models to comprehend both geographic and temporal relationships [20].

Group 1: VIR-Bench Overview
- VIR-Bench is designed to evaluate AI's understanding of travel vlogs by generating a visiting-order graph that represents the sequence and relationships of visited locations [6][9].
- The visiting-order graph consists of nodes representing locations categorized into three levels: Prefecture, City, and Point of Interest (POI) [7][9].

Group 2: Task Design and Dataset
- The task is divided into two sub-tasks: node prediction, where the model identifies all visited locations, and edge prediction, where it determines the relationships between these locations [10][11].
- A dataset of 200 travel videos was constructed, covering 3,689 POIs across 43 prefectures in Japan, with detailed annotations for each video [17][13].

Group 3: Experimental Results and Challenges
- Current models, particularly open-source ones, lag behind commercial models in POI node recognition and transition-edge prediction, with transition-edge prediction being notably challenging [16][18].
- Model performance improves significantly with increased scale and the inclusion of geographic pre-training, highlighting the importance of these factors for accuracy [16][18].

Group 4: Future Directions
- The research indicates that while current models struggle with long-range reasoning and temporal understanding, there are clear pathways for improvement, such as enhancing spatial awareness and integrating multimodal information [20][18].
- The ultimate goal is for AI not only to analyze videos but also to act within the world, aligning with applications in robotics and autonomous systems [20][18].
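The visiting-order graph described above, with nodes at Prefecture/City/POI granularity and directed edges encoding visit order, can be represented minimally as follows. The class and method names are illustrative assumptions, not the benchmark's actual data format:

```python
from dataclasses import dataclass, field
from enum import Enum

class Level(Enum):
    PREFECTURE = "prefecture"
    CITY = "city"
    POI = "poi"

@dataclass
class VisitGraph:
    """Directed graph of locations visited in a travel video."""
    nodes: dict = field(default_factory=dict)   # name -> Level
    edges: set = field(default_factory=set)     # (src, dst) visit transitions

    def add_visit(self, name, level):
        # node prediction task: recover this set of visited locations
        self.nodes[name] = level

    def add_transition(self, src, dst):
        # edge prediction task: recover these src -> dst orderings
        assert src in self.nodes and dst in self.nodes
        self.edges.add((src, dst))

g = VisitGraph()
g.add_visit("Kyoto", Level.PREFECTURE)
g.add_visit("Fushimi Inari Taisha", Level.POI)
g.add_transition("Kyoto", "Fushimi Inari Taisha")
```

Splitting evaluation into node recovery and edge recovery, as the benchmark does, lets a scorer separately credit recognizing *where* the video went and *in what order*.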
ICCV 2025 | UV-CoT: A Breakthrough in Unsupervised Visual Reasoning, with Preference Optimization Reshaping Image-Level Chain-of-Thought
机器之心· 2025-07-28 04:24
Core Viewpoint
- The article discusses the introduction of a novel unsupervised visual reasoning framework called UV-CoT, which enhances model reasoning capabilities and interpretability in visual understanding tasks by leveraging a chain-of-thought (CoT) approach [2][3][25].

Group 1: Background and Challenges
- Existing models rely on supervised fine-tuning (SFT) strategies that require extensive labeled data, leading to high annotation costs and limited scalability [6][7].
- SFT methods face challenges such as high labor costs for annotating key image regions and reasoning paths, and limited generalization due to reliance on a single type of training signal [7].

Group 2: UV-CoT Framework
- UV-CoT is designed to mimic human visual understanding by focusing on "key regions → reasoning process," employing an unsupervised data-generation and preference-optimization mechanism [4][3].
- The framework utilizes an automated preference-data generation and evaluation process, guided by an improved preference-optimization algorithm called Score-DPO (sDPO), to achieve unsupervised image-level chain-of-thought learning [8][11].

Group 3: Methodology
- UV-CoT generates diverse intermediate reasoning responses for image-question pairs using a target model and an evaluation model, which scores the selected regions and their impact on subsequent answers [13].
- The preference dataset is constructed by randomly selecting preference pairs from the generated responses, retaining the highest-scoring response chains for further reasoning [14].

Group 4: Performance and Results
- UV-CoT demonstrates significant performance improvements over existing supervised chain-of-thought models, outperforming models like Visual-CoT-7B and LLaVA-1.5-7B across six benchmarks [20][22].
- The self-evaluation capability of UV-CoT leads to high-quality bounding-box generation, surpassing LLaVA-1.5-7B by 4.8% and closely approaching the performance of the 12B model OmniLMM-12B [23].

Group 5: Conclusion
- UV-CoT presents an innovative approach to unsupervised visual reasoning, eliminating the dependency on manual annotations and enabling automatic identification and reasoning optimization of key image regions, laying a solid foundation for future research in unsupervised visual understanding [25].
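The article names Score-DPO (sDPO) as an improved preference-optimization algorithm that folds evaluator scores into preference learning. The sketch below is an illustrative guess at how scores could modulate a standard DPO-style objective, weighting each pair's logistic term by the evaluator's score margin; it is not the paper's actual formulation, and all names are hypothetical:

```python
import numpy as np

def sdpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
              score_w, score_l, beta=0.1):
    """Score-weighted DPO-style loss (hypothetical sketch).

    logp_w / logp_l         : policy log-probs of preferred / rejected responses
    ref_logp_w / ref_logp_l : the same under a frozen reference model
    score_w / score_l       : evaluation-model scores for the two responses
    Pairs the evaluator separates more strongly contribute more to the loss.
    """
    pref = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    margin = np.maximum(score_w - score_l, 0.0)
    log_sigmoid = -np.logaddexp(0.0, -pref)   # numerically stable log sigmoid
    return float(-(margin * log_sigmoid).mean())

# preferred response is more likely under the policy than under the reference
loss = sdpo_loss(np.array([-1.0]), np.array([-2.0]),
                 np.array([-1.5]), np.array([-1.5]),
                 score_w=0.9, score_l=0.4)
```

Zeroing the weight when the score margin is non-positive (the `np.maximum`) discards pairs where the evaluator does not actually prefer the "chosen" response, one plausible way to keep noisy self-generated preferences from dominating training.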
3D Chip Stacking: New Approaches
半导体行业观察· 2025-07-01 01:03
Core Viewpoint
- The next significant leap in semiconductor packaging will require a series of new technologies, processes, and materials that collectively achieve an order-of-magnitude performance improvement, which is crucial for the AI era [1].

Group 1: Advances in Cooling Technologies
- Chip-level liquid cooling is emerging as forced-air cooling reaches its limits, with up to 40% of power used for current delivery and heat dissipation [4].
- TSMC's silicon integrated micro-cooler (IMEC-Si) is being tested for reliability, designed to handle over 3,000 watts of uniform power dissipation under specific conditions [6].
- Demand for direct liquid cooling is increasing, with innovative concepts such as using the chip itself as part of the coolant path being proposed [7].

Group 2: Hybrid Bonding and Interconnects
- Hybrid bonding with fine-pitch multilayer redistribution layers (RDL) is gaining attention as a cost-effective solution for high-speed interconnects [14].
- Intel's hybrid bonding can achieve pitches as small as 1 µm, which is critical for advanced applications [5][17].
- The transition from traditional dielectric materials to polymer/copper hybrid bonding is being explored to enhance performance [16].

Group 3: Backside Power Delivery
- Backside power delivery significantly reduces the voltage drop associated with transistor power supply, but it also exacerbates heat issues [19].
- IBM has developed an anisotropic model for precise heat-transfer calculations in back-end stacks, emphasizing the importance of thermal considerations in design [21].
- The implementation of backside power delivery is expected to yield a 10% to 30% reduction in thermal losses [23].

Group 4: Co-Packaged Optical Devices
- Demand for faster data networks is driving the integration of optical engines with GPUs and HBM in a single package, significantly increasing data-transmission speeds [26].
- Co-packaged optics (CPO) are expected to achieve a 32-fold increase in bandwidth by bringing optical engines closer to processors [26].
- However, challenges remain regarding thermal management and warpage sensitivity in CPO implementations [28].
A Clean Sweep Across 8 Datasets! Chain-of-Thought Reasoning Raises the Ceiling of Graph Learning Performance
量子位· 2025-06-08 03:40
Contributed by the GCoT team  量子位 | QbitAI

Can graph neural networks get even smarter? Chain-of-thought prompt learning has arrived!

Because graph data has a complex nonlinear structure and lacks textual information, the chain-of-thought (CoT) prompting methods used with language models are hard to apply to graph data directly.

To address this, researchers from Singapore Management University and the University of Science and Technology of China propose GCoT, the first CoT-style prompt-learning framework for text-free graph data.

Experiments show that GCoT comprehensively outperforms existing SOTA methods on few-shot node classification and graph classification across eight graph datasets, with the gains most pronounced in extreme few-shot settings of 1 to 5 samples.

How GCoT works: The core idea of GCoT is to split the downstream inference process into multiple inference steps. The researchers conducted comprehensive experiments on eight public datasets to evaluate and analyze GCoT.

Overall framework: The researchers divide chain-of-thought prompt learning into three parts:

2. Thought construction: To make effective use of multi-layer structural information, the researchers compute a weighted sum of each layer's embedding representations to obtain a fused "thought".

3. Thought-conditioned prompt learning: The designed "thought" captures the structural knowledge of the graph's nodes and is used to guide the next inference step. Since each node may have different characteristics ...
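The thought-construction step above, fusing per-layer GNN embeddings by weighted sum, can be sketched as follows. The function name and the uniform default weights are illustrative; GCoT learns its own fusion weights:

```python
import numpy as np

def fuse_thought(layer_embs, weights=None):
    """Fuse per-layer GNN node embeddings into a single 'thought'.

    layer_embs: list of (num_nodes, dim) arrays, one per GNN layer.
    weights: optional per-layer weights; learned in GCoT, defaulted
             here to a uniform average purely for illustration.
    Returns an array of shape (num_nodes, dim).
    """
    embs = np.stack(layer_embs)                       # (L, N, D)
    if weights is None:
        weights = np.full(len(layer_embs), 1.0 / len(layer_embs))
    w = np.asarray(weights, float).reshape(-1, 1, 1)  # broadcast over (N, D)
    return (w * embs).sum(axis=0)

# two layers, all-ones and all-threes; uniform fusion yields all-twos
thought = fuse_thought([np.ones((4, 8)), np.full((4, 8), 3.0)])
```

Weighting across layers lets the fused representation mix local (early-layer) and multi-hop (late-layer) structural information, which is what the "thought" is then used to condition the next inference step on.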
Haitai Ruisheng 2025-06-05
2025-06-06 02:37
Summary of Haitai Ruisheng Conference Call

Company Overview
- **Company**: Haitai Ruisheng
- **Industry**: AI and Data Processing

Key Financial Performance
- In 2024, Haitai Ruisheng achieved a net profit of 11.34 million yuan, returning to profitability, with operating cash flow of 28.73 million yuan, driven by increased multimodal data orders and improved gross margins on high-margin products and customized services [2][3][4].
- Total revenue for 2024 reached 237 million yuan, a year-on-year increase of 39.45%, with a gross margin of 66.46%, up 10.45 percentage points [3][4].
- The company reported a significant improvement in net profit, up 41.72 million yuan compared to the previous year [3].

Strategic Initiatives
- The company is actively expanding its overseas market presence, particularly in the smart-driving sector, aligning with automotive companies' international expansion trends [2][5].
- Haitai Ruisheng is focusing R&D investment on smart-driving data-processing platforms and intelligent data-operation platforms, achieving significant advancements in algorithm reserves and inference frameworks [2][6].

Technological Innovations
- The company has established a technology-led strategy, emphasizing R&D to overcome technical bottlenecks and enhance the production of training data [2][7].
- Innovations in smart-driving annotation include multi-frame point-cloud overlay and object-tracking algorithms, which improve annotation efficiency and support the transition to 4D annotation [2][8].
- The company has developed an in-house SLAM algorithm to optimize 4D point-cloud annotation of parking scenes, addressing complex 3D environments [8][9].

Voice Recognition and Natural Language Processing
- In collaboration with Tsinghua University, Haitai Ruisheng launched the Dolphin training project to improve ASR accuracy for Eastern languages, processing 212,000 hours of high-quality data covering 40 Eastern languages and 22 Chinese dialects [3][10].
- The company has introduced over 150 new training-data products, bringing its proprietary catalog to 1,716 products, and expanded its smart-voice offerings to include 11 new languages [10].

Future Plans
- For 2025, the company aims to continue driving growth through technology and product innovation, focusing on building an intelligent data-management platform and developing automated data-processing algorithms [12].
- The company plans to expand its multimodal data-product matrix and explore new areas such as embodied intelligence and vertical-industry applications [12].

Market Positioning
- Haitai Ruisheng is positioning itself to support national digital-economy strategies by collaborating with local governments and educational institutions to enhance data governance and talent development [13].
- The company is also expanding its resource network in the finance, healthcare, and manufacturing sectors to improve data-service capabilities [12][13].

Q1 2025 Financial Performance
- In Q1 2025, the company reported revenue of 69.81 million yuan, a 72% year-on-year increase, with a gross margin of 47.41% and a net profit of 370,000 yuan, an improvement of 101 million yuan compared to the previous year [14].