Vision-Language Models
From BAAI, HKUST, and collaborators! RoboMirror: letting robots first "understand" a video, then precisely replicate every action
具身智能之心· 2026-01-09 00:55
The main authors of this work come from the Beijing Academy of Artificial Intelligence (BAAI), the Hong Kong University of Science and Technology, Harbin Institute of Technology, Shanghai Jiao Tong University, Peking University, and the University of Sydney. The first author is Zhe Li, an intern at BAAI whose research focuses on embodied intelligence and 3D digital humans. The co-first author is Bo'an Zhu, a master's student at HKUST. The corresponding authors are Shanghang Zhang, researcher and assistant professor in the School of Computer Science at Peking University, and Cheng Chi, researcher at BAAI. Industry pain point: the weak coupling between video understanding and joint-level actuation. Imagine two scenarios: you wear a GoPro to record a first-person video of yourself mopping the floor or dribbling a ball, and a humanoid robot on the other end replicates the motion in sync, as if you were operating on site; or you open a third-person video, and the robot, without waiting for elaborate pose parsing, directly understands the intent behind the running and alternating punches in the video and fluidly imitates them. This is not science fiction; it is the "understand first, imitate second" paradigm for embodied intelligence that RoboMirror is bringing about. For a long time, robot imitation of human motion has always ...
Bosch's latest 41-page survey of trajectory planning for autonomous driving
自动驾驶之心· 2025-12-05 00:03
Core Insights
- The article discusses the advancements and applications of foundation models (FMs) in trajectory planning for autonomous driving, highlighting their potential to enhance understanding and decision-making in complex driving scenarios [4][5][11].

Background Overview
- Foundation models are large-scale models that learn representations from vast amounts of data, applicable to various downstream tasks, including language and vision [4].
- The study emphasizes the importance of FMs in the autonomous driving sector, particularly in trajectory planning, which is deemed the core task of driving [8][11].

Research Contributions
- A classification system for methods utilizing FMs in autonomous driving trajectory planning is proposed, analyzing 37 existing methods to provide a structured understanding of the field [11][12].
- The research evaluates the performance of these methods in terms of code and data openness, offering practical references for reproducibility and reusability [12].

Methodological Insights
- The article categorizes methods into two main types: FMs customized for trajectory planning and FMs that guide trajectory planning [16][19].
- Customized FMs leverage pre-trained models, adapting them for specific driving tasks, while guiding FMs enhance existing trajectory planning models through knowledge transfer [19][20].

Application of Foundation Models
- FMs can enhance trajectory planning capabilities through various approaches, including fine-tuning existing models, utilizing chain-of-thought reasoning, and enabling language and action interactions [9][19].
- The study identifies 22 methods focused on customizing FMs for trajectory planning, detailing their functionalities and the importance of prompt design in model performance [20][32].

Challenges and Future Directions
- The article outlines key challenges in deploying FMs in autonomous driving, such as reasoning costs, model size, and the need for suitable datasets for fine-tuning [5][12].
- Future research directions include addressing the efficiency, robustness, and transferability of models from simulation to real-world applications [12][14].

Comparative Analysis
- The study contrasts its findings with existing literature, noting that while previous reviews cover various aspects of autonomous driving, this research specifically focuses on the application of FMs in trajectory planning [13][14].

Data and Model Design
- The article discusses the importance of data curation for training FMs, emphasizing the need for structured datasets that include sensor data and trajectory pairs [24][28].
- It also highlights different model design strategies, including the use of existing visual language models and the combination of visual encoders with large language models [27][29].

Language and Action Interaction
- The research explores models that incorporate language interaction capabilities, detailing how these models utilize visual question-answering datasets to enhance driving performance [38][39].
- It emphasizes the significance of training datasets and evaluation metrics in assessing the effectiveness of language interaction in trajectory planning [39][41].
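To make the "customized FM" idea above concrete, here is a minimal, hypothetical sketch of what an (observation, trajectory) fine-tuning sample and its prompt might look like; the field names, waypoint horizon, and prompt wording are our own assumptions, not taken from the survey.

```python
# Hypothetical sketch (not from the survey): pairing sensor context with a
# future ego trajectory, the kind of (observation, trajectory) sample that
# fine-tuning a foundation model for planning would require.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class PlanningSample:
    front_image_path: str                    # camera frame at the current timestep
    ego_speed_mps: float                     # ego speed in metres per second
    scene_description: str                   # optional text context for the prompt
    future_xy_m: List[Tuple[float, float]]   # future waypoints in the ego frame, metres


def build_prompt(sample: PlanningSample) -> str:
    """Serialise one sample into a text prompt; the waypoint list is the training target."""
    return (
        "You are a driving planner. "
        f"Current speed: {sample.ego_speed_mps:.1f} m/s. "
        f"Scene: {sample.scene_description} "
        "Output the next waypoints as (x, y) in metres."
    )


sample = PlanningSample(
    front_image_path="frame_000123.jpg",
    ego_speed_mps=8.4,
    scene_description="two-lane road, vehicle ahead braking",
    future_xy_m=[(1.6, 0.0), (3.1, 0.1), (4.5, 0.2), (5.8, 0.3)],
)
print(build_prompt(sample))
```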
Two of the "three top Chinese AI media outlets" have already reported on Li Auto's recent AI progress
理想TOP2· 2025-11-09 14:59
Core Insights
- The article discusses the rising prominence of Li Auto in the autonomous driving sector, particularly its recent advancements presented at the ICCV 2025 conference, where it introduced a new paradigm for autonomous driving that integrates world models with reinforcement learning [1][2][4].

Group 1: Company Developments
- Li Auto's research and development in autonomous driving began in 2021, evolving from initial BEV solutions to more advanced systems [5].
- The company has significantly invested in AI, with nearly half of its R&D budget allocated to this area, indicating a strong commitment to integrating AI into its vehicle technology [2].
- Li Auto's recent presentation at ICCV 2025 highlighted its innovative approach, which combines synthetic data to address rare scenarios, leading to a notable improvement in miles per intervention (MPI) [2][4].

Group 2: Industry Reception
- The reception of Li Auto's advancements has been overwhelmingly positive, with many industry observers praising its research and development efforts, positioning it as a model for Chinese automotive companies [2][4].
- Articles from major Chinese AI platforms like Quantum Bit and Machine Heart have garnered significant attention, with one article achieving over 39,000 reads, reflecting the growing interest in Li Auto's developments [1][2].

Group 3: Competitive Landscape
- Li Auto is recognized as a leading player in the Chinese autonomous driving space, with a notable presence in discussions surrounding AI and autonomous vehicle technology [22].
- The company aims to differentiate itself not just as an automotive manufacturer but as a competitive AI entity, aligning its goals with broader AI advancements and the five stages of AI development as defined by OpenAI [18][19].
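The "world model plus reinforcement learning" paradigm mentioned above can be sketched in a few lines: a policy is optimized by rolling out a learned dynamics model instead of a real vehicle. This is a generic illustration under assumed network shapes and reward terms, not Li Auto's actual architecture.

```python
# Minimal sketch, not Li Auto's method: optimise a driving policy by rolling
# out a learned world model instead of the real environment.
import torch
import torch.nn as nn

state_dim, action_dim, horizon = 32, 2, 10

world_model = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.Tanh(),
                            nn.Linear(64, state_dim))        # predicts the next latent state
policy = nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh(),
                       nn.Linear(64, action_dim))            # maps a state to a control action
reward_head = nn.Linear(state_dim, 1)                        # scores each imagined state

optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

state = torch.randn(16, state_dim)            # a batch of imagined start states
total_reward = torch.zeros(16)
for _ in range(horizon):                      # imagined rollout, no real driving needed
    action = policy(state)
    state = world_model(torch.cat([state, action], dim=-1))
    total_reward = total_reward + reward_head(state).squeeze(-1)

loss = -total_reward.mean()                   # ascend the imagined return
optimizer.zero_grad()
loss.backward()
optimizer.step()
```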
Xiaomi's intelligent driving is catching up fast......
自动驾驶之心· 2025-11-03 00:04
Core Insights
- Xiaomi has made significant strides in the autonomous driving sector since the establishment of its automotive division in September 2021, with plans to release the Xiaomi SU7 in March 2024 and the YU7 in June 2025 [2]
- The company is actively engaging in advanced research, with a focus on integrating cutting-edge technologies into its autonomous driving solutions, as evidenced by a substantial number of research papers published by its automotive team [2]

Research Developments
- The AdaThinkDrive framework introduces a dual-mode reasoning mechanism in end-to-end autonomous driving, achieving a PDMS score of 90.3 in NAVSIM benchmark tests, surpassing the best pure vision baseline by 1.7 points [6]
- EvaDrive presents an evolutionary adversarial policy optimization framework that successfully addresses trajectory generation and evaluation challenges, achieving optimal performance in both NAVSIM and Bench2Drive benchmarks [9]
- MTRDrive enhances visual-language models for motion risk prediction by introducing a memory-tool synergistic reasoning framework, significantly improving generalization capabilities in autonomous driving tasks [13][14]

Performance Metrics
- The AdaThinkDrive framework has shown a 14% improvement in reasoning efficiency while effectively distinguishing when to apply reasoning in various driving scenarios [6]
- EvaDrive achieved a PDMS score of 94.9 in NAVSIM v1, outperforming other methods like DiffusionDrive and DriveSuprim [9]
- The DriveMRP-Agent demonstrated a remarkable zero-shot evaluation accuracy of 68.50% on real-world high-risk datasets, significantly improving from a baseline of 29.42% [15]

Framework Innovations
- ReCogDrive combines cognitive reasoning with reinforcement learning to enhance decision-making in autonomous driving, achieving a PDMS of 90.8 in NAVSIM tests [18]
- The AgentThink framework integrates dynamic tool invocation with chain-of-thought reasoning, improving reasoning scores by 53.91% and answer accuracy by 33.54% in benchmark tests [22]
- The ORION framework effectively aligns semantic reasoning with action generation, achieving a driving score of 77.74 and a success rate of 54.62% in Bench2Drive evaluations [23]

Data Generation Techniques
- Dream4Drive introduces a 3D perception-guided synthetic data generation framework, significantly enhancing the performance of perception tasks with minimal synthetic sample usage [26]
- The Genesis framework achieves joint generation of multi-view driving videos and LiDAR point cloud sequences, enhancing the realism and utility of autonomous driving simulation data [41]
- The Uni-Gaussians method unifies camera and LiDAR simulation, demonstrating superior simulation quality in dynamic driving scenarios [42]
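As an illustration of the dual-mode idea behind AdaThinkDrive (deciding when chain-of-thought reasoning is worth the latency), here is a toy gating function; the complexity score, its inputs, and the threshold are invented for the example and are not the paper's actual mechanism.

```python
# Illustrative sketch only (not AdaThinkDrive's gating rule): skip chain-of-thought
# reasoning in simple scenes and enable it when a complexity score crosses a threshold.
def scene_complexity(num_agents: int, min_gap_m: float, is_intersection: bool) -> float:
    """Toy complexity score in [0, 1] built from a few scene statistics."""
    score = 0.05 * min(num_agents, 10) + (0.3 if is_intersection else 0.0)
    score += 0.3 if min_gap_m < 5.0 else 0.0
    return min(score, 1.0)


def plan(num_agents: int, min_gap_m: float, is_intersection: bool,
         threshold: float = 0.5) -> str:
    """Pick the fast path (direct trajectory) or the slow path (reason, then plan)."""
    if scene_complexity(num_agents, min_gap_m, is_intersection) < threshold:
        return "fast path: emit trajectory directly"
    return "slow path: run chain-of-thought, then emit trajectory"


print(plan(num_agents=2, min_gap_m=20.0, is_intersection=False))  # fast path
print(plan(num_agents=9, min_gap_m=3.0, is_intersection=True))    # slow path
```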
Sharing the championship solution for the ICCV 2025 "End-to-End Autonomous Driving" track!
自动驾驶之心· 2025-10-29 00:04
Core Insights
- The article highlights the victory of Inspur's AI team in the Autonomous Grand Challenge 2025, where they achieved a score of 53.06 in the end-to-end autonomous driving track using their innovative framework "SimpleVSF" [2][7][13]
- The framework integrates bird's-eye view perception trajectory prediction with a vision-language multimodal model, enhancing decision-making capabilities in complex traffic scenarios [2][5][8]

Summary by Sections

Competition Overview
- The ICCV 2025 Autonomous Driving Challenge is a significant international event focusing on autonomous driving and embodied intelligence, featuring three main tracks [4]
- The end-to-end driving challenge evaluates trajectory prediction and behavior planning using a data-driven simulation framework, emphasizing safety and efficiency across nine key metrics [4]

Technical Challenges
- End-to-end autonomous driving aims to reduce errors and information loss from traditional modular approaches, yet struggles with decision-making in complex real-world scenarios [5]
- Current methods can identify basic elements but fail to understand higher-level semantics and situational awareness, leading to suboptimal decisions [5]

Innovations in SimpleVSF Framework
- The SimpleVSF framework bridges the gap between traditional trajectory planning and semantic understanding through a vision-language model (VLM) [7][8]
- The VLM-enhanced scoring mechanism improves decision quality and scene adaptability, resulting in a 2% performance increase for single models and up to 6% in fusion decision-making [8][11]

Decision-Making Mechanism
- The dual fusion decision mechanism combines quantitative and qualitative assessments, ensuring optimal trajectory selection based on both numerical and semantic criteria [10][11]
- The framework employs advanced models for generating diverse candidate trajectories and extracting robust environmental features, enhancing overall system performance [13]

Achievements and Future Directions
- The SimpleVSF framework's success in the challenge sets a new benchmark for end-to-end autonomous driving technology, supporting further advancements in the field [13]
- Inspur's AI team aims to leverage their algorithmic and computational strengths to drive innovation in autonomous driving technology [13]
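The dual fusion decision described above, combining a quantitative trajectory score with a qualitative VLM preference, can be sketched as a simple weighted selection; the candidate names, scores, and fusion weight below are made up for illustration and are not Inspur's implementation.

```python
# Hedged sketch of the general idea (not SimpleVSF's actual scorer): fuse a
# quantitative metric score with a VLM's semantic preference, then pick the best candidate.
from typing import Dict, List


def fuse_and_select(candidates: List[str],
                    metric_scores: Dict[str, float],   # e.g. collision/comfort metrics in [0, 1]
                    vlm_scores: Dict[str, float],      # semantic preference in [0, 1]
                    vlm_weight: float = 0.3) -> str:
    """Return the candidate trajectory with the highest fused score."""
    def fused(traj_id: str) -> float:
        return (1.0 - vlm_weight) * metric_scores[traj_id] + vlm_weight * vlm_scores[traj_id]
    return max(candidates, key=fused)


candidates = ["keep_lane", "nudge_left", "brake_hard"]
metric_scores = {"keep_lane": 0.82, "nudge_left": 0.78, "brake_hard": 0.60}
vlm_scores = {"keep_lane": 0.40, "nudge_left": 0.90, "brake_hard": 0.50}  # VLM prefers the nudge
print(fuse_and_select(candidates, metric_scores, vlm_scores))
```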
DeepSeek's ultimate ambition: turning the basic language of large language models into images
36Kr· 2025-10-21 12:52
Core Insights
- DeepSeek has open-sourced DeepSeek-OCR, an OCR model that achieves state-of-the-art results on benchmarks like OmniDocBench [1]
- The motivation behind entering the OCR field is to address the computational bottleneck of long context processing in large language models (LLMs) [4][6]
- The paper proposes that text information can be efficiently compressed through optical 2D mapping, allowing visual language models (VLMs) to decompress original information from images [4][6]

Group 1: Long Context Processing
- The pursuit of longer context in LLMs has led to a competitive arms race, with token windows expanding from thousands to millions [7]
- The core limitation arises from the attention mechanism in the Transformer architecture, where computational complexity and memory usage grow quadratically with sequence length [7]
- DeepSeek-AI's engineers propose a fundamental question: can the number of tokens be compressed rather than just optimizing attention calculations? [7][10]

Group 2: Visual Tokens vs. Text Tokens
- Visual tokens are the basic units of information processed by visual models, while text tokens are used by LLMs [8]
- A 1024x1024 image can be divided into 4096 visual tokens, significantly reducing the number of tokens needed compared to text representation [9]
- The understanding that visual modalities can serve as efficient compression mediums for text information led to the creation of DeepSeek-OCR [9]

Group 3: DeepEncoder and Compression Techniques
- DeepSeek-OCR is essentially a proof of concept for an "optical compression-decompression" system [10]
- The DeepEncoder, a key innovation, is designed to handle high-resolution inputs while producing minimal visual tokens [11][12]
- The architecture consists of three stages: a local detail processor, a compression module, and a global attention layer [14][16]

Group 4: Performance Metrics
- Experimental results show a 10.5x compression rate with 64 visual tokens decoding 600-700 text tokens, achieving an OCR accuracy of 96.5% [17][18]
- At a 20x compression rate, the model maintains around 60% accuracy while decoding over 1200 text tokens [17][18]
- DeepSeek-OCR outperforms existing models like GOT-OCR2.0 and MinerU2.0 in terms of performance and token efficiency [19][20]

Group 5: Future Vision and Memory Simulation
- The team aims to simulate human memory's forgetting mechanism, which naturally prioritizes relevant information while compressing less important details [25][27]
- The multi-resolution design of DeepSeek-OCR provides a technical foundation for managing memory in a way that mimics human cognitive processes [29][30]
- The ultimate goal is to create a system that balances information retention and computational efficiency, potentially leading to a new paradigm in AI memory and input systems [32][35]
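The compression figures quoted above follow from straightforward token arithmetic. The sketch below reproduces them under the common assumption of 16x16-pixel patches (which yields the 4096-token figure for a 1024x1024 image); the exact patching scheme of DeepEncoder may differ.

```python
# Back-of-the-envelope arithmetic matching the figures quoted above, assuming
# 16x16-pixel patches: a 1024x1024 image yields 4096 visual tokens, and decoding
# ~670 text tokens from 64 visual tokens corresponds to roughly 10.5x compression.
def visual_tokens(image_px: int, patch_px: int = 16) -> int:
    """Number of patch tokens for a square image of side image_px."""
    return (image_px // patch_px) ** 2


def compression_ratio(text_tokens: int, vision_tokens: int) -> float:
    """How many text tokens each visual token stands in for."""
    return text_tokens / vision_tokens


print(visual_tokens(1024))                    # 4096
print(round(compression_ratio(670, 64), 1))   # ~10.5x regime, ~96.5% accuracy reported
print(round(compression_ratio(1280, 64), 1))  # ~20x regime, ~60% accuracy reported
```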
The clues behind Tesla's "call back" to Li Xiang
理想TOP2· 2025-10-21 03:13
Core Insights
- The article discusses advancements in autonomous driving technology, particularly Tesla's use of techniques similar to VLA in its V14 model, highlighting the importance of spatial understanding and multitasking capabilities [1][2]
- Ashok Elluswamy, Tesla's AI software VP, emphasized the integration of various data sources in Tesla's Full Self-Driving (FSD) system during a workshop at ICCV 2025, indicating a significant upgrade in its autonomous driving capabilities [1][2]

Group 1: Tesla's Technological Advancements
- Tesla's V14 model utilizes technology akin to VLA, showcasing enhanced spatial comprehension and multitasking abilities, which are critical for long-duration tasks [1]
- Elluswamy's presentation at ICCV 2025 highlighted the FSD system's reliance on a comprehensive network that incorporates camera data, LBS positioning, and audio inputs, culminating in action execution [1][2]

Group 2: ICCV 2025 Workshop Details
- The ICCV 2025 workshop focused on distilling foundation models for autonomous driving, aiming to improve the deployment of large models like vision-language models and generative AI in vehicles [3]
- Key topics included foundational models for robotics, knowledge distillation, and multimodal fusion, indicating a broad exploration of AI applications in autonomous driving [6][7]
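One of the workshop topics, knowledge distillation, has a standard formulation worth recalling: a small student model is trained to match the temperature-softened outputs of a large teacher while still fitting ground-truth labels. The sketch below is that generic objective, not anything specific to Tesla's FSD stack.

```python
# Generic knowledge-distillation objective (not Tesla's recipe): the student
# matches the teacher's softened distribution and the hard labels at once.
import torch
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits, labels,
                      temperature: float = 2.0, alpha: float = 0.5):
    """Weighted sum of a softened KL term and a standard cross-entropy term."""
    soft_teacher = F.log_softmax(teacher_logits / temperature, dim=-1)
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    kd = F.kl_div(soft_student, soft_teacher, log_target=True,
                  reduction="batchmean") * temperature ** 2
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * ce


student_logits = torch.randn(8, 10, requires_grad=True)   # small on-vehicle model
teacher_logits = torch.randn(8, 10)                        # large foundation model
labels = torch.randint(0, 10, (8,))
print(distillation_loss(student_logits, teacher_logits, labels))
```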
Being able to "see" and "talk" isn't enough; robots also need to "compute"! Tool use + reinforcement learning: TIGeR enables precise robot manipulation
具身智能之心· 2025-10-11 16:02
Core Insights
- The article discusses the limitations of current Vision-Language Models (VLMs) in accurately interpreting and executing spatial commands in robotics, emphasizing the need for precise geometric reasoning and tool integration [2][5].

Group 1: TIGeR Framework
- The Tool-Integrated Geometric Reasoning (TIGeR) framework enhances VLMs by integrating tool usage and reinforcement learning to improve their ability to perform precise calculations in three-dimensional space [2][6].
- TIGeR allows AI models to transition from qualitative perception to quantitative computation, addressing the core pain points of existing VLMs [2][7].

Group 2: Advantages of TIGeR
- TIGeR provides precise localization by integrating depth information and camera parameters, enabling the accurate conversion of commands like "10 centimeters above" into three-dimensional coordinates [7].
- The framework supports multi-view unified reasoning, allowing information from various perspectives to be merged and reasoned within a consistent world coordinate system [7].
- The model's reasoning process is transparent, making it easier to debug and optimize by clearly showing the tools used, parameters input, and results obtained [7].

Group 3: Training Process
- The training of TIGeR involves a two-phase process: first, supervised learning to teach basic tool usage and reasoning chains, followed by reinforcement learning to refine the model's tool usage skills through a hierarchical reward mechanism [8][10].
- The hierarchical reward mechanism evaluates not only the correctness of the final answer but also the accuracy of the process, including tool selection and parameter precision [8].

Group 4: Data Utilization
- The TIGeR-300K dataset, consisting of 300,000 samples, was created to train the model in solving geometric problems, ensuring both accuracy and diversity in the tasks covered [10][13].
- The dataset construction involved template-based generation and large model rewriting to enhance generalization and flexibility, ensuring the model can handle complex real-world instructions [13].

Group 5: Performance Metrics
- TIGeR outperforms other leading VLMs in spatial understanding benchmarks, achieving scores such as 93.85 in 2D-Rel and 96.33 in 3D-Depth [10][14].
- The model's performance in various spatial reasoning tasks demonstrates its capability to execute operations that require precise three-dimensional positioning, which other models struggle to achieve [16].
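The "precise localization" capability described above ultimately rests on a simple geometric tool: back-projecting a pixel and its depth through the camera intrinsics and then applying a metric offset. The sketch below is our own illustration of such a tool call; the intrinsics, pixel coordinates, and the assumption that negative y points upward are hypothetical, and the function is not TIGeR's actual tool API.

```python
# Illustrative geometric tool (not TIGeR's API): back-project a pixel with its
# depth through pinhole intrinsics, then apply a metric offset such as "10 cm above".
import numpy as np


def backproject(u: float, v: float, depth_m: float, K: np.ndarray) -> np.ndarray:
    """Pixel (u, v) with depth in metres -> 3D point [x, y, z] in the camera frame."""
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return np.array([x, y, depth_m])


K = np.array([[600.0, 0.0, 320.0],
              [0.0, 600.0, 240.0],
              [0.0, 0.0, 1.0]])              # assumed pinhole intrinsics

cup_center = backproject(u=350.0, v=260.0, depth_m=0.80, K=K)
target = cup_center + np.array([0.0, -0.10, 0.0])  # 10 cm "above", assuming -y points up
print(cup_center, target)
```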
Robots teach themselves new skills by "watching videos": NovaFlow extracts action flow from generated videos for zero-shot manipulation
机器之心· 2025-10-09 02:24
Core Insights
- The article discusses the development of NovaFlow, a novel framework for enabling robots to perform complex manipulation tasks without requiring extensive training data or demonstrations, leveraging large video generation models to extract common-sense knowledge from vast amounts of internet video content [2][4][23]

Group 1: NovaFlow Framework Overview
- NovaFlow aims to decouple task understanding from low-level control, allowing robots to learn from generated videos rather than requiring human demonstrations or trial-and-error learning [4][23]
- The framework consists of two main components: the Actionable Flow Generator and the Flow Executor, which work together to translate natural language instructions into executable 3D object flows [8][9]

Group 2: Actionable Flow Generation
- The Actionable Flow Generator translates user input (natural language and RGB-D images) into a 3D action flow through a four-step process: video generation, 2D-to-3D lifting, 3D point tracking, and object segmentation [9][12][14]
- The generator utilizes state-of-the-art video generation models to create instructional videos, which are then processed to extract actionable 3D object flows [12][14]

Group 3: Action Flow Execution
- The Flow Executor converts the abstract 3D object flows into specific robot action sequences, employing different strategies based on the type of object being manipulated [15][20]
- The framework has been tested on various robotic platforms, demonstrating its effectiveness in manipulating rigid, articulated, and deformable objects [16][18]

Group 4: Experimental Results
- NovaFlow outperformed other zero-shot methods and even surpassed traditional imitation learning approaches that required multiple demonstrations, showcasing the potential of extracting common-sense knowledge from generated videos [19][20]
- The framework achieved high success rates in tasks involving rigid and articulated objects, as well as more complex tasks with deformable objects, indicating its robustness and versatility [19][20]

Group 5: Challenges and Future Directions
- Despite its successes, the research highlights limitations in the current open-loop planning system, particularly in the physical execution phase, suggesting a need for closed-loop feedback to enhance robustness against real-world uncertainties [23]
- Future research will focus on developing systems that can dynamically adjust or replan actions based on real-time environmental feedback, further advancing the capabilities of autonomous robots [23]
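For rigid objects, one common way for a flow executor to turn a tracked 3D object flow into a motion target is to fit the rigid transform between consecutive point sets (the Kabsch/Procrustes solution via SVD). The sketch below shows that generic step; it is our illustration of the idea, not NovaFlow's exact solver.

```python
# Generic rigid-fit step (not NovaFlow's exact solver): estimate the rotation
# and translation mapping tracked object points at time t to time t+1.
import numpy as np


def rigid_transform_from_flow(src: np.ndarray, dst: np.ndarray):
    """src, dst: (N, 3) tracked object points before/after; returns R (3x3), t (3,)."""
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)       # cross-covariance of centred point sets
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:                  # fix a possible reflection
        Vt[-1] *= -1
        R = Vt.T @ U.T
    t = dst_c - R @ src_c
    return R, t


src = np.random.rand(50, 3)
true_t = np.array([0.05, 0.0, 0.02])
dst = src + true_t                            # pure translation for the demo
R, t = rigid_transform_from_flow(src, dst)
print(np.round(R, 3), np.round(t, 3))         # R ~ identity, t ~ [0.05, 0.0, 0.02]
```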
RoboDexVLM: general-purpose dexterous robot manipulation based on a hierarchical VLM architecture
具身智能之心· 2025-09-26 00:04
Core Insights
- RoboDexVLM is an innovative robot task planning and grasp detection framework designed for collaborative robotic arms equipped with dexterous hands, focusing on complex long-sequence tasks and diverse object manipulation [2][6]

Group 1: Framework Overview
- The framework utilizes a robust task planner with a task-level recovery mechanism, leveraging visual language models to interpret and execute open-vocabulary instructions for completing long-sequence tasks [2][6]
- It introduces a language-guided dexterous grasp perception algorithm, specifically designed for zero-shot dexterous manipulation of diverse objects and instructions [2][6]
- Comprehensive experimental results validate RoboDexVLM's effectiveness, adaptability, and robustness in handling long-sequence scenarios and executing dexterous grasping tasks [2][6]

Group 2: Key Features
- The framework allows robots to understand natural language commands, enabling seamless human-robot interaction [7]
- It supports zero-shot grasping of various objects, showcasing the dexterous hand's capability to manipulate items of different shapes and sizes [7]
- The visual language model acts as the "brain" for long-range task planning, ensuring that the robot does not lose track of its objectives [7]

Group 3: Practical Applications
- RoboDexVLM represents the first general-purpose dexterous robot operation framework that integrates visual language models, breaking through the limitations of traditional and end-to-end methods [6][7]
- The framework's real-world performance demonstrates its potential in embodied intelligence and human-robot collaboration [6][7]
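The task-level recovery mechanism described above can be pictured as a loop that executes VLM-proposed skills and asks for a re-plan whenever a skill fails. The toy sketch below illustrates that control flow; the skill names, the random failure model, and the planner stub are placeholders, not RoboDexVLM's actual interface.

```python
# Toy sketch of a task-level recovery loop in the spirit described above (our
# illustration, not RoboDexVLM's planner): execute skills in order and re-plan on failure.
import random
from typing import List


def vlm_plan(instruction: str) -> List[str]:
    """Stand-in for the VLM planner: returns a skill sequence for the instruction."""
    return ["detect_object", "grasp_object", "move_to_target", "release_object"]


def execute_skill(skill: str) -> bool:
    """Stand-in for the low-level controller; fails randomly to exercise recovery."""
    return random.random() > 0.2


def run(instruction: str, max_retries: int = 3) -> bool:
    plan = vlm_plan(instruction)
    step = 0
    while step < len(plan):
        if execute_skill(plan[step]):
            step += 1
            continue
        if max_retries == 0:
            return False                       # recovery budget exhausted
        max_retries -= 1
        plan = vlm_plan(instruction)[step:]    # task-level recovery: re-plan from the failed step
        step = 0
    return True


print(run("put the red cup on the tray"))
```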