Vision-Language Models
Two of the "big three Chinese AI media outlets" have already covered Li Auto's recent AI progress
理想TOP2· 2025-11-09 14:59
Core Insights
- The article discusses the rising prominence of Li Auto in the autonomous driving sector, particularly its recent advancements presented at the ICCV 2025 conference, where it introduced a new paradigm for autonomous driving that integrates world models with reinforcement learning [1][2][4].

Group 1: Company Developments
- Li Auto's autonomous driving R&D began in 2021, evolving from initial BEV solutions to more advanced systems [5].
- The company has invested heavily in AI, with nearly half of its R&D budget allocated to this area, indicating a strong commitment to integrating AI into its vehicle technology [2].
- Li Auto's ICCV 2025 presentation highlighted an approach that uses synthetic data to cover rare scenarios, leading to a notable improvement in miles per intervention (MPI), i.e., the distance driven between human takeovers [2][4].

Group 2: Industry Reception
- The reception of Li Auto's advancements has been overwhelmingly positive, with many industry observers praising its R&D efforts and positioning it as a model for Chinese automotive companies [2][4].
- Articles from major Chinese AI platforms such as Quantum Bit and Machine Heart have garnered significant attention, with one article surpassing 39,000 reads, reflecting growing interest in Li Auto's developments [1][2].

Group 3: Competitive Landscape
- Li Auto is recognized as a leading player in the Chinese autonomous driving space, with a notable presence in discussions around AI and autonomous vehicle technology [22].
- The company aims to differentiate itself not just as an automotive manufacturer but as a competitive AI company, aligning its goals with broader AI advancements and the five stages of AI development as defined by OpenAI [18][19].
Xiaomi's intelligent driving is catching up fast...
自动驾驶之心· 2025-11-03 00:04
Core Insights
- Xiaomi has made significant strides in autonomous driving since establishing its automotive division in September 2021, releasing the Xiaomi SU7 in March 2024 and the YU7 in June 2025 [2].
- The company is actively engaged in advanced research, integrating cutting-edge techniques into its autonomous driving stack, as evidenced by the substantial number of research papers published by its automotive team [2].

Research Developments
- The AdaThinkDrive framework introduces a dual-mode reasoning mechanism for end-to-end autonomous driving, achieving a PDMS score of 90.3 on the NAVSIM benchmark and surpassing the best pure-vision baseline by 1.7 points [6]; a hedged sketch of this adaptive-reasoning idea follows this summary.
- EvaDrive presents an evolutionary adversarial policy-optimization framework that addresses trajectory generation and evaluation jointly, achieving top performance on both the NAVSIM and Bench2Drive benchmarks [9].
- MTRDrive enhances vision-language models for motion-risk prediction by introducing a memory-tool synergistic reasoning framework, significantly improving generalization in autonomous driving tasks [13][14].

Performance Metrics
- AdaThinkDrive shows a 14% improvement in reasoning efficiency while effectively deciding when to apply explicit reasoning across driving scenarios [6].
- EvaDrive achieved a PDMS of 94.9 on NAVSIM v1, outperforming methods such as DiffusionDrive and DriveSuprim [9].
- DriveMRP-Agent reached a zero-shot evaluation accuracy of 68.50% on real-world high-risk datasets, up sharply from a 29.42% baseline [15].

Framework Innovations
- ReCogDrive combines cognitive reasoning with reinforcement learning to enhance decision-making, achieving a PDMS of 90.8 on NAVSIM [18].
- The AgentThink framework integrates dynamic tool invocation with chain-of-thought reasoning, improving reasoning scores by 53.91% and answer accuracy by 33.54% in benchmark tests [22].
- The ORION framework aligns semantic reasoning with action generation, achieving a driving score of 77.74 and a success rate of 54.62% on Bench2Drive [23].

Data Generation Techniques
- Dream4Drive introduces a 3D-perception-guided synthetic data generation framework that significantly boosts perception performance with only a small number of synthetic samples [26].
- The Genesis framework jointly generates multi-view driving videos and LiDAR point-cloud sequences, improving the realism and utility of autonomous driving simulation data [41].
- The Uni-Gaussians method unifies camera and LiDAR simulation, demonstrating superior simulation quality in dynamic driving scenarios [42].
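The adaptive think-versus-no-think routing described for AdaThinkDrive can be illustrated with a minimal sketch. This is not the paper's implementation: the scene fields, the complexity proxy, the threshold, and the three planner stubs are hypothetical placeholders standing in for model calls.

```python
from dataclasses import dataclass

@dataclass
class Scene:
    """Toy container for a driving scene; fields are illustrative only."""
    num_agents: int          # surrounding vehicles / pedestrians
    occlusion_ratio: float   # 0.0 (clear) .. 1.0 (heavily occluded)
    is_intersection: bool

def scene_complexity(scene: Scene) -> float:
    """Hypothetical scalar proxy for how hard the scenario is."""
    score = 0.1 * scene.num_agents + scene.occlusion_ratio
    if scene.is_intersection:
        score += 0.5
    return score

def plan(scene: Scene, think_threshold: float = 1.0) -> dict:
    """Dual-mode planning: skip chain-of-thought on easy scenes and invoke it
    only when the complexity proxy crosses a threshold."""
    if scene_complexity(scene) < think_threshold:
        return {"mode": "fast", "trajectory": fast_policy(scene)}
    reasoning = chain_of_thought(scene)          # slow, token-hungry path
    return {"mode": "think", "rationale": reasoning,
            "trajectory": reasoned_policy(scene, reasoning)}

# In practice these would be model calls; here they are stubs.
def fast_policy(scene):            return "keep lane, maintain speed"
def chain_of_thought(scene):       return "yield to crossing pedestrian, then turn"
def reasoned_policy(scene, cot):   return "slow to 10 km/h, turn left after yield"

if __name__ == "__main__":
    easy = Scene(num_agents=2, occlusion_ratio=0.1, is_intersection=False)
    hard = Scene(num_agents=8, occlusion_ratio=0.6, is_intersection=True)
    print(plan(easy)["mode"])   # -> fast
    print(plan(hard)["mode"])   # -> think
```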
Sharing the ICCV 2025 "End-to-End Autonomous Driving" champion solution!
自动驾驶之心· 2025-10-29 00:04
Core Insights
- The article highlights the victory of Inspur's AI team in the Autonomous Grand Challenge 2025, where it scored 53.06 in the end-to-end autonomous driving track with its "SimpleVSF" framework [2][7][13].
- The framework combines bird's-eye-view perception and trajectory prediction with a vision-language multimodal model, strengthening decision-making in complex traffic scenarios [2][5][8].

Summary by Sections

Competition Overview
- The ICCV 2025 Autonomous Driving Challenge is a major international event focused on autonomous driving and embodied intelligence, featuring three main tracks [4].
- The end-to-end driving track evaluates trajectory prediction and behavior planning in a data-driven simulation framework, emphasizing safety and efficiency across nine key metrics [4].

Technical Challenges
- End-to-end autonomous driving aims to reduce the errors and information loss of traditional modular pipelines, yet it still struggles with decision-making in complex real-world scenarios [5].
- Current methods can identify basic scene elements but fail to grasp higher-level semantics and situational context, leading to suboptimal decisions [5].

Innovations in the SimpleVSF Framework
- SimpleVSF bridges the gap between traditional trajectory planning and semantic understanding through a vision-language model (VLM) [7][8].
- The VLM-enhanced scoring mechanism improves decision quality and scene adaptability, yielding roughly a 2% performance gain for single models and up to 6% with fused decision-making [8][11].

Decision-Making Mechanism
- A dual fusion decision mechanism combines quantitative and qualitative assessments, selecting the optimal trajectory on both numerical and semantic criteria; a hedged sketch of this kind of score fusion follows this summary [10][11].
- The framework employs advanced models to generate diverse candidate trajectories and extract robust environmental features, lifting overall system performance [13].

Achievements and Future Directions
- SimpleVSF's result in the challenge sets a new benchmark for end-to-end autonomous driving and supports further advances in the field [13].
- Inspur's AI team aims to leverage its algorithmic and computational strengths to keep driving innovation in autonomous driving technology [13].
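The dual fusion decision mechanism, combining quantitative metric scores with qualitative VLM judgments, can be pictured as a weighted score fusion over candidate trajectories. The weights, the min-max normalization, and the example numbers below are assumptions for illustration, not SimpleVSF's actual configuration.

```python
import numpy as np

def fuse_and_select(candidates, metric_scores, vlm_scores, w_metric=0.7, w_vlm=0.3):
    """Pick the candidate trajectory with the highest fused score.

    metric_scores : quantitative scores (collision, comfort, progress, ...)
    vlm_scores    : qualitative scores from a vision-language judge
    The 0.7 / 0.3 weights are illustrative, not the competition configuration.
    """
    m = np.asarray(metric_scores, dtype=float)
    v = np.asarray(vlm_scores, dtype=float)
    # Min-max normalize each score family so the two scales are comparable.
    m = (m - m.min()) / (m.max() - m.min() + 1e-8)
    v = (v - v.min()) / (v.max() - v.min() + 1e-8)
    fused = w_metric * m + w_vlm * v
    best = int(np.argmax(fused))
    return candidates[best], float(fused[best])

traj, score = fuse_and_select(
    candidates=["keep_lane", "nudge_left", "brake_hard"],
    metric_scores=[0.82, 0.75, 0.40],   # e.g. rule-based sub-metric aggregate
    vlm_scores=[0.6, 0.9, 0.2],         # e.g. VLM preference for the safer nudge
)
print(traj, round(score, 3))            # the VLM vote flips the choice to nudge_left
```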
DeepSeek's ultimate ambition: turning the very language of large language models into images
36Kr· 2025-10-21 12:52
Core Insights
- DeepSeek has open-sourced DeepSeek-OCR, an OCR model that achieves state-of-the-art results on benchmarks such as OmniDocBench [1].
- The motivation for entering the OCR field is to address the computational bottleneck of long-context processing in large language models (LLMs) [4][6].
- The paper proposes that text can be efficiently compressed through optical 2D mapping, allowing vision-language models (VLMs) to decompress the original information from images [4][6].

Group 1: Long-Context Processing
- The pursuit of longer context in LLMs has turned into an arms race, with token windows expanding from thousands to millions [7].
- The core limitation stems from the attention mechanism of the Transformer architecture, whose computation and memory grow quadratically with sequence length [7].
- DeepSeek-AI's engineers pose a more fundamental question: rather than merely optimizing attention computation, can the number of tokens itself be compressed? [7][10]

Group 2: Visual Tokens vs. Text Tokens
- Visual tokens are the basic units of information processed by vision models, while text tokens are the units consumed by LLMs [8].
- A 1024x1024 image can be divided into 4096 visual tokens, significantly reducing the number of tokens needed compared to a pure text representation [9].
- The realization that the visual modality can serve as an efficient compression medium for text led to the creation of DeepSeek-OCR [9].

Group 3: DeepEncoder and Compression Techniques
- DeepSeek-OCR is essentially a proof of concept for an "optical compression-decompression" system [10].
- The DeepEncoder, the key innovation, is designed to handle high-resolution inputs while emitting only a minimal number of visual tokens [11][12].
- Its architecture consists of three stages: a local detail processor, a compression module, and a global attention layer [14][16]. A worked token-budget calculation follows this summary.

Group 4: Performance Metrics
- Experiments show a 10.5x compression rate, with 64 visual tokens decoding 600-700 text tokens at an OCR accuracy of 96.5% [17][18].
- At a 20x compression rate, the model still maintains around 60% accuracy while decoding over 1200 text tokens [17][18].
- DeepSeek-OCR outperforms existing models such as GOT-OCR2.0 and MinerU2.0 in both accuracy and token efficiency [19][20].

Group 5: Future Vision and Memory Simulation
- The team aims to simulate the forgetting mechanism of human memory, which naturally prioritizes relevant information while compressing less important details [25][27].
- The multi-resolution design of DeepSeek-OCR provides a technical foundation for managing memory in a way that mimics human cognition [29][30].
- The ultimate goal is a system that balances information retention and computational efficiency, potentially leading to a new paradigm for AI memory and input [32][35].
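The token arithmetic behind optical compression can be reproduced from the figures quoted above. Only the 4096-token count for a 1024x1024 image and the 64-token / 600-700-token / 96.5% figures come from the article; the patch size and the internal compressor ratio below are assumptions for illustration.

```python
def vision_token_budget(image_side=1024, patch=16, conv_downsample=16):
    """Rough token arithmetic behind optical compression.

    A 1024x1024 image split into 16x16 patches yields 4096 patch tokens
    (matching the figure quoted in the article); a hypothetical 16x
    convolutional compressor would then leave 256 visual tokens. The exact
    stage ratios inside DeepEncoder are assumptions, not the published design.
    """
    patch_tokens = (image_side // patch) ** 2            # 4096
    visual_tokens = patch_tokens // conv_downsample      # 256
    return patch_tokens, visual_tokens

def compression_ratio(text_tokens, visual_tokens):
    return text_tokens / visual_tokens

patch_tokens, visual_tokens = vision_token_budget()
print(patch_tokens, visual_tokens)                       # 4096 256

# Figures reported in the article: 64 visual tokens decode 600-700 text
# tokens at ~96.5% OCR accuracy, i.e. roughly a 10x compression ratio.
print(round(compression_ratio(text_tokens=672, visual_tokens=64), 1))  # 10.5
```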
Clues that Tesla is "calling back" to Li Xiang
理想TOP2· 2025-10-21 03:13
Core Insights
- The article discusses advances in autonomous driving technology, focusing on Tesla's use of techniques similar to VLA in FSD V14 and highlighting the importance of spatial understanding and multitasking capability [1][2].
- Ashok Elluswamy, Tesla's VP of AI software, emphasized the integration of multiple data sources in Tesla's Full Self-Driving (FSD) system during a workshop at ICCV 2025, indicating a significant upgrade in its autonomous driving capabilities [1][2].

Group 1: Tesla's Technological Advancements
- Tesla's FSD V14 uses technology akin to VLA, showing enhanced spatial comprehension and multitasking abilities, both critical for long-duration tasks [1].
- Elluswamy's ICCV 2025 presentation described an FSD stack built on a comprehensive network that incorporates camera data, LBS positioning, and audio inputs, culminating in action execution [1][2].

Group 2: ICCV 2025 Workshop Details
- The ICCV 2025 workshop focused on distilling foundation models for autonomous driving, aiming to improve the in-vehicle deployment of large models such as vision-language models and generative AI [3].
- Key topics included foundation models for robotics, knowledge distillation, and multimodal fusion, indicating a broad exploration of AI applications in autonomous driving [6][7].
Being able to "see" and "speak" isn't enough; robots also need to "compute"! Tool-Use + reinforcement learning: TIGeR enables precise robot manipulation
具身智能之心· 2025-10-11 16:02
Core Insights
- The article discusses the limitations of current vision-language models (VLMs) in accurately interpreting and executing spatial commands in robotics, emphasizing the need for precise geometric reasoning and tool integration [2][5].

Group 1: TIGeR Framework
- The Tool-Integrated Geometric Reasoning (TIGeR) framework augments VLMs with tool use and reinforcement learning, improving their ability to perform precise calculations in three-dimensional space [2][6].
- TIGeR moves AI models from qualitative perception to quantitative computation, addressing the core pain point of existing VLMs [2][7].

Group 2: Advantages of TIGeR
- TIGeR provides precise localization by integrating depth information and camera parameters, converting commands like "10 centimeters above" into concrete three-dimensional coordinates; a hedged back-projection sketch follows this summary [7].
- The framework supports unified multi-view reasoning, merging information from different viewpoints into a consistent world coordinate system [7].
- The model's reasoning process is transparent: the tools used, the parameters passed in, and the results obtained are all visible, making it easier to debug and optimize [7].

Group 3: Training Process
- Training proceeds in two phases: supervised learning first teaches basic tool usage and reasoning chains, then reinforcement learning refines tool-use skill through a hierarchical reward mechanism [8][10].
- The hierarchical reward evaluates not only the correctness of the final answer but also the quality of the process, including tool selection and parameter precision [8].

Group 4: Data Utilization
- The TIGeR-300K dataset, with 300,000 samples, was built to train the model on geometric problems, balancing accuracy and task diversity [10][13].
- Dataset construction combined template-based generation with large-model rewriting to improve generalization and flexibility, so the model can handle complex real-world instructions [13].

Group 5: Performance Metrics
- TIGeR outperforms other leading VLMs on spatial-understanding benchmarks, scoring 93.85 on 2D-Rel and 96.33 on 3D-Depth [10][14].
- Its performance across spatial reasoning tasks shows that it can execute operations requiring precise three-dimensional positioning, something other models struggle to achieve [16].
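The kind of geometric tool TIGeR is described as calling, converting a pixel plus depth into a metric 3D coordinate and then applying an offset like "10 centimeters above", is standard pinhole back-projection. The intrinsics and measurements below are illustrative values, not taken from the paper.

```python
import numpy as np

def backproject(u, v, depth_m, fx, fy, cx, cy):
    """Standard pinhole back-projection: pixel (u, v) plus metric depth
    to a 3D point in the camera frame. This is the sort of geometric tool
    a TIGeR-style model would call instead of guessing coordinates."""
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return np.array([x, y, depth_m])

# Example intrinsics and measurements (illustrative values only).
fx = fy = 600.0
cx, cy = 320.0, 240.0
cup_pixel = (400, 300)          # detected cup center in the image
cup_depth = 0.80                # metres, from the depth map

cup_xyz = backproject(*cup_pixel, cup_depth, fx, fy, cx, cy)

# "Place it 10 cm above the cup": a quantitative offset along the camera's
# up axis (-y in the usual camera convention), not a fuzzy language guess.
target_xyz = cup_xyz + np.array([0.0, -0.10, 0.0])
print(cup_xyz.round(3), target_xyz.round(3))
```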
Robots teach themselves new skills by "watching videos": NovaFlow extracts action flows from generated videos for zero-shot manipulation
机器之心· 2025-10-09 02:24
Core Insights
- The article describes NovaFlow, a framework that enables robots to perform complex manipulation tasks without task-specific training data or demonstrations, by extracting common-sense knowledge from large video generation models trained on vast amounts of internet video [2][4][23].

Group 1: NovaFlow Framework Overview
- NovaFlow decouples task understanding from low-level control, letting robots learn from generated videos instead of relying on human demonstrations or trial-and-error learning [4][23].
- The framework has two main components, the Actionable Flow Generator and the Flow Executor, which together translate natural-language instructions into executable 3D object flows [8][9].

Group 2: Actionable Flow Generation
- The Actionable Flow Generator turns user input (natural language plus an RGB-D image) into a 3D action flow in four steps: video generation, 2D-to-3D lifting, 3D point tracking, and object segmentation [9][12][14].
- It uses state-of-the-art video generation models to produce instructional videos, which are then processed to extract actionable 3D object flows [12][14].

Group 3: Action Flow Execution
- The Flow Executor converts the abstract 3D object flows into concrete robot action sequences, using different strategies depending on the type of object being manipulated; a hedged rigid-body example follows this summary [15][20].
- The framework has been tested on multiple robot platforms, manipulating rigid, articulated, and deformable objects [16][18].

Group 4: Experimental Results
- NovaFlow outperformed other zero-shot methods and even surpassed imitation-learning approaches trained on multiple demonstrations, showing the value of extracting common-sense knowledge from generated videos [19][20].
- It achieved high success rates on rigid- and articulated-object tasks as well as harder deformable-object tasks, indicating robustness and versatility [19][20].

Group 5: Challenges and Future Directions
- The current open-loop planning system remains a limitation, particularly during physical execution, suggesting the need for closed-loop feedback to cope with real-world uncertainty [23].
- Future work will focus on systems that dynamically adjust or replan actions based on real-time environmental feedback, further advancing autonomous robots [23].
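For rigid objects, one standard way to turn an extracted 3D object flow into a commandable target pose is a Kabsch/Procrustes fit between tracked points at two frames. The sketch below shows that technique on synthetic points; it is not claimed to be NovaFlow's exact solver.

```python
import numpy as np

def rigid_transform(src, dst):
    """Kabsch/Procrustes: best-fit rotation R and translation t mapping the
    tracked object points at frame 0 (src) onto frame k (dst), both (N, 3).
    For a rigid object, this turns an extracted 3D flow into a target pose
    that a gripper holding the object can be commanded to reach."""
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:          # fix a possible reflection
        Vt[-1] *= -1
        R = Vt.T @ U.T
    t = dst_c - R @ src_c
    return R, t

# Toy check: rotate a small point cloud by 30 degrees about z and shift it.
rng = np.random.default_rng(0)
pts = rng.normal(size=(50, 3))
theta = np.deg2rad(30)
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0, 0.0, 1.0]])
t_true = np.array([0.2, -0.1, 0.05])
moved = pts @ R_true.T + t_true

R_est, t_est = rigid_transform(pts, moved)
print(np.allclose(R_est, R_true), np.allclose(t_est, t_true))  # True True
```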
RoboDexVLM: general-purpose dexterous robot manipulation based on a hierarchical VLM architecture
具身智能之心· 2025-09-26 00:04
Core Insights
- RoboDexVLM is an innovative robot task-planning and grasp-detection framework for collaborative arms equipped with dexterous hands, targeting complex long-horizon tasks and diverse object manipulation [2][6].

Group 1: Framework Overview
- The framework uses a robust task planner with a task-level recovery mechanism, leveraging vision-language models to interpret and execute open-vocabulary instructions over long task sequences; a hedged sketch of such a recovery loop follows this summary [2][6].
- It introduces a language-guided dexterous grasp perception algorithm designed for zero-shot dexterous manipulation of diverse objects and instructions [2][6].
- Comprehensive experiments validate RoboDexVLM's effectiveness, adaptability, and robustness in long-horizon scenarios and dexterous grasping tasks [2][6].

Group 2: Key Features
- The framework lets robots understand natural-language commands, enabling seamless human-robot interaction [7].
- It supports zero-shot grasping of varied objects, showcasing the dexterous hand's ability to manipulate items of different shapes and sizes [7].
- The vision-language model acts as the "brain" for long-horizon task planning, ensuring the robot does not lose track of its objective [7].

Group 3: Practical Applications
- RoboDexVLM is presented as the first general-purpose dexterous robot manipulation framework built around vision-language models, moving beyond the limits of both traditional pipelines and end-to-end methods [6][7].
- Its real-world performance demonstrates its potential for embodied intelligence and human-robot collaboration [6][7].
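The task-level recovery mechanism can be pictured as a plan-execute-replan loop in which a VLM planner repairs the remaining plan when a skill fails. Everything below, including the Step/StubPlanner interface, the skill names, and the retry policy, is a hypothetical sketch, not RoboDexVLM's actual API.

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    skill: str
    args: dict = field(default_factory=dict)
    retries: int = 0

class StubPlanner:
    """Stand-in for the VLM planner; in the real system plan()/replan() would be
    prompts to a vision-language model (this interface is hypothetical)."""
    def plan(self, instruction):
        return [Step("locate", {"obj": "cup"}),
                Step("grasp", {"obj": "cup"}),
                Step("place", {"obj": "cup", "where": "tray"})]

    def replan(self, instruction, failed):
        # Repair strategy: re-detect the object, then retry the failed skill.
        return [Step("locate", {"obj": failed.args["obj"]}),
                Step(failed.skill, failed.args, failed.retries)]

def run_task(instruction, planner, skills, max_retries=2):
    """Task-level recovery loop: execute skills in order; when one fails, splice a
    repaired suffix from the planner into the plan instead of aborting the task."""
    plan, i = planner.plan(instruction), 0
    while i < len(plan):
        step = plan[i]
        if skills[step.skill](**step.args):
            i += 1
            continue
        if step.retries >= max_retries:
            return False
        step.retries += 1
        plan = plan[:i] + planner.replan(instruction, failed=step) + plan[i + 1:]
    return True

# Toy run: the first grasp attempt fails, the second (after re-locating) succeeds.
attempts = {"grasp": 0}
def grasp(obj):
    attempts["grasp"] += 1
    return attempts["grasp"] > 1

skills = {"locate": lambda obj: True, "grasp": grasp,
          "place": lambda obj, where: True}
print(run_task("put the cup on the tray", StubPlanner(), skills))  # True
```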
A new open-source model reproduces o3-style visual reasoning, achieving deep thinking without extensive training
量子位· 2025-09-15 03:59
Core Viewpoint
- The article introduces Mini-o3, an advanced vision-language model (VLM) that performs multi-round visual reasoning, significantly improving on previous models by sustaining deep reasoning across dozens of steps [1][2][15].

Group 1: Model Development
- Mini-o3 was developed jointly by ByteDance and the University of Hong Kong and is designed to perform long-horizon visual search without extensive training resources [13].
- The model extends its reasoning from a training limit of 6 rounds to dozens of rounds at test time, demonstrating strong multimodal reasoning [2][15].

Group 2: Key Design Features
- Mini-o3 rests on three design elements: the VisualProbe dataset for exploratory reasoning, an iterative data-collection pipeline for diverse reasoning strategies, and a super-round masking strategy that balances training efficiency with test-time scalability; a hedged sketch of the masking idea follows this summary [17][19][34].
- The VisualProbe dataset contains thousands of visual search challenges built specifically for deep reasoning tasks [17][38].

Group 3: Training Phases
- Training proceeds in two phases: a cold-start supervised fine-tuning (SFT) phase to activate multi-round tool use, followed by a reinforcement learning (RL) phase to optimize interaction rounds [19][25].
- The cold-start SFT phase uses a small number of manually constructed samples to generate diverse reasoning trajectories, yielding roughly 6000 cold-start reasoning paths [24][46].

Group 4: Performance Evaluation
- Mini-o3 outperforms existing models on visual search, achieving the best results across benchmarks including VisualProbe, V*Bench, and HR-Bench [43][44].
- Its performance is attributed to maintaining complex, deep reasoning trajectories, with the largest gains on the most challenging tasks [44][48].

Group 5: Experimental Insights
- Removing the RL data causes a drop of about 8.6 points on VisualProbe-Hard, highlighting the importance of challenging RL samples for eliciting complex reasoning [45].
- The super-round masking technique stabilizes RL training, particularly in multi-round interaction, and enables extended reasoning at test time [48].

Group 6: Conclusion and Future Directions
- Mini-o3's technical framework offers practical guidance for building multi-round interactive multimodal models and applying reinforcement learning to them [52].
- The research team has open-sourced all related code, encouraging further exploration in this direction [53].
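One plausible reading of the super-round masking strategy is that rollouts truncated by the training round cap are masked out of the policy loss, so the model is never penalized merely for needing more rounds than training allows. The sketch below implements that reading with a plain REINFORCE-style loss; the exact objective in the paper may differ.

```python
import torch

def masked_policy_loss(logprobs, advantages, hit_round_cap):
    """Policy-gradient loss with super-round style masking (a sketch of the idea,
    not the paper's exact objective): trajectories cut off by the training round
    cap are excluded instead of being treated as failures.

    logprobs      : (B,) summed log-probabilities of each sampled trajectory
    advantages    : (B,) advantage estimates (e.g. reward minus baseline)
    hit_round_cap : (B,) bool, True if the trajectory was truncated at the cap
    """
    keep = (~hit_round_cap).float()
    per_traj = -logprobs * advantages * keep
    # Average only over trajectories that actually finished within the cap.
    return per_traj.sum() / keep.sum().clamp(min=1.0)

logprobs = torch.tensor([-3.2, -4.1, -2.7])
advantages = torch.tensor([0.8, -0.5, 0.3])
hit_cap = torch.tensor([False, True, False])   # middle rollout ran out of rounds
print(masked_policy_loss(logprobs, advantages, hit_cap))
```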
How can beyond-line-of-sight VLA for autonomous driving be achieved? XPeng's NavigScene takes a new path!
自动驾驶之心· 2025-09-04 23:33
Core Viewpoint
- The article discusses the limitations of current autonomous driving systems in bridging local perception and global navigation, and introduces NavigScene as a way to bring navigation context into autonomous vehicles [3][4].

Group 1: Research and Development
- Autonomous driving systems have made significant progress in processing local visual information, but they struggle to integrate the broader navigation context that human drivers use [4][9].
- NavigScene is a navigation-guided natural-language dataset that simulates a human-like driving environment inside autonomous systems [5][9].
- Three complementary paradigms built on NavigScene aim to improve reasoning, preference optimization, and the integration of vision-language-action models [5][9].

Group 2: Methodologies
- Navigation-guided reasoning enhances vision-language models by injecting navigation context into the prompting process [5].
- Navigation-guided preference optimization is a reinforcement-learning approach that improves VLM responses by establishing preference relations over navigation-related information [5].
- The navigation-guided vision-language-action model integrates navigation guidance and vision-language models with conventional end-to-end driving models through feature fusion; a hedged fusion sketch follows this summary [5].

Group 3: Event and Engagement
- A live session is scheduled to discuss these advances and how NavigScene addresses the limitations of current autonomous driving systems [4][9].
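A minimal way to picture navigation-guided feature fusion is cross-attention from BEV driving features to an embedding of the navigation instruction. The module below is a generic sketch under that assumption; the dimensions, the residual design, and the class name are illustrative, not NavigScene's published architecture.

```python
import torch
import torch.nn as nn

class NavigationFusion(nn.Module):
    """Minimal sketch of navigation-guided feature fusion: BEV features from the
    driving stack attend to text features encoding the navigation instruction
    (e.g. "take the second exit at the roundabout")."""
    def __init__(self, bev_dim=256, text_dim=512, heads=4):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, bev_dim)
        self.cross_attn = nn.MultiheadAttention(bev_dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(bev_dim)

    def forward(self, bev_tokens, nav_text_tokens):
        # bev_tokens: (B, N_bev, bev_dim); nav_text_tokens: (B, N_txt, text_dim)
        nav = self.text_proj(nav_text_tokens)
        fused, _ = self.cross_attn(query=bev_tokens, key=nav, value=nav)
        return self.norm(bev_tokens + fused)     # residual: keep local perception

fusion = NavigationFusion()
bev = torch.randn(2, 100, 256)       # 2 scenes, 100 BEV tokens each
nav = torch.randn(2, 12, 512)        # 12 navigation-instruction tokens
print(fusion(bev, nav).shape)         # torch.Size([2, 100, 256])
```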