Vision-Language Models (VLM)
A Roundup of VLA Work for Autonomous Driving (Modular / End-to-End / Reasoning-Enhanced)
自动驾驶之心· 2025-08-12 11:42
Core Insights
- The article focuses on the development and algorithms of Vision-Language-Action (VLA) models in autonomous driving over the past two years, providing a comprehensive overview of research papers and projects in this field [1].

Group 1: VLA Preceding Work
- The article mentions several key papers that serve as interpreters for VLA, including "DriveGPT4" and "TS-VLM", which focus on enhancing autonomous driving perception through large language models [3].
- Additional papers such as "DynRsl-VLM" are highlighted for their contributions to improving perception in autonomous driving [3].

Group 2: Modular VLA
- The article lists various end-to-end VLA models, such as "RAG-Driver" and "OpenDriveVLA", which aim to generalize driving explanations and enhance autonomous driving capabilities [4].
- Other notable models include "DriveMoE" and "LangCoop", which focus on collaborative driving and knowledge-enhanced safe driving [4].

Group 3: Enhanced Reasoning in VLA
- The article discusses models such as "ADriver-I" and "EMMA", which contribute to the development of general world models and multimodal approaches for autonomous driving [6].
- Papers such as "DiffVLA" and "S4-Driver" are mentioned for their innovative approaches to planning and representation in autonomous driving [6].

Group 4: Community and Resources
- The article highlights a community for knowledge sharing in autonomous driving, featuring over 40 technical routes and inviting industry experts for discussions [7].
- It also notes the availability of job opportunities and a comprehensive entry-level technical stack for newcomers to the field [12][14].

Group 5: Educational Resources
- The article provides a structured learning roadmap for various aspects of autonomous driving, including perception, simulation, and planning and control [15].
- It mentions the compilation of numerous datasets and open-source projects to facilitate learning and research in the autonomous driving sector [15].
I had decided to move into embodied AI, but now I'm having second thoughts...
自动驾驶之心· 2025-08-11 12:17
Core Insights
- Embodied intelligence is a hot topic this year, transitioning from the silence of previous years to last year's frenzy, and is now gradually cooling down as the industry realizes that embodied robots are far from being productive [1].

Group 1: Industry Trends
- The demand for multi-sensor fusion and positioning in robotics is significant, with a focus on SLAM and ROS technologies [3].
- Many robotics companies are developing rapidly and have secured considerable funding, indicating a promising future for the sector [3].
- Traditional robotics remains the main product line, despite the excitement around embodied intelligence [3].

Group 2: Community and Resources
- The community has established a closed loop across industry, academia, and job seeking, aiming to create a valuable exchange platform [4][6].
- The community offers access to over 40 technical routes and invites industry leaders for discussions, enhancing learning and networking opportunities [6][20].
- Members can freely ask questions about job choices or research directions and receive guidance from experienced professionals [83].

Group 3: Educational Content
- Comprehensive resources for beginners and advanced learners are available, including technical stacks and learning roadmaps for autonomous driving and robotics [13][16].
- The community has compiled a list of notable domestic and international research labs and companies in the autonomous driving and robotics sectors, aiding members in their academic and career pursuits [27][29].
"How many fingers does a hand have?" Did your GPT-5 get it right?
机器之心· 2025-08-11 10:40
Core Viewpoint
- The article discusses the limitations of advanced language models like GPT-5 in understanding basic visual concepts, highlighting the need for vision-centric models to improve visual comprehension and reasoning capabilities [2][26].

Group 1
- Tairan He points out that while language is a powerful tool, it struggles to fully meet the needs of the vision and robotics fields [2].
- There is a call for the development of vision-centric language models (VLM) and vision-language-action (VLA) models to address these shortcomings [3].
- The ambiguity in the definition of "fingers" illustrates the challenges language models face in interpreting visual information accurately [4][6].

Group 2
- The article notes that even top models like Gemini 2.5 Pro have failed to answer basic questions correctly, indicating a lack of robust visual understanding [10][24].
- Tairan He references a paper by the Sseynin team that proposes a rigorous evaluation method for assessing the visual capabilities of multimodal large language models (MLLM) [28].
- The new benchmark, CV-Bench, focuses on evaluating models' abilities in object counting, spatial reasoning, and depth perception, establishing stricter assessment standards (a minimal evaluation sketch follows this summary) [31].

Group 3
- Research shows that while advanced VLMs can achieve 100% accuracy in recognizing common objects, their performance drops to about 17% on counterfactual images [33].
- The article emphasizes that VLMs rely on memorized knowledge rather than true visual analysis, which limits their effectiveness [34].
- Martin Ziqiao Ma argues that initializing VLA models with large language models is a tempting but misleading approach, as it does not address fundamental perception issues [36].
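To make the CV-Bench-style protocol concrete, here is a minimal sketch of a multiple-choice evaluation loop for counting and spatial-reasoning questions. The sample fields and the `model.answer()` call are illustrative assumptions, not the benchmark's actual schema or API.

```python
# Minimal sketch of a CV-Bench-style multiple-choice evaluation loop.
# The dataset fields and the model.answer() interface are assumptions
# for illustration, not the benchmark's actual API.

from dataclasses import dataclass

@dataclass
class VQASample:
    image_path: str      # path to the test image
    question: str        # e.g. "How many fingers are raised?"
    choices: list[str]   # multiple-choice options, e.g. ["4", "5", "6"]
    answer: str          # ground-truth choice

def evaluate(model, samples: list[VQASample]) -> float:
    """Return accuracy of `model` over counting / spatial-reasoning items."""
    correct = 0
    for s in samples:
        prompt = (
            f"{s.question}\nOptions: " + ", ".join(s.choices)
            + "\nAnswer with exactly one option."
        )
        prediction = model.answer(image=s.image_path, prompt=prompt)  # hypothetical call
        correct += int(prediction.strip() == s.answer)
    return correct / len(samples)
```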
Twenty years of autonomous driving: this "Whampoa Academy" of the field keeps honing its craft...
自动驾驶之心· 2025-08-09 16:03
Core Viewpoint
- The article emphasizes the ongoing evolution and critical phase of the autonomous driving industry, highlighting the transition from modular approaches to end-to-end/VLA methods and the community's commitment to fostering knowledge and collaboration in this field [2][4].

Group 1: Industry Development
- Since Google began researching autonomous driving technology in 2009, the industry has progressed significantly and is now entering a crucial phase of development [2].
- The community aims to integrate intelligent driving into daily transportation, reflecting growing expectations for advances in autonomous driving capabilities [2].

Group 2: Community Initiatives
- The community has established a knowledge-sharing platform offering resources across industry insights, academic research, and job opportunities [2][4].
- Plans to enhance community engagement include monthly online discussions and roundtable interviews with industry and academic leaders [2].

Group 3: Educational Resources
- The community has compiled over 40 technical routes to assist individuals at different levels, from beginners to those seeking advanced knowledge of autonomous driving [4][16].
- A comprehensive entry-level technical stack and roadmap have been developed for newcomers to the field [9].

Group 4: Job Opportunities and Networking
- The community has established internal referral mechanisms with multiple autonomous driving companies, facilitating job placements for members [7][14].
- Continuous job sharing and networking opportunities are provided, creating a complete ecosystem for autonomous driving professionals [14][80].

Group 5: Research and Technical Focus
- The community has gathered extensive resources on research areas including 3D object detection, BEV perception, and multi-sensor fusion to support practical applications in autonomous driving [16][30][32].
- Detailed summaries of cutting-edge topics such as end-to-end driving, world models, and vision-language models (VLM) have been compiled to keep members informed about the latest advances [34][40][42].
Large-model solutions for autonomous driving: an overview of vision-language model (VLM) work, for both mass production and research
自动驾驶之心· 2025-08-06 23:34
Core Insights
- The article emphasizes the transformative potential of Vision-Language Models (VLMs) in enhancing the perception and cognitive capabilities of autonomous driving systems, enabling them not only to "see" but also to "understand" complex driving environments [2][3].

Group 1: VLM Applications in Autonomous Driving
- VLMs can surpass traditional visual models by integrating camera images or video streams to comprehend semantic information in traffic scenes, such as recognizing complex scenarios like "a pedestrian waving to cross the street" [6].
- VLMs convert intricate visual scenes into clear natural language descriptions, enhancing the interpretability of decisions made by autonomous systems, which aids debugging and increases trust among passengers and regulators [6].
- VLMs are crucial for natural language interaction in future smart cabins, allowing passengers to communicate intentions to vehicles through spoken commands [6].

Group 2: Scenario Generation and Testing
- The article introduces CrashAgent, a multi-agent framework that uses multimodal large language models to convert accident reports into structured scenarios for simulation environments, addressing the long-tail distribution issue in existing datasets [7].
- CurricuVLM is a personalized curriculum learning framework that leverages VLMs to analyze agent behavior and dynamically generate tailored training scenarios, improving safety in autonomous driving [13].
- TRACE is a framework that generates key test cases from real accident reports, significantly improving the efficiency of defect detection in autonomous driving systems [17].

Group 3: Out-of-Distribution (OOD) Scenario Generation
- A framework based on large language models is proposed to generate diverse OOD driving scenarios, addressing the sparsity of such scenarios in urban driving datasets [21][22].
- The article also discusses a method for automatically converting real-world driving videos into detailed simulation scenarios, enhancing the testing of autonomous driving systems [26].

Group 4: Enhancing Safety and Robustness
- WEDGE is a synthetic dataset created with generative vision-language models, aimed at improving the robustness of perception systems in extreme weather conditions [39][40].
- LKAlert is a predictive alert system that uses VLMs to forecast potential lane-keeping assist (LKA) risks, enhancing driver situational awareness and trust [54][55].

Group 5: Advancements in Decision-Making Frameworks
- The CBR-LLM framework combines semantic scene understanding with case retrieval to enhance decision-making in complex driving scenarios, improving accuracy and reasoning consistency [44][45].
- ORION is a holistic end-to-end autonomous driving framework that integrates vision-language instructed action generation, achieving superior performance in closed-loop evaluations [69][70].
Now 4,000 members strong: what exactly has this technology-obsessed "Whampoa Academy" of autonomous driving been doing?
自动驾驶之心· 2025-07-31 06:19
Core Viewpoint
- The article emphasizes the importance of creating an engaging learning environment in the fields of autonomous driving and AI, aiming to bridge the gap between industry and academia while providing valuable resources for students and professionals [1].

Group 1: Community and Resources
- The community has built a closed loop across industry, academia, job seeking, and Q&A exchanges, focused on what kind of community is actually needed [1][2].
- The platform offers cutting-edge academic content, industry roundtables, open-source code solutions, and timely job information, streamlining the search for resources [2][3].
- A comprehensive technical roadmap with over 40 technical routes has been organized, catering to various interests from consulting applications to the latest VLA benchmarks [2][14].

Group 2: Educational Content
- The community provides a series of original live courses and video tutorials covering topics such as automatic labeling, data processing, and simulation engineering [4][10].
- Learning paths are available for beginners, along with advanced resources for those already engaged in research, ensuring a supportive environment for all levels [8][10].
- The community has compiled a wealth of open-source projects and datasets related to autonomous driving, facilitating quick access to essential materials [25][27].

Group 3: Job Opportunities and Networking
- The platform has established a job referral mechanism with multiple autonomous driving companies, allowing members to submit their resumes directly to desired employers [4][11].
- Continuous job sharing and position updates are provided, contributing to a complete ecosystem for autonomous driving professionals [11][14].
- Members can freely ask questions about career choices and research directions and receive guidance from industry experts [75].

Group 4: Technical Focus Areas
- The community covers a wide range of technical focus areas including perception, simulation, planning, and control, with detailed learning routes for each [15][29].
- Specific topics such as 3D object detection, BEV perception, and online high-definition (HD) mapping are thoroughly organized, reflecting current industry trends and research hotspots [42][48].
- The platform also addresses emerging technologies such as vision-language models (VLM) and diffusion models, providing insights into their applications in autonomous driving [35][40].
From the Institute of Automation, Chinese Academy of Sciences: a Vision-Tactile-Language-Action model design and dataset-creation walkthrough
具身智能之心· 2025-07-30 00:02
Core Viewpoint
- The article discusses the development of a Vision-Tactile-Language-Action (VTLA) model aimed at enhancing robot manipulation in contact-rich tasks by integrating visual and tactile inputs with language instructions [2].

Group 1: Model Development
- The VTLA framework addresses the gap in applying vision-language models (VLM) to language-conditioned robotic manipulation, especially beyond visually dominated tasks [2].
- A low-cost multimodal dataset was created in a simulated environment, specifically designed for fingertip insertion tasks, containing visual-tactile-action-instruction pairs (an illustrative sample layout is sketched after this summary) [2].

Group 2: Performance and Results
- The VTLA model achieved over a 90% success rate on unseen hole types, significantly outperforming traditional imitation learning methods and existing multimodal baselines [2].
- The model's capability was validated through real-world peg-in-hole assembly experiments, demonstrating strong simulation-to-reality (Sim2Real) transfer [2].
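To illustrate what a visual-tactile-action-instruction pair might look like, here is a minimal sketch of one training sample. The field names, array shapes, and the 6-DoF delta-pose action are assumptions for illustration only; the released dataset's schema may differ.

```python
# Illustrative layout for one visual-tactile-action-instruction sample;
# field names and shapes are assumptions, not the released dataset schema.

from dataclasses import dataclass
import numpy as np

@dataclass
class VTLASample:
    rgb_image: np.ndarray    # (H, W, 3) scene or wrist camera frame
    tactile_map: np.ndarray  # (T_h, T_w) fingertip tactile sensor reading
    instruction: str         # e.g. "Insert the peg into the square hole"
    action: np.ndarray       # (6,) end-effector delta pose [dx, dy, dz, droll, dpitch, dyaw]

def to_training_example(sample: VTLASample) -> dict:
    """Flatten one sample into the dict a multimodal policy trainer would consume."""
    return {
        "pixels": sample.rgb_image,
        "tactile": sample.tactile_map,
        "text": sample.instruction,
        "label": sample.action,
    }
```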
After watching the Oscar winners, a VLM sets a new SOTA in cinematography understanding | Open-sourced by Shanghai AI Lab
量子位· 2025-07-16 01:49
Core Insights
- The article discusses the launch of ShotBench, a comprehensive benchmark for understanding film language, along with the ShotVL model and the ShotQA dataset, aimed at enhancing vision-language models (VLMs) in film comprehension [1][6][15].

Group 1: ShotBench and Its Components
- ShotBench includes over 3,500 expert-annotated image and video question-answer pairs from more than 200 acclaimed films, covering eight key dimensions of cinematography [1][8].
- The ShotQA dataset consists of approximately 70,000 question-answer pairs, specifically designed to align models with "cinematic language" [15][19].
- The benchmark is structured to evaluate models from a professional cinematographer's perspective, focusing on extracting visual cues and reasoning about cinematic techniques [8][14].

Group 2: Performance Evaluation
- Evaluation of 24 leading VLMs revealed significant limitations: even the best models achieved an average accuracy below 60% and struggled in particular with fine-grained visual cues and complex spatial reasoning [3][6].
- ShotVL-3B achieved a 19% improvement over the baseline model Qwen2.5-VL-3B, establishing new state-of-the-art (SOTA) performance in film language understanding [3][24].
- ShotVL outperformed both the best open-source model (Qwen2.5-VL-72B-Instruct) and proprietary models (GPT-4o) across all evaluated dimensions [3][24].

Group 3: Training Methodology
- ShotVL employs a two-phase training process: large-scale supervised fine-tuning (SFT) to acquire broad knowledge, followed by group relative policy optimization (GRPO) to strengthen fine-grained reasoning (a minimal GRPO advantage sketch follows this summary) [15][19][20].
- The first phase used approximately 70,000 question-answer pairs from the ShotQA dataset to establish strong alignment between visual features and specific cinematic terms [19].
- The second phase focused on improving reasoning capability and prediction accuracy, demonstrating the effectiveness of the GRPO approach [20][28].

Group 4: Key Dimensions of Cinematography
- The eight core dimensions covered by ShotBench are Shot Size, Shot Framing, Camera Angle, Lens Size, Lighting Type, Lighting Condition, Composition, and Camera Movement, each critical for understanding film language [11][16][17].
- Each dimension is represented by a substantial number of samples, ensuring comprehensive coverage for model evaluation [17].

Group 5: Open Source Contribution
- The team has open-sourced the model, data, and code to facilitate rapid development in AI-driven film understanding and generation [4][30].
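As a rough illustration of the GRPO step used in the second training phase, the sketch below computes the group-relative advantage: several responses are sampled per question, and each response's reward is normalized against its group's mean and standard deviation. The 0/1 reward shown is a placeholder, not ShotVL's actual reward design.

```python
# Minimal sketch of the group-relative advantage used in GRPO-style training:
# several responses are sampled per prompt, and each response's advantage is
# its reward normalized against the group mean and standard deviation.

import numpy as np

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """rewards: shape (G,) -- one scalar reward per sampled response in the group."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: 4 sampled answers to one cinematography question, rewarded 1.0 if the
# predicted shot size matches the annotation, else 0.0 (placeholder reward).
rewards = np.array([1.0, 0.0, 0.0, 1.0])
advantages = group_relative_advantages(rewards)
print(advantages)  # correct answers get positive advantage, incorrect negative
```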
CEED-VLA: 4× inference speedup for VLA models with consistency distillation and early-exit decoding!
具身智能之心· 2025-07-10 13:16
Core Viewpoint
- The article presents CEED-VLA, a model that significantly accelerates the inference of vision-language-action models while maintaining manipulation performance, making it suitable for high-frequency dexterous tasks [2][30].

Group 1: Model Development
- CEED-VLA accelerates inference with a general method that improves performance across multiple tasks [2].
- The model incorporates a consistency distillation mechanism and mixed-label supervision, enabling accurate prediction of high-quality actions from various intermediate decoding states [2][6].
- An early-exit decoding strategy is introduced to address inefficiencies in the Jacobi decoding process, achieving up to a 4.1× inference speedup and more than a 4.3× increase in execution frequency (a conceptual decoding sketch follows this summary) [2][15].

Group 2: Experimental Results
- Simulations and real-world experiments demonstrate that CEED-VLA significantly improves inference efficiency while maintaining comparable task success rates [6][30].
- The model achieves a 2.00× speedup over the teacher model and fixes a larger number of tokens per decoding iteration, indicating improved performance [19][20].
- In real-world evaluations, CEED-VLA successfully completes dexterous tasks with a success rate exceeding 70%, thanks to the enhanced inference speed and control frequency [30][31].
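To convey the idea behind early-exit decoding, here is a conceptual sketch of a Jacobi-style parallel decoding loop that stops once few tokens change between iterations, rather than waiting for a full fixed point. `init_draft` and `parallel_refine` are hypothetical helpers standing in for the model's forward pass; this is a sketch of the technique, not CEED-VLA's actual implementation.

```python
# Conceptual sketch of Jacobi-style parallel decoding with an early-exit rule:
# all positions in an action block are refined in parallel, and iteration stops
# once the number of changed tokens falls below a threshold instead of waiting
# for full convergence.

def early_exit_jacobi_decode(model, prompt_ids, block_len, max_iters=16, exit_threshold=1):
    # Start from an arbitrary draft for the whole action block (hypothetical helper).
    draft = model.init_draft(prompt_ids, block_len)
    for _ in range(max_iters):
        # One parallel forward pass re-predicts every position given the current draft.
        refined = model.parallel_refine(prompt_ids, draft)  # hypothetical helper
        num_changed = sum(a != b for a, b in zip(draft, refined))
        draft = refined
        if num_changed <= exit_threshold:  # early exit: "good enough" fixed point
            break
    return draft
```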
AI starts "playing with the computer on its own": Jilin University proposes the "ScreenExplorer" agent
机器之心· 2025-06-27 04:02
Core Viewpoint
- The article discusses ScreenExplorer, a vision-language model (VLM) agent designed to autonomously explore and interact with open graphical user interface (GUI) environments, marking a significant step toward general artificial intelligence (AGI) [2][3][35].

Group 1: Breakthroughs and Innovations
- The research introduces three core breakthroughs in training VLM agents for GUI exploration [6].
- A real-time interactive online reinforcement learning framework is established, allowing the VLM agent to interact with a live GUI environment [8][11].
- A "curiosity mechanism" addresses the sparse-feedback problem in open GUI environments, motivating the agent to explore diverse interface states (a minimal curiosity-reward sketch follows this summary) [10][12].

Group 2: Training Methodology
- Training uses a heuristic and world-model-driven reward system that encourages exploration by providing immediate rewards for diverse actions [12][24].
- The GRPO algorithm is used for reinforcement learning, computing each action's advantage from the rewards obtained [14][15].
- The training process runs multiple parallel environments that synchronize reasoning, execution, and recording, enabling "learning by doing" [15].

Group 3: Experimental Results
- Initial experiments show that, without training, the Qwen2.5-VL-3B model fails to interact effectively with the GUI [17].
- After training, the model demonstrates improved capabilities, successfully opening applications and navigating deeper into pages [18][20].
- The ScreenExplorer models outperform general models in exploration diversity and interaction effectiveness, indicating a significant advance in autonomous GUI interaction [22][23].

Group 4: Skill Emergence and Conclusion
- The training process leads to the emergence of new skills, such as cross-modal translation and complex reasoning [29][34].
- The research concludes that ScreenExplorer effectively enhances GUI interaction capabilities by combining exploration rewards, world models, and GRPO reinforcement learning, paving the way for more autonomous agents and progress toward AGI [35].
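As a rough sketch of a world-model-based curiosity signal like the one described above: the reward is the world model's prediction error on the observed screen transition, so unfamiliar interface states yield larger rewards. The encoder and world-model modules here are placeholders, not ScreenExplorer's actual networks.

```python
# Minimal sketch of a world-model-based curiosity reward: the agent is rewarded
# in proportion to how poorly the world model predicted the next screen state,
# so novel interface states yield larger rewards.

import torch
import torch.nn.functional as F

def curiosity_reward(world_model, encoder, screen_t, action_t, screen_t1) -> float:
    """Reward = prediction error of the world model on the observed transition."""
    with torch.no_grad():
        z_t = encoder(screen_t)              # latent of the current screenshot
        z_t1 = encoder(screen_t1)            # latent of the next screenshot
        z_pred = world_model(z_t, action_t)  # predicted next latent
        return F.mse_loss(z_pred, z_t1).item()
```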