Vision-Language Models (VLM)
Yet another student we helped has landed an autonomous driving algorithm offer...
自动驾驶之心· 2025-08-23 14:44
Recently, a student about to enter the third year of his master's program came to 柱哥 to vent: his labmates are all switching to embodied intelligence, or planning to target large-model and Agent roles at the big internet companies, while he is still working on autonomous driving algorithms. Layoff news has kept surfacing across the industry since last year, and with autumn recruitment coming up next year he feels a bit lost. Everyone started out doing perception, but their paths have gradually diverged. He wanted to know whether to keep betting on the intelligent driving industry or consider switching tracks. He had only just found our 自动驾驶之心 community, thought its curriculum looked very complete, and worried that he might already be too late.

"It is never too late." Besides, you still have time to focus on directions with higher technical barriers, such as VLA or end-to-end; from there it is also easier to move into large models or embodied AI later, so there is no need to worry too much. Expanding and consolidating your technical stack as soon as possible is what matters most. If you do not yet have strong skills in independent learning and tracking down answers, come join our community, currently the largest and most complete autonomous driving learning platform in China: the "自动驾驶之心" Knowledge Planet.

The "自动驾驶之心" Knowledge Planet combines videos + articles + learning routes + Q&A + job-hunting exchange into a single comprehensive autonomous driving community, now with more than 4,000 members. We aim to grow to nearly 10,000 within the next two years, building a hub for exchange and technical sharing that both beginners and more advanced learners visit regularly. The community also routinely answers practical questions for members: How do I get started with end-to-end? How should I learn multimodal large models for autonomous driving? How to study autonomous driving VLA ...
Is Li Auto's VLA really a true VLA?
自动驾驶之心· 2025-08-21 23:34
Core Viewpoint
- The article discusses the capabilities of the MindVLA model in autonomous driving, emphasizing its advanced scene understanding and decision-making abilities compared to traditional E2E models.

Group 1: VLA Capabilities
- The VLA model demonstrates effective defensive driving, particularly in scenarios with obstructed views, by smoothly adjusting speed based on remaining distance [4][5].
- In congested traffic situations, VLA shows improved decision-making by choosing to change lanes rather than following the typical detour logic of E2E models [7].
- The VLA model exhibits enhanced lane-centering abilities in non-standard lane widths, significantly reducing the occurrence of erratic driving patterns [9][10].

Group 2: Scene Understanding
- VLA's decision-making process reflects a deeper understanding of traffic scenarios, allowing it to make more efficient lane changes and route selections [11].
- The model's ability to maintain stability in trajectory generation is attributed to its use of diffusion models, which enhances its performance in various driving conditions [10] (see the sampling sketch after this summary).

Group 3: Comparison with E2E Models
- E2E models struggle with nuanced driving behaviors, often resulting in abrupt maneuvers, while VLA provides smoother and more context-aware driving responses [3][4].
- VLA's architecture allows for parallel optimization across different scenarios, leading to faster iterations and improvements compared to E2E models [12].

Group 4: Limitations and Future Considerations
- Despite its advancements, VLA is still classified as an assistive driving technology rather than fully autonomous driving, requiring human intervention in certain situations [12].
- The article raises questions about the model's performance in specific scenarios, indicating areas for further development and refinement [12].
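To make the diffusion claim above concrete, here is a minimal sketch of what DDPM-style trajectory sampling can look like. Everything in it is an illustrative assumption rather than a detail of MindVLA: the `denoiser` network, the 20-waypoint horizon, the 50-step linear noise schedule, and the scene-embedding conditioning are all placeholders.

```python
import torch

# Hedged sketch: DDPM-style reverse diffusion over a planned trajectory.
# The denoiser, horizon, step count, and schedule are illustrative only.

HORIZON = 20   # number of future (x, y) waypoints (assumed)
STEPS = 50     # diffusion steps (assumed)
betas = torch.linspace(1e-4, 0.02, STEPS)   # linear noise schedule
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

def sample_trajectory(denoiser, scene_embedding):
    """Draw one (HORIZON, 2) trajectory by iteratively denoising Gaussian
    noise, conditioned on a fixed scene embedding."""
    traj = torch.randn(HORIZON, 2)                     # start from pure noise
    for t in reversed(range(STEPS)):
        eps_hat = denoiser(traj, t, scene_embedding)   # predicted noise
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        mean = (traj - coef * eps_hat) / torch.sqrt(alphas[t])
        noise = torch.randn_like(traj) if t > 0 else torch.zeros_like(traj)
        traj = mean + torch.sqrt(betas[t]) * noise     # one reverse step
    return traj
```

Because the output is reached through many small, conditioned denoising steps rather than a single regression, consecutive plans tend to vary smoothly, which is one plausible mechanism behind the stability the article describes.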
The tech-obsessed "Whampoa Military Academy" of autonomous driving has hit 4,000 members!
自动驾驶之心· 2025-08-15 14:23
Core Viewpoint
- The article emphasizes the establishment of a comprehensive community focused on autonomous driving, aiming to bridge the gap between academia and industry while providing valuable resources for learning and career opportunities in the field [2][16].

Group 1: Community and Resources
- The community has created a closed-loop system covering industry, academia, job seeking, and Q&A exchanges, enhancing the learning experience for participants [2][3].
- The platform offers cutting-edge academic content, industry roundtables, open-source code solutions, and timely job information, significantly reducing the time needed for research [3][16].
- Members can access nearly 40 technical routes, including industry applications, VLA benchmarks, and entry-level learning paths, catering to both beginners and advanced researchers [3][16].

Group 2: Learning and Development
- The community provides a well-structured learning path for beginners, including foundational knowledge in mathematics, computer vision, deep learning, and programming [10][12].
- For those already engaged in research, valuable industry frameworks and project proposals are available to deepen their understanding and application of autonomous driving technologies [12][14].
- Continuous job sharing and career opportunities are promoted within the community, fostering a complete ecosystem for autonomous driving [14][16].

Group 3: Technical Focus Areas
- The community has compiled extensive resources on various technical aspects of autonomous driving, including perception, simulation, planning, and control [16][17].
- Specific learning routes are available for topics such as end-to-end learning, 3DGS principles, and multi-modal large models, ensuring comprehensive coverage of the field [16][17].
- The platform also features a collection of open-source projects and datasets relevant to autonomous driving, facilitating hands-on experience and practical application [32][34].
A roundup of autonomous driving VLA work (modular / end-to-end / reasoning-enhanced)
自动驾驶之心· 2025-08-12 11:42
VLA precursor work: the VLM as an interpreter (a minimal sketch of this pattern follows the paper list)

- Paper: DriveGPT4: Interpretable End-to-end Autonomous Driving via Large Language Model
  Link: https://arxiv.org/abs/2310.01412
  Project page: https://tonyxuqaq.github.io/projects/DriveGPT4/
- Paper: TS-VLM: Text-Guided SoftSort Pooling for Vision-Language Models in Multi-View Driving Reasoning
  Link: https://arxiv.org/abs/2505.12670
  Project page: https://github.com/AiX-Lab-UWO/TS-VLM
- Paper: DynRsl-VLM: Enhancing Autonomous Driving Perception with Dynamic Resolution Vision-Language Models
  ...
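The sketch below shows the general shape of the "VLM as interpreter" pattern: feed camera frames plus a question to a vision-language model and parse back an explanation alongside control values. The `vlm.generate` interface, the response format, and the `DrivingOutput` fields are all hypothetical; DriveGPT4's actual handling of control signals differs in detail.

```python
from dataclasses import dataclass

# Hypothetical interface for the "VLM as interpreter" pattern. `vlm` is any
# object exposing a generate(images=..., prompt=...) -> str method; this is
# a placeholder, not a real library API.

@dataclass
class DrivingOutput:
    explanation: str   # natural-language justification of the maneuver
    steering: float    # normalized steering command in [-1, 1]
    throttle: float    # normalized throttle in [0, 1]

def interpret_scene(vlm, frames, question="What should the ego vehicle do, and why?"):
    """Query a VLM with camera frames and parse a structured reply.
    Assumes the model was instruction-tuned to answer as:
    '<explanation> | steering=<float> throttle=<float>'."""
    raw = vlm.generate(images=frames, prompt=question)
    text, _, controls = raw.partition("|")
    fields = dict(kv.split("=") for kv in controls.split())
    return DrivingOutput(text.strip(),
                         float(fields["steering"]),
                         float(fields["throttle"]))
```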
I had made up my mind to go into embodied AI, but now I'm having second thoughts...
自动驾驶之心· 2025-08-11 12:17
Core Insights
- Embodied intelligence is a hot topic this year, transitioning from previous years' silence to last year's frenzy, and now gradually cooling down as the industry realizes that embodied robots are far from being productive [1]

Group 1: Industry Trends
- The demand for multi-sensor fusion and positioning in robotics is significant, with a focus on SLAM and ROS technologies [3]
- Many robotics companies are rapidly developing and have secured considerable funding, indicating a promising future for the sector [3]
- Traditional robotics remains the main product line, despite the excitement around embodied intelligence [3]

Group 2: Community and Resources
- The community has established a closed loop across various fields including industry, academia, and job seeking, aiming to create a valuable exchange platform [4][6]
- The community offers access to over 40 technical routes and invites industry leaders for discussions, enhancing learning and networking opportunities [6][20]
- Members can freely ask questions regarding job choices or research directions, receiving guidance from experienced professionals [83]

Group 3: Educational Content
- Comprehensive resources for beginners and advanced learners are available, including technical stacks and learning roadmaps for autonomous driving and robotics [13][16]
- The community has compiled a list of notable domestic and international research labs and companies in the autonomous driving and robotics sectors, aiding members in their academic and career pursuits [27][29]
"How many fingers does a hand have?" Did your GPT-5 get it right?
机器之心· 2025-08-11 10:40
Core Viewpoint
- The article discusses the limitations of advanced language models like GPT-5 in understanding basic visual concepts, highlighting the need for vision-centric models to improve visual comprehension and reasoning capabilities [2][26].

Group 1
- Tairan He points out that while language is a powerful tool, it struggles to fully meet the needs of the vision and robotics fields [2].
- There is a call for the development of vision-centric language models (VLM) and vision-language-action (VLA) models to address these shortcomings [3].
- The ambiguity in the definition of "fingers" illustrates the challenges language models face in interpreting visual information accurately [4][6].

Group 2
- Even top models like Gemini 2.5 Pro have failed to provide correct answers to basic questions, indicating a lack of robust visual understanding [10][24].
- Tairan He references a paper by the Sseynin team that proposes a rigorous evaluation method for assessing the visual capabilities of multimodal large language models (MLLM) [28].
- The new benchmark test, CV-Bench, focuses on evaluating models' abilities in object counting, spatial reasoning, and depth perception, establishing stricter assessment standards [31].

Group 3
- Research shows that while advanced VLMs can achieve 100% accuracy in recognizing common objects, their performance drops to about 17% when dealing with counterfactual images [33] (a toy evaluation illustrating this gap follows this summary).
- VLMs rely on memorized knowledge rather than true visual analysis, which limits their effectiveness [34].
- Martin Ziqiao Ma argues that initializing VLA models with large language models is a tempting but misleading approach, as it does not address fundamental perception issues [36].
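The memorization-versus-perception gap described in Group 3 is easy to demonstrate with a toy harness. The item format and the stub model below are assumptions for illustration only; CV-Bench's real data format is richer.

```python
# Toy CV-Bench-style counting evaluation. The stub model "memorizes" the
# canonical answer instead of looking at the image, reproducing the failure
# mode described above. The item format is an assumption for illustration.

def accuracy(ask, items):
    """items: (image_path, question, expected_answer) triples;
    `ask` is any callable wrapping a model."""
    correct = sum(ask(img, q).strip() == exp for img, q, exp in items)
    return correct / len(items)

def stub_vlm(image_path, question):
    return "5"   # always answers from prior knowledge, never from pixels

standard = [("hand_normal.jpg", "How many fingers does this hand have?", "5")]
counterfactual = [("hand_six_fingers.jpg", "How many fingers does this hand have?", "6")]

print(accuracy(stub_vlm, standard))        # 1.0: looks perfect on ordinary images
print(accuracy(stub_vlm, counterfactual))  # 0.0: collapses when the count is edited
```

Pairing each ordinary image with an edited counterfactual variant is what separates genuine counting from recitation, matching the roughly 100% versus 17% gap the article cites.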
Twenty years of autonomous driving, and this "Whampoa Military Academy" of the field has never stopped refining its craft...
自动驾驶之心· 2025-08-09 16:03
Core Viewpoint
- The article emphasizes the ongoing evolution and critical phase of the autonomous driving industry, highlighting the transition from modular approaches to end-to-end/VLA methods, and the community's commitment to fostering knowledge and collaboration in this field [2][4].

Group 1: Industry Development
- Since Google began researching autonomous driving technology in 2009, the industry has progressed significantly and is now entering a crucial phase of development [2].
- The community aims to integrate intelligent driving into daily transportation, reflecting a growing expectation for advancements in autonomous driving capabilities [2].

Group 2: Community Initiatives
- The community has established a knowledge-sharing platform, offering resources across various domains such as industry insights, academic research, and job opportunities [2][4].
- Plans to enhance community engagement include monthly online discussions and roundtable interviews with industry and academic leaders [2].

Group 3: Educational Resources
- The community has compiled over 40 technical routes to assist individuals at different levels, from beginners to those seeking advanced knowledge in autonomous driving [4][16].
- A comprehensive entry-level technical stack and roadmap have been developed for newcomers to the field [9].

Group 4: Job Opportunities and Networking
- The community has established internal referral mechanisms with multiple autonomous driving companies, facilitating job placements for members [7][14].
- Continuous job sharing and networking opportunities are provided to create a complete ecosystem for autonomous driving professionals [14][80].

Group 5: Research and Technical Focus
- The community has gathered extensive resources on research areas including 3D object detection, BEV perception, and multi-sensor fusion, to support practical applications in autonomous driving [16][30][32].
- Detailed summaries of cutting-edge topics such as end-to-end driving, world models, and vision-language models (VLM) have been compiled to keep members informed about the latest advancements [34][40][42].
Large-model solutions for autonomous driving: an overview of vision-language model (VLM) work, for both mass production and research
自动驾驶之心· 2025-08-06 23:34
Core Insights
- The article emphasizes the transformative potential of Vision-Language Models (VLMs) in enhancing the perception and cognitive capabilities of autonomous driving systems, enabling them not only to "see" but also to "understand" complex driving environments [2][3].

Group 1: VLM Applications in Autonomous Driving
- VLMs can surpass traditional visual models by integrating camera images or video streams to comprehend semantic information in traffic scenes, such as recognizing complex scenarios like "a pedestrian waving to cross the street" [6].
- VLMs facilitate the conversion of intricate visual scenes into clear natural language descriptions, enhancing the interpretability of decisions made by autonomous systems, which aids in debugging and increases trust among passengers and regulators [6].
- VLMs are crucial for natural language interactions in future smart cabins, allowing passengers to communicate intentions to vehicles through spoken commands [6].

Group 2: Scenario Generation and Testing
- CrashAgent is a multi-agent framework that utilizes multi-modal large language models to convert accident reports into structured scenarios for simulation environments, addressing the long-tail distribution issue in existing datasets [7] (a minimal sketch of this report-to-scenario step follows this summary).
- CurricuVLM is a personalized curriculum learning framework that leverages VLMs to analyze agent behavior and dynamically generate tailored training scenarios, improving safety in autonomous driving [13].
- TRACE is a framework that generates key test cases from real accident reports, significantly enhancing the efficiency of defect detection in autonomous driving systems [17].

Group 3: Out-of-Distribution (OOD) Scenario Generation
- A framework utilizing large language models is proposed to generate diverse OOD driving scenarios, addressing the challenges posed by the sparsity of such scenarios in urban driving datasets [21][22].
- The article also covers a method to automatically convert real-world driving videos into detailed simulation scenarios, enhancing the testing of autonomous driving systems [26].

Group 4: Enhancing Safety and Robustness
- WEDGE is a synthetic dataset created from generative vision-language models, aimed at improving the robustness of perception systems in extreme weather conditions [39][40].
- LKAlert is a predictive alert system that utilizes VLMs to forecast potential lane-keeping assist (LKA) risks, enhancing driver situational awareness and trust [54][55].

Group 5: Advancements in Decision-Making Frameworks
- The CBR-LLM framework combines semantic scene understanding with case retrieval to enhance decision-making in complex driving scenarios, improving accuracy and reasoning consistency [44][45].
- ORION is a holistic end-to-end autonomous driving framework that integrates visual-language instructed action generation, achieving superior performance in closed-loop evaluations [69][70].
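To ground the CrashAgent item in Group 2, here is a minimal sketch of the report-to-scenario step: prompt a language model for a machine-readable scenario and sanity-check the result. The JSON schema and the `llm` callable are assumptions; the paper's actual pipeline is multi-agent and considerably more elaborate.

```python
import json

# Hedged sketch of turning a free-text accident report into a structured
# simulation scenario. Schema and `llm` interface are illustrative only.

SCHEMA_HINT = """Return only JSON with keys:
road_type (str), weather (str), ego_action (str),
actors (list of {type, position, behavior})."""

def report_to_scenario(llm, report_text):
    """`llm` is any callable mapping a prompt string to a completion string."""
    prompt = f"{SCHEMA_HINT}\n\nAccident report:\n{report_text}"
    scenario = json.loads(llm(prompt))
    # Fail fast if the model omitted required fields.
    missing = {"road_type", "weather", "ego_action", "actors"} - scenario.keys()
    if missing:
        raise ValueError(f"scenario missing keys: {missing}")
    return scenario   # next step: hand off to a simulator as a concrete scene
```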
4,000 members strong: what exactly has the tech-obsessed "Whampoa Military Academy" of autonomous driving been doing?
自动驾驶之心· 2025-07-31 06:19
Core Viewpoint
- The article emphasizes the importance of creating an engaging learning environment in the field of autonomous driving and AI, aiming to bridge the gap between industry and academia while providing valuable resources for students and professionals [1].

Group 1: Community and Resources
- The community has established a closed loop across industry, academia, job seeking, and Q&A exchanges, built around the question of what kind of community is actually needed [1][2].
- The platform offers cutting-edge academic content, industry roundtables, open-source code solutions, and timely job information, streamlining the search for resources [2][3].
- A comprehensive technical roadmap with over 40 technical routes has been organized, covering everything from production applications to the latest VLA benchmarks [2][14].

Group 2: Educational Content
- The community provides a series of original live courses and video tutorials covering topics such as automatic labeling, data processing, and simulation engineering [4][10].
- Various learning paths are available for beginners, alongside advanced resources for those already engaged in research, ensuring a supportive environment for all levels [8][10].
- The community has compiled a wealth of open-source projects and datasets related to autonomous driving, facilitating quick access to essential materials [25][27].

Group 3: Job Opportunities and Networking
- The platform has established a job referral mechanism with multiple autonomous driving companies, allowing members to submit their resumes directly to desired employers [4][11].
- Continuous job sharing and position updates contribute to a complete ecosystem for autonomous driving professionals [11][14].
- Members can freely ask questions about career choices and research directions, receiving guidance from industry experts [75].

Group 4: Technical Focus Areas
- The community covers a wide range of technical focus areas including perception, simulation, planning, and control, with detailed learning routes for each [15][29].
- Specific topics such as 3D object detection, BEV perception, and online high-definition mapping are thoroughly organized, reflecting current industry trends and research hotspots [42][48].
- The platform also addresses emerging technologies like vision-language models (VLM) and diffusion models, providing insights into their applications in autonomous driving [35][40].
From the Institute of Automation, Chinese Academy of Sciences: sharing a Vision-Tactile-Language-Action model design and dataset construction
具身智能之心· 2025-07-30 00:02
Core Viewpoint
- The article discusses the development of a Vision-Tactile-Language-Action (VTLA) model aimed at enhancing robot manipulation in contact-rich tasks by integrating visual and tactile inputs with language instructions [2].

Group 1: Model Development
- The VTLA framework addresses the gap in applying vision-language models (VLM) to language-conditioned robotic manipulation, especially beyond visually dominated tasks [2].
- A low-cost multimodal dataset was created in a simulated environment, specifically designed for fingertip insertion tasks, consisting of visual-tactile-action-instruction pairs [2] (a sketch of what one such pair might look like follows this summary).

Group 2: Performance and Results
- The VTLA model achieved over a 90% success rate on unseen hole types, significantly outperforming traditional imitation learning methods and existing multimodal baselines [2].
- The model's capability was validated through real-world peg-in-hole assembly experiments, demonstrating strong simulation-to-reality (Sim2Real) transfer [2].
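For a sense of what a visual-tactile-action-instruction pair might contain, here is a hedged sketch of one training sample for a fingertip insertion task. Every field name and shape below is an assumption; the article does not specify the dataset's actual schema.

```python
from dataclasses import dataclass
import numpy as np

# Hypothetical structure of one VTLA training pair. Shapes, field names,
# and the action parameterization are assumptions, not the released format.

@dataclass
class VTLASample:
    rgb: np.ndarray       # wrist-camera image, e.g., (224, 224, 3) uint8
    tactile: np.ndarray   # fingertip pressure map, e.g., (16, 16) float32
    instruction: str      # language goal for the episode
    action: np.ndarray    # commanded end-effector delta (dx, dy, dz, d_yaw)

sample = VTLASample(
    rgb=np.zeros((224, 224, 3), dtype=np.uint8),
    tactile=np.zeros((16, 16), dtype=np.float32),
    instruction="insert the peg into the round hole",
    action=np.array([0.0, 0.001, -0.002, 0.0]),
)
```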