Diffusion Multi-Modal Trajectory Output

自动驾驶之心· 2025-09-07 23:34

Core Viewpoint - The article discusses the evolution and current state of autonomous driving paradigms, focusing on the transition from end-to-end systems to Vision-Language-Action (VLA) frameworks and the challenges faced in achieving effective multi-modal trajectory outputs [2][3][11].

Group 1: End-to-End Systems
- The end-to-end autonomous driving network directly maps raw sensor inputs to control commands, eliminating traditional processing steps and maximizing information retention [4].
- Iterative engineering practice involves clustering bad cases and retraining models, but updates often introduce new issues [8].
- Tesla's "daily update model" offers a solution by continuously evolving the model through the integration of bad cases into training samples [9].

Group 2: Emergence of Dual Systems
- The introduction of large language models (LLMs) has led to the rapid adoption of the "end-to-end + VLM" dual system approach, which enhances generalization in zero-shot and few-shot scenarios [11].
- Early VLMs focused on recognizing specific semantics, and the EMMA architecture incorporates reasoning to assist in vehicle control [12].

Group 3: VLA and Diffusion Framework
- The VLA framework outputs driving commands that are processed by a diffusion decoder to generate safe and smooth vehicle trajectories [16].
- Current challenges in the VLA + diffusion architecture include subpar multi-modal trajectory outputs, the "brain split" issue between VLA and diffusion systems, and the quality of single-modal trajectories [18][19].
- The alignment of language and action (LA alignment) remains a critical challenge, as the practical value of language models in autonomous driving is still uncertain [19].

Group 4: Future Directions
- Future work should focus on scalable system solutions that leverage data advantages and enhance the capabilities of foundational models through reinforcement learning [20][22].
- The "generate + score" paradigm has proven effective in other domains, and the next steps involve optimizing trajectory quality through self-reflection mechanisms [22].
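As a rough illustration of the "generate + score" idea mentioned above, the sketch below samples several candidate trajectories, scores each one, and keeps the best. Everything in it is an assumption made for illustration: `sample_candidates` is a random stand-in for a diffusion decoder, the hand-written `score` function is not the article's scorer, and names such as `generate_and_score`, `obstacle`, and `safe_radius` are hypothetical.

```python
import math
import random


def sample_candidates(num_modes, horizon, rng):
    """Hypothetical stand-in for a diffusion decoder.

    Each candidate is a list of (x, y) waypoints that moves forward one
    metre per step with a randomly drawn constant lateral drift.
    """
    candidates = []
    for _ in range(num_modes):
        drift = rng.uniform(-0.5, 0.5)  # lateral offset per step, in metres
        candidates.append([(float(t), drift * t) for t in range(horizon)])
    return candidates


def score(trajectory, obstacle, safe_radius=1.0):
    """Toy scorer (higher is better).

    Adds a large penalty when any waypoint comes within safe_radius of
    the obstacle, plus a small cost for ending far from the lane centre.
    """
    min_dist = min(math.dist(point, obstacle) for point in trajectory)
    collision_penalty = 100.0 if min_dist < safe_radius else 0.0
    lateral_cost = abs(trajectory[-1][1])
    return -(collision_penalty + lateral_cost)


def generate_and_score(num_modes=16, horizon=10, obstacle=(5.0, 0.0), seed=0):
    """Generate num_modes candidates and return the highest-scoring one."""
    rng = random.Random(seed)
    candidates = sample_candidates(num_modes, horizon, rng)
    return max(candidates, key=lambda traj: score(traj, obstacle))


best = generate_and_score()
```

In a real system the sampler would be a learned multi-modal generator and the scorer could itself be learned or rule-based; the separation into a generation step and a selection step is the only part this sketch is meant to show.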

VLA (Vision-Language-Action)
