ColaVLA
AI Day Livestream | Tsinghua's ColaVLA: A Hierarchical Parallel VLA Framework with Cognitive Latent Reasoning
自动驾驶之心· 2026-01-13 06:14
Core Insights
- The article covers advances in autonomous driving technology, focusing on ColaVLA, a new framework that leverages cognitive latent reasoning for hierarchical parallel trajectory planning [3][7].

Group 1: Technology Overview
- ColaVLA is an efficient vision-language-action framework for trajectory planning in autonomous driving that compresses traditional text-based reasoning into a compact latent space for decision-making [7].
- The framework employs a causally consistent hierarchical parallel decoder to generate multi-scale trajectories in a single forward pass, substantially improving inference efficiency while preserving interpretability (a toy comparison of parallel vs. autoregressive decoding follows this summary) [7].
- Experimental results indicate that ColaVLA achieves superior open-loop and closed-loop performance on the nuScenes dataset, with a 5-10x reasoning speedup over text-based VLM planning methods [7][9].

Group 2: Challenges and Solutions
- Current VLM-based planners face three core challenges: the mismatch between discrete text reasoning and continuous control, high latency from autoregressive chain-of-thought decoding, and planner designs that are inefficient or non-causal, all of which limit real-time deployment [3].
- ColaVLA addresses these challenges through cognitive latent reasoning spanning scene understanding, target identification, latent rethinking, and decision generation [3].

Group 3: Live Event and Expert Insights
- The article promotes a live session featuring Peng Qihang from Tsinghua University, who will explain the ColaVLA framework and its implications for autonomous driving [4][9].
- The live session will cover the transition from explicit text reasoning to cognitive latent reasoning, the hierarchical parallel planner, and how autoregressive text decoding is avoided [9].
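Neither article includes code, but the latency claim in Group 1 is easier to see with a toy contrast between the two decoding styles. The PyTorch sketch below is purely illustrative and not the authors' implementation: the module names, dimensions, and the choice of three trajectory scales are assumptions. It puts a decoder that emits waypoints at several temporal scales in one forward pass next to a baseline that decodes one waypoint per step, which is where autoregressive (text-style) decoding accumulates latency.

```python
# Hypothetical sketch: multi-scale trajectories in one forward pass vs.
# token-by-token autoregressive decoding. Names and dims are illustrative only.
import torch
import torch.nn as nn

class ParallelMultiScaleDecoder(nn.Module):
    """Maps a compact latent decision vector to waypoints at several horizons."""
    def __init__(self, latent_dim=256, scales=(2, 4, 8)):
        super().__init__()
        # One linear head per temporal scale; each predicts (x, y) per waypoint.
        self.heads = nn.ModuleList([nn.Linear(latent_dim, n * 2) for n in scales])
        self.scales = scales

    def forward(self, z):                        # z: (B, latent_dim)
        # All scales are produced in a single forward pass (no recurrence).
        return [h(z).view(-1, n, 2) for h, n in zip(self.heads, self.scales)]

class AutoregressiveDecoder(nn.Module):
    """Baseline: emits one waypoint per step, conditioned on previous outputs."""
    def __init__(self, latent_dim=256):
        super().__init__()
        self.cell = nn.GRUCell(2, latent_dim)
        self.out = nn.Linear(latent_dim, 2)

    def forward(self, z, steps=8):
        wp, traj, h = torch.zeros(z.size(0), 2), [], z
        for _ in range(steps):                   # sequential: latency grows with steps
            h = self.cell(wp, h)
            wp = self.out(h)
            traj.append(wp)
        return torch.stack(traj, dim=1)          # (B, steps, 2)

z = torch.randn(1, 256)                          # stand-in for a latent decision token
coarse, mid, fine = ParallelMultiScaleDecoder()(z)
ar_traj = AutoregressiveDecoder()(z)
```

The point of the contrast: the parallel heads cost a fixed number of matrix multiplies regardless of horizon length, while the sequential loop (and, worse, per-token text generation in a VLM) scales linearly with the number of decoded steps.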
DiDi Has Been Accelerating Lately! ColaVLA: A Hierarchical Parallel VLA Framework with Cognitive Latent Reasoning (Tsinghua, CUHK & DiDi)
自动驾驶之心· 2025-12-30 09:20
Core Insights
- The article discusses ColaVLA, a unified vision-language-action framework for autonomous driving that improves trajectory planning by combining cognitive latent reasoning with hierarchical parallel planning [4][10][50].

Group 1: Background and Challenges
- Traditional autonomous driving stacks separate perception, prediction, and planning into distinct modules, whereas recent end-to-end (E2E) systems integrate these tasks into a unified learning pipeline [3][6].
- Vision-language models (VLMs) are increasingly integrated into autonomous driving systems to inject cross-modal priors and world knowledge, but they face three core challenges: modal mismatch, high latency from autoregressive reasoning, and inefficient planner design [7][9].

Group 2: ColaVLA Framework
- ColaVLA shifts the reasoning process from explicit text-based chains into a unified latent space and pairs it with a hierarchical parallel trajectory decoder [10][18].
- The cognitive latent reasoning component completes scene understanding and decision-making in two forward passes, distilling decision-relevant information from the multimodal inputs (a minimal sketch of this two-pass scheme follows this summary) [11][21].
- The hierarchical parallel planner generates multi-scale trajectories in a single forward pass, preserving causal structure while sharply reducing inference latency [12][28].

Group 3: Experimental Results
- ColaVLA achieves state-of-the-art open-loop performance on the nuScenes benchmark, with the lowest average L2 error (0.30 m) and a 0.23% collision rate, outperforming existing action-based methods (the metric is sketched after this summary) [37][38].
- In closed-loop evaluation, ColaVLA reaches a NeuroNCAP score of 3.48 and cuts the average collision rate from 65.1% to 36.8%, a substantial safety improvement [39][40].
- The framework runs more than five times faster than text-based autoregressive baselines, demonstrating both efficiency and robustness [40][41].
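The two-forward-pass description in Group 2 can be pictured as a small transformer that first condenses fused scene tokens into a few learned latent "thought" tokens, then reruns attention over the scene plus those latents to read out a single decision vector for the planner. This is a minimal sketch under assumed names and sizes, not ColaVLA's actual architecture; the stages the article lists (scene understanding, target identification, latent rethinking, decision generation) are only approximated here.

```python
# Hypothetical sketch of "reason in latent space with two forward passes":
# pass 1 summarizes the scene into a few learned latent tokens, pass 2 reuses
# them to produce a decision latent. Module names and sizes are assumptions.
import torch
import torch.nn as nn

class LatentReasoner(nn.Module):
    def __init__(self, dim=256, num_latents=4, layers=2):
        super().__init__()
        enc = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(enc, num_layers=layers)
        self.latent_queries = nn.Parameter(torch.randn(1, num_latents, dim))
        self.decision_query = nn.Parameter(torch.randn(1, 1, dim))

    def forward(self, scene_tokens):                 # (B, N, dim) fused vision/text tokens
        B = scene_tokens.size(0)
        # Pass 1: scene understanding -> a handful of latent "thought" tokens.
        x = torch.cat([scene_tokens, self.latent_queries.expand(B, -1, -1)], dim=1)
        thoughts = self.backbone(x)[:, -self.latent_queries.size(1):]
        # Pass 2: "rethink" with those latents and read out one decision vector.
        y = torch.cat([scene_tokens, thoughts, self.decision_query.expand(B, -1, -1)], dim=1)
        return self.backbone(y)[:, -1]               # (B, dim) decision latent for the planner

decision = LatentReasoner()(torch.randn(2, 32, 256)) # toy fused multimodal tokens
```

Because the output of both passes stays in the latent space, no intermediate text is ever decoded, which is the mechanism behind the latency savings the article describes.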
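The open-loop numbers in Group 3 (average L2 error and collision rate) follow the standard nuScenes planning protocol, which compares predicted ego waypoints against the logged future trajectory at 1 s, 2 s, and 3 s horizons. Exact averaging conventions differ between papers, so the snippet below is a hedged sketch of the L2 part only, with made-up toy trajectories.

```python
# Hedged sketch of the open-loop L2 metric behind "average L2 error of 0.30 m":
# compare predicted ego waypoints with the logged future trajectory and average
# the Euclidean error at the 1s/2s/3s horizons. Arrays are made up for illustration.
import numpy as np

def avg_l2(pred, gt, hz=2):
    """pred, gt: (T, 2) BEV waypoints sampled at `hz` Hz; returns mean L2 in meters."""
    err = np.linalg.norm(pred - gt, axis=-1)             # per-waypoint L2 (meters)
    horizons = [int(t * hz) - 1 for t in (1, 2, 3)]      # indices of the 1s/2s/3s waypoints
    return float(np.mean(err[horizons]))

pred = np.cumsum(np.full((6, 2), [1.0, 0.02]), axis=0)   # toy 3 s plan at 2 Hz
gt = np.cumsum(np.full((6, 2), [1.0, 0.00]), axis=0)     # toy logged ego trajectory
print(f"average L2: {avg_l2(pred, gt):.2f} m")
```

The collision rate reported alongside it is computed differently (by checking overlap between the planned ego box and logged obstacle boxes at each horizon) and is not reproduced here.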