自动驾驶之心
Four Top Universities Join Forces on OmniVGGT: An Omni-Modal Visual Geometry Transformer!
自动驾驶之心· 2025-11-17 00:05
Core Insights
- The article discusses the need for a "universal multimodal" 3D model, highlighting the limitations of current models that primarily rely on RGB images and fail to utilize additional geometric information effectively [5][6][9].
- The proposed OmniVGGT framework allows flexible integration of any number of auxiliary geometric modalities during training and inference, significantly improving performance across various 3D tasks [6][9][10].

Group 1: Need for Universal Multimodal 3D Models
- Current mainstream 3D models, such as VGGT, can only process RGB images and do not utilize depth or camera parameters, leading to inefficiencies in real-world applications [5].
- OmniVGGT addresses "information waste" and poor adaptability by fully leveraging available auxiliary information, without compromising performance when only RGB input is used [9][10].

Group 2: Core Innovations of OmniVGGT
- OmniVGGT achieves top-tier performance in tasks like monocular/multi-view depth estimation and camera pose estimation, even outperforming existing methods with just RGB input [7][29].
- The framework integrates into vision-language-action (VLA) models, significantly enhancing robotic manipulation tasks [7][29].

Group 3: Technical Components
- The GeoAdapter component injects geometric information (depth, camera parameters) into the base model without disrupting the original feature space, maintaining low computational overhead [10][16].
- A random multimodal fusion strategy is employed during training to ensure the model learns robust spatial representations and does not become overly dependent on auxiliary information [22][23].

Group 4: Experimental Results
- OmniVGGT was trained on 19 public datasets, demonstrating superior performance across multiple 3D tasks, with significant improvements in metrics such as absolute relative error and accuracy [29][30].
- The framework shows that the more auxiliary information is provided, the better the performance, with notable gains in depth estimation and camera pose accuracy [30][34].

Group 5: Practical Implications
- OmniVGGT's design allows flexible input combinations of auxiliary geometric modalities, making it practical for various applications in 3D modeling and robotics [53][54].
- The model's efficiency and speed, requiring only 0.2 seconds for inference, position it as a leading solution in the field [42][40].
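The random multimodal fusion strategy described above can be sketched as per-step modality dropout: each auxiliary modality (depth, camera parameters) is independently kept or dropped during training, so the model cannot over-rely on any single input, while inference uses whatever is provided. A minimal NumPy sketch under assumed names; the function, additive injection, and keep probability are illustrative, not OmniVGGT's actual code:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_modality_fusion(rgb_tokens, aux_tokens, keep_prob=0.5, training=True):
    """Fuse RGB tokens with a random subset of auxiliary modality tokens.

    rgb_tokens : (N, D) array, always present.
    aux_tokens : dict mapping name -> (N, D) array (e.g. 'depth', 'camera').
    During training, each auxiliary modality is independently dropped with
    probability 1 - keep_prob; at inference all provided modalities are used.
    Illustrative sketch, not the OmniVGGT implementation.
    """
    fused = rgb_tokens.copy()
    used = []
    for name, tok in aux_tokens.items():
        if training and rng.random() > keep_prob:
            continue  # drop this modality for this training step
        fused = fused + tok  # additive injection leaves the RGB feature space intact
        used.append(name)
    return fused, used

N, D = 4, 8
rgb = rng.normal(size=(N, D))
aux = {"depth": rng.normal(size=(N, D)), "camera": rng.normal(size=(N, D))}
fused, used = random_modality_fusion(rgb, aux, training=False)
```

Because the auxiliary tokens are added on top of an unchanged RGB stream, the RGB-only path behaves identically whether or not extra modalities exist, matching the article's claim that performance is not compromised when only RGB is available.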
Which Feed-Forward GS Algorithms Could Tesla's 3D Reconstruction Draw On?
自动驾驶之心· 2025-11-17 00:05
Author | 林芝米林@知乎  Editor | 自动驾驶之心
Original link: https://zhuanlan.zhihu.com/p/1926671515228808957
This article is shared for academic purposes only; contact us for removal in case of infringement.
Recommended reading: a recent survey on feed-forward 3D reconstruction, [2507.14501] Advances in Feed-Forward 3D Reconstruction and View Synthesis: A Survey.
I have recently been doing 3D reconstruction with 3DGS. Per-scene optimization methods are inconvenient in practice, so I became interested in feed-forward approaches. The following summarizes some of the latest feed-forward 3DGS research ...
Hand-Building a Full-Stack Autonomous Driving Car in Three Months
自动驾驶之心· 2025-11-17 00:05
Core Viewpoint
- The article announces the launch of the "Black Warrior 001," a comprehensive autonomous driving educational vehicle aimed at research and teaching, now available for pre-sale at a discounted price of 36,999 yuan, including three free courses on model deployment, point cloud 3D detection, and multi-sensor fusion [1].

Group 1: Product Overview
- The Black Warrior 001 is a lightweight solution developed by the Autonomous Driving Heart team, supporting perception, positioning, fusion, navigation, and planning, built on an Ackermann chassis [2].
- The vehicle allows secondary development and modification, with numerous mounting points and interfaces for adding cameras, millimeter-wave radars, and other sensors [3].

Group 2: Target Audience
- The product is suitable for undergraduates' learning and competitions, graduate students' research and publications, and as a teaching tool in university laboratories and vocational training institutions [5].

Group 3: Performance Demonstration
- The vehicle has been tested in indoor, outdoor, and parking garage scenarios, showcasing its perception, positioning, fusion, navigation, and planning capabilities [6].

Group 4: Hardware Specifications
- Key sensors include:
  - 3D LiDAR: Mid 360
  - 2D LiDAR: Lidar from Raystar
  - Depth Camera: Orbbec with IMU
  - Main Control Chip: Nvidia Orin NX 16G
  - Display: 1080p [22][23].
- The vehicle weighs 30 kg, with 50W battery power, a 24V supply voltage, and a maximum speed of 2 m/s [25][26].

Group 5: Software and Functionality
- The software framework includes ROS, C++, and Python, supporting one-click startup and providing a development environment [28].
- The vehicle offers 2D and 3D SLAM, point cloud processing, vehicle navigation, and obstacle avoidance [29].

Group 6: After-Sales Support
- The company offers one year of after-sales support (excluding human-caused damage), with free repairs for damage caused by operational errors or code modifications during the warranty period [51].
Is VLA Hidden Inside FSD v14? Who Is Defining the Next-Generation Autonomous Driving Solution: A Deep Dive into VLA vs. WA...
自动驾驶之心· 2025-11-17 00:05
Core Insights
- The article discusses the rising debate around next-generation solutions for autonomous driving, focusing on VLA (Vision-Language-Action) models and world models [2][3].

Group 1: Event Overview
- A major roundtable discussion is planned, bringing together diverse perspectives from academia and industry to explore VLA and world models in depth [3].
- The event will cover the various forms of world models and VLA, advances in industry applications, and the potential integration of the two concepts [3].

Group 2: Key Speakers
- Xu Lingyun, a PhD from the Chinese Academy of Sciences and a postdoctoral researcher at Carnegie Mellon University, will share insights on intelligent driving algorithms and has led multiple production projects [4].
- Jiang Anqing, a PhD from Waseda University and a senior algorithm scientist at Bosch, will discuss VLA and closed-loop algorithms [5].
- Zhang Zhipeng, founder of AutoLab at Shanghai Jiao Tong University, will contribute as an assistant professor and PhD advisor [5].

Group 3: Discussion Topics
- The event will feature discussions on the essence of world models by NIO's Ren Shaoqing, Tesla's latest FSD (Full Self-Driving) advancements presented at ICCV, and Li Auto's world-model training loop [6].
- Highlights will include the impressive real-world performance of Horizon Robotics' HSD and NVIDIA's latest VLA work [6].

Group 4: Future Directions
- The article raises the question of whether autonomous driving will move toward a unified integration of world models and VLA [11].
- It also addresses the growing demand for data and computing power, which poses challenges for academic participation in autonomous driving research [11].
Fall Recruiting Is Brutal. Hold On, and Better Days Will Come...
自动驾驶之心· 2025-11-15 16:04
Fall recruiting began with the early-batch openings in July, and nearly four months have passed. People are generally anxious: the market really is weak, new graduates are struggling to find jobs, and landing even a barely passable offer counts as a win. But I would argue that the "golden September, silver October" window belongs to the hyper-competitive few, where 20% of candidates take 80% of the offers. The current stage is actually the real battleground for ordinary candidates.

First, keep an eye on the better universities near you. Many companies set up on-site recruiting booths there, so you can get an interview in person. On-site hiring usually moves fast; quite a few companies will extend an offer the same day as the on-site interview.

Acting early is an advantage, and your resume is your confidence. Most importantly, prepare your resume content early; it helps a great deal. Anyone who has applied knows that, without a standout degree, applications to big companies mostly sink without a trace. That is unavoidable, since interviews cost time. So how do you beat competitors with stronger backgrounds (985/C9 universities) and win an offer when your degree does not stand out? The most direct way is to have sufficiently impressive papers and project results on your resume, highlighting your personal strengths and proving your ability to learn and execute.

Click to consult and get matched with an expert mentor. Alibaba Star resume template. If you are in any of the following situations: you want to switch to an algorithm role but lack papers; graduation is approaching and you cannot meet your advisor's requirements. You especially need to push hard at the key milestones, understanding that the final 10% of polish decides 90% of the score gap, and know how to leverage ...
Lou Tiancheng: VLA Can't Help L4
自动驾驶之心· 2025-11-15 16:04
Core Insights
- The article discusses advances in autonomous driving technology, focusing on the transition from Level 2 (L2) to Level 4 (L4) autonomy and emphasizing the complexity and safety challenges involved in achieving L4 [5][19][21].

Group 1: Technological Advancements
- PonyWorld, a world-model technology, enhances the safety of the Robotaxi, making it ten times safer than human drivers [9].
- The cost of the autonomous driving kit has decreased by 70% compared to previous generations, with all components now vehicle-grade [8][30].
- Perception, prediction, and control have been integrated into an end-to-end model, now standard for L4 vehicles and a requirement for L2 vehicles [15][16].

Group 2: Learning Models
- The article highlights two learning modes: imitation learning, which is quick to apply but caps the learner at the teacher's level, and reinforcement learning, which allows exploration and surpassing the teacher [12].
- L4 companies are evolving through reinforcement learning, while L2 remains within the bounds of imitation learning [12][21].

Group 3: Market and Product Development
- The transition to L4 technology for personal vehicles is expected to take longer than anticipated, with significant operational and regulatory challenges still to be addressed [22].
- The Robotaxi fleet has accumulated over 500,000 hours of operation, a significant step toward practical deployment [29].
- The company aims to cut costs through vehicle-grade components and by eliminating the need for human drivers, a major milestone in the development of autonomous vehicles [33].

Group 4: Industry Perspectives
- The article discusses the limitations of Vision-Language-Action (VLA) models in L4 applications, suggesting that specialized models are necessary for the extreme safety requirements of autonomous driving [17].
- The author compares the current state of embodied intelligence to autonomous driving in 2018, indicating a similar need for patience and long-term development [26].
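The imitation-versus-reinforcement distinction above can be made concrete with a toy two-armed bandit: an imitation learner that clones a suboptimal teacher is capped at the teacher's return, while an exploring learner can surpass it. The setup, reward values, and teacher policy here are invented for illustration, not taken from the talk:

```python
import random

random.seed(1)
true_reward = [0.3, 0.9]  # arm 1 is better, but the hypothetical teacher always picks arm 0
teacher_action = 0

# Imitation learning: cloning the teacher caps the return at the teacher's level.
imitation_return = true_reward[teacher_action]

# Reinforcement learning: epsilon-greedy exploration can discover the better arm.
q = [0.0, 0.0]   # running value estimates per arm
counts = [0, 0]
for _ in range(500):
    explore = random.random() < 0.1
    a = random.randrange(2) if explore else q.index(max(q))
    r = true_reward[a]
    counts[a] += 1
    q[a] += (r - q[a]) / counts[a]  # incremental mean update
rl_return = true_reward[q.index(max(q))]
```

Under this setup the imitation learner earns the teacher's 0.3 forever, while the exploring learner converges on the 0.9 arm, which is the sense in which reinforcement learning "allows surpassing the teacher."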
The Potential of Diffusion Language Models Is Severely Underestimated! NUS Finds They Can Comprehensively Outperform Autoregressive Models
自动驾驶之心· 2025-11-15 16:04
Core Insights
- The article discusses the emergence of Diffusion Language Models (DLM) as a new paradigm in language modeling, showcasing their ability to learn more effectively under data constraints than traditional autoregressive (AR) models [2][5][6].

Research Background
- Autoregressive language models are currently the mainstream in large-scale language modeling, but high-quality data has become a significant bottleneck for model scaling [3].
- In scenarios with limited data, the ability to extract more information from each unique token becomes crucial, indicating that data, rather than computation, is the limiting factor [4].

Crossover Phenomenon
- The research defines a "crossover" point where DLMs surpass AR models in performance under limited data conditions, demonstrating approximately three times the data efficiency of AR models [5].
- Factors influencing the timing of this crossover include data quantity and quality, as well as model size [8].

Experimental Results
- Under lower data budgets, DLMs significantly outperform AR models, achieving comparable performance with fewer unique tokens [13].
- Data quality also plays a critical role, with higher-quality data delaying the crossover point for DLMs [16].
- Increasing model size moves the crossover earlier: AR models quickly saturate under data constraints, while DLMs continue to improve at larger sizes [19].

Computational Efficiency
- DLMs consistently outperform AR models across sparsity levels, with denser architectures yielding better performance, especially under data constraints [22].
- The introduction of noise in the training process enhances DLM performance, while AR models struggle to maintain performance under high noise levels [26].

Large-Scale Token Training
- The research validates the crossover phenomenon on large-scale unique-token datasets, particularly in generation tasks, indicating that DLMs retain significant untapped potential even after extensive training [31].
- DLM performance remains robust even under extreme data repetition, suggesting an ability to extract more information from fixed datasets [40].

Overfitting and Model Behavior
- DLMs may overfit when unique data is limited and model size is large, but performance degradation typically occurs later in training [43].
- The absence of strict causal biases in DLMs allows better modeling of complex data patterns, enhancing their learning capabilities [44].

Future Directions
- While DLMs show promise, challenges remain in ensuring data security and privacy, particularly in dense training scenarios, and the architecture is still less mature for practical deployment than AR models [46].
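The data-efficiency argument rests on the supervision structure of the two objectives: an AR model predicts each token exactly once from its left context, while a masked-diffusion model re-samples mask patterns, so one sequence yields many distinct training views. A minimal NumPy sketch of the two loss computations, with a dummy uniform predictor standing in for the network; the 1/t weighting follows the standard masked-diffusion formulation, and all names are illustrative rather than the paper's code:

```python
import numpy as np

rng = np.random.default_rng(2)
vocab, seq_len = 16, 10
tokens = rng.integers(vocab, size=seq_len)

def dummy_logits(visible_mask):
    # Stand-in for a network: uniform prediction regardless of context.
    return np.zeros((seq_len, vocab))

def cross_entropy(logits, targets):
    logp = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -logp[np.arange(len(targets)), targets]

# Autoregressive objective: each position is predicted once, left-to-right.
ar_loss = cross_entropy(dummy_logits(None)[:-1], tokens[1:]).mean()

# Masked-diffusion objective: sample a mask ratio t, hide that fraction of
# tokens, and predict them from the remaining bidirectional context.
def diffusion_loss(tokens):
    t = rng.uniform(0.05, 1.0)          # noise level for this step
    mask = rng.random(seq_len) < t      # which tokens are hidden
    if not mask.any():
        return 0.0
    ce = cross_entropy(dummy_logits(~mask)[mask], tokens[mask])
    return ce.mean() / t                # 1/t weighting from the diffusion ELBO

# Each pass over the same sequence draws a fresh mask pattern, which is the
# intuition behind the higher data efficiency claimed in the article.
dlm_loss = np.mean([diffusion_loss(tokens) for _ in range(100)])
```

With the uniform stand-in predictor, the AR loss is exactly log(vocab) per token; the diffusion loss varies with the sampled mask ratio, illustrating that the objective averages over many corruption levels of the same data.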
一见Auto: Li Auto Internally Held 18 People Accountable for Two Quality Incidents
自动驾驶之心· 2025-11-15 11:58
Core Viewpoint
- The article discusses the internal accountability measures taken by Li Auto in response to two significant quality incidents, detailing the responsibilities of personnel involved in the MEGA battery recall and the L-series lower-arm issues [1][2].

Group 1: Internal Accountability
- Li Auto held 18 employees accountable for insufficient verification of coolant in the MEGA battery recall, with primary responsibility attributed to personnel in R&D operations and materials technology [1].
- Additional accountability was assigned to personnel in electric powertrain and battery testing for inadequate risk assessment of leakage [1].
- Four employees were specifically held accountable for insufficient verification of grease testing related to the L-series lower arm [1].

Group 2: Company Culture and Leadership
- The HR head's direct reporting line to Li Xiang signals a stronger focus on implementing company values and addressing existing issues within the organization [2].
- The core of Li Auto's values emphasizes creating user value through innovative thinking and a commitment to scientific methods, avoiding unethical shortcuts to success [2].
- There has been a notable departure of employees who align with the company's values, leaving a mix of those who remain feeling unproductive and somewhat demoralized [2].

Group 3: Future Outlook
- There is an expectation that Li Auto's values will eventually be restored, although the timeline remains uncertain [2].
- Li Xiang's strong motivation and unique shareholding structure provide a foundation for overcoming challenges and driving the company forward despite setbacks [2].
Recruiting Partners in Autonomous Driving Product Management / Reinforcement Learning!
自动驾驶之心· 2025-11-15 03:03
Core Viewpoint
- The article emphasizes the need for deeper technical exploration and collaboration in the autonomous driving industry, highlighting the importance of addressing the sector's challenges and pain points [2].

Group 1: Industry Direction
- The main focus areas include, but are not limited to: autonomous driving product management, 4D annotation/data loops, world models, VLA, autonomous driving large models, reinforcement learning, and end-to-end systems [4].

Group 2: Job Description
- The positions primarily target autonomous driving training collaborations, serving both B-end (enterprises, universities, research institutes) and C-end (students, job seekers) audiences through training, course development, and original article creation [5].

Group 3: Collaboration Invitation
- The industry is calling for more talented individuals to join and contribute to the advancement of autonomous driving technology [3].

Group 4: Contact Information
- For further discussion of compensation and collaboration arrangements, interested parties are encouraged to add the WeChat contact provided [6].
A 10,000-Word Deep Dive into the Latest Progress in Multimodal Large Models (Modality Bridging)
自动驾驶之心· 2025-11-15 03:03
Core Insights
- The article discusses the emergence of Multimodal Large Language Models (MLLMs) as a significant research focus, highlighting their capabilities in multimodal tasks such as story generation from images and OCR-free mathematical reasoning, indicating a potential pathway toward general artificial intelligence [2][4].

Group 1: MLLM Architecture and Training
- MLLMs typically undergo large-scale pre-training on paired data to align different modalities, using datasets such as image-text pairs or automatic speech recognition (ASR) datasets [2].
- The Perceiver Resampler module maps variable-sized spatiotemporal visual features from a vision encoder to a fixed number of visual tokens, reducing the computational complexity of visual-text cross-attention [6][8].
- The training process follows a two-phase strategy: the first phase focuses on visual-language representation learning from frozen image encoders, while the second guides visual-to-language generation learning from frozen LLMs [22][24].

Group 2: Instruction Tuning and Data Efficiency
- Instruction tuning is crucial for enhancing the model's ability to follow user instructions, with learned queries introduced to interact with both visual and textual features [19][26].
- The article emphasizes the importance of diverse, high-quality instruction data to improve model performance across tasks including visual question answering (VQA) and OCR [44][46].
- Data-efficiency experiments indicate that reducing the training dataset size can still maintain high performance, suggesting room for further improvements in data utilization [47].

Group 3: Model Improvements and Limitations
- LLaVA-NeXT shows improvements in reasoning, OCR, and world knowledge, surpassing previous models on several benchmarks [40].
- Despite these advances, limitations remain, such as the model's inability to handle multiple images effectively and the risk of generating hallucinations in critical applications [39][46].
- The article discusses the need for efficient sampling methods and the balance between data annotation quality and model processing capacity to mitigate hallucinations [48].
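The Perceiver Resampler idea described above can be sketched as cross-attention from a fixed set of learned latent queries over a variable-length sequence of visual features: however many patches come in, exactly K tokens come out. A minimal single-head NumPy version; the dimensions, random stand-in weights, and omission of layer norm and feed-forward blocks are simplifications, not the module's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(3)
D, K = 32, 8  # feature dim, fixed number of output visual tokens

# Learned latent queries (random stand-ins here); the visual features they
# attend to can have any length N, but the output always has K tokens.
latents = rng.normal(size=(K, D))
Wq, Wk, Wv = (rng.normal(size=(D, D)) / np.sqrt(D) for _ in range(3))

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def perceiver_resample(visual_feats):
    """Cross-attention: K latent queries attend over N visual features,
    compressing them to a fixed (K, D) token grid."""
    q = latents @ Wq                       # (K, D)
    k = visual_feats @ Wk                  # (N, D)
    v = visual_feats @ Wv                  # (N, D)
    attn = softmax(q @ k.T / np.sqrt(D))   # (K, N) attention weights
    return attn @ v                        # (K, D), independent of N

short_clip = rng.normal(size=(50, D))   # e.g. patches from a few frames
long_clip = rng.normal(size=(400, D))   # many more patches, same output size
```

Because the downstream cross-attention only ever sees K tokens, its cost no longer scales with the number of input patches, which is the complexity reduction the article refers to.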