自动驾驶之心
Recommending a "Private Kitchen" for Large Model AI!
自动驾驶之心· 2025-08-23 16:03
Group 1
- The article emphasizes the growing interest in large model technologies, particularly RAG (Retrieval-Augmented Generation), AI Agents, multimodal large models (pre-training, fine-tuning, reinforcement learning), and deployment and inference optimization [1]
- A community named "Large Model Heart Tech" is being established to focus on these technologies and aims to become the largest domestic community for large model technology [1]
- The community is also creating a knowledge platform to provide industry and academic information and to cultivate talent in the field of large models [1]

Group 2
- The article describes the community as a serious content-driven platform aimed at nurturing future leaders [2]
Toward Production-Ready VLA! FastDriveVLA: A Plug-and-Play Pruning Module with Nearly 4x Inference Speedup
自动驾驶之心· 2025-08-23 16:03
Core Viewpoint
- The article discusses the development of FastDriveVLA, a novel visual token pruning framework designed for autonomous driving, achieving a 50% compression rate while maintaining 97.3% performance [3][13][43]

Group 1: End-to-End Autonomous Driving
- Recent advances in end-to-end autonomous driving research have led to the adoption of vision-language-action (VLA) models, which outperform traditional modular approaches in complex scene understanding and decision-making [3][10]
- The VLA model integrates perception, action generation, and planning into a single framework, reducing information loss between modules [3][4]

Group 2: Visual Token Pruning Techniques
- Existing VLM/VLA models face high computational costs because images are encoded into large numbers of visual tokens, prompting research into visual token pruning methods [4][11]
- The two primary approaches to visual token pruning, attention-based and similarity-based, both have limitations in driving tasks [4][14]
- FastDriveVLA introduces a reconstruction-based visual token pruning framework that focuses on retaining tokens for the foreground areas critical to driving decisions [5][13]

Group 3: FastDriveVLA Framework
- FastDriveVLA employs a plug-and-play pruner called ReconPruner, trained with a pixel reconstruction task to emphasize foreground information [6][17]
- The framework includes an adversarial foreground-background reconstruction strategy to sharpen the model's ability to distinguish foreground from background tokens [20][21]
- A large-scale dataset, nuScenes-FG, containing 241,000 image-mask pairs, was constructed to train ReconPruner for effective foreground segmentation [6][12][13]

Group 4: Experimental Results
- FastDriveVLA achieved state-of-the-art results on the nuScenes closed-loop planning benchmark, demonstrating its effectiveness and practicality [13][28]
- The framework was evaluated under various pruning ratios (25%, 50%, 75%), consistently outperforming existing methods on key metrics such as L2 error and collision rate [30][34]
- Efficiency analysis showed that FastDriveVLA significantly reduces FLOPs and CUDA latency compared with other methods, improving real-time deployment capability [36][40]

Group 5: Contributions and Implications
- FastDriveVLA offers a new paradigm for efficient inference in VLA models and insight into task-specific token pruning strategies [43]
- The research highlights the importance of focusing on foreground information in autonomous driving tasks, which can improve performance while reducing computational cost [5][43]
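The core pruning operation summarized above, scoring each visual token and keeping only the fraction most relevant to the driving foreground, can be sketched in a few lines. This is a generic top-k illustration of score-based token pruning, not ReconPruner itself; the function name, tensor shapes, and the use of raw per-token scores are assumptions for illustration.

```python
import torch

def prune_visual_tokens(tokens: torch.Tensor, scores: torch.Tensor, keep_ratio: float = 0.5):
    """Keep the highest-scoring visual tokens.

    tokens: (B, N, D) visual token embeddings
    scores: (B, N) per-token relevance scores (e.g. a pruner's foreground score)
    Returns (B, K, D) with K = int(N * keep_ratio).
    """
    B, N, D = tokens.shape
    k = max(1, int(N * keep_ratio))
    idx = scores.topk(k, dim=1).indices            # (B, K) indices of the kept tokens
    idx = idx.sort(dim=1).values                   # restore original spatial order
    return tokens.gather(1, idx.unsqueeze(-1).expand(B, k, D))

tokens = torch.randn(2, 16, 8)                     # 16 tokens per image
scores = torch.randn(2, 16)
kept = prune_visual_tokens(tokens, scores, keep_ratio=0.5)
print(kept.shape)  # torch.Size([2, 8, 8])
```

At a 50% keep ratio the LLM backbone then attends over half as many visual tokens, which is where the FLOPs and latency savings reported above come from.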
Helped Another Student Land an Autonomous Driving Algorithm Role...
自动驾驶之心· 2025-08-23 14:44
Core Viewpoint
- The article emphasizes continuous learning and adaptation in autonomous driving, particularly amid industry shifts toward intelligent models and large models, while highlighting the value of community support for knowledge sharing and job opportunities [1][2]

Group 1: Community and Learning Resources
- The "Autonomous Driving Heart Knowledge Planet" is a comprehensive community platform integrating video, text, learning paths, Q&A, and job exchange, aiming to grow from over 4,000 to nearly 10,000 members in two years [1][2]
- The community provides practical solutions for topics such as entry points for end-to-end models, learning paths for multimodal large models, and engineering practices for data closed-loop 4D annotation [2][3]
- Members have access to over 40 technical routes, including industry applications, VLA benchmarks, and beginner learning routes, significantly reducing the time spent searching for relevant information [2][3]

Group 2: Job Opportunities and Networking
- The community has established internal referral mechanisms with multiple autonomous driving companies, so members can submit applications and resumes directly to desired companies [7]
- Regular job sharing and updates on open positions create a complete ecosystem for autonomous driving professionals [15][30]

Group 3: Technical Learning and Development
- The community offers a well-structured technical stack and roadmap for beginners, covering mathematics, computer vision, deep learning, and programming [11][32]
- Learning routes for advanced topics, including end-to-end autonomous driving, 3DGS principles, and multimodal large models, cater to both newcomers and experienced professionals [16][34][40]
- The platform also hosts live sessions with industry leaders, providing insight into cutting-edge research and practical applications in autonomous driving [58][66]
A Look at Cross-Attention Mechanisms in Multimodal Models
自动驾驶之心· 2025-08-22 16:04
Core Insights
- The article discusses the significance of cross-attention in multimodal tasks, arguing that simply concatenating features from different modalities is insufficient; instead, one modality should interactively query another for relevant contextual information [1][2]

Summary by Sections
1. Position of Cross-Attention in Multimodal Tasks
- Cross-attention allows one modality to actively query another, enhancing the interaction between data types such as text and images [1]
2. Common Design Approaches
- **Single-direction cross-attention**: only one modality updates while the other remains static; suitable for information retrieval tasks [2][3]
- **Co-attention**: both modalities update by querying each other; commonly used in Visual Question Answering (VQA) [4][6]
- **Alternating cross-attention layers**: multiple rounds of querying between modalities deepen the interaction but increase computational load [9]
- **Hybrid attention**: combines self-attention within each modality with cross-attention between modalities, as seen in advanced multimodal Transformers [12]
3. Design Considerations
- **Feature alignment**: modalities often have inconsistent feature dimensions, requiring linear projection to a unified dimension [13]
- **Query and key/value selection**: which modality acts as the query and which as the key/value depends on the task requirements [14]
- **Fusion strategies**: features from different modalities can be merged by concatenation, weighted sums, or mapping into a shared latent space [20]
4. Practical Implementation
- The article provides a PyTorch example of cross-attention, showing how to structure the model and handle input data [18][19]
5. Experience Summary
- Recommendations include single-direction attention for lightweight tasks and more complex designs for deep reasoning tasks, with emphasis on feature alignment and attention masking to avoid noise [37]
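The single-direction design summarized above, where one modality queries another and only the query side updates, can be sketched with PyTorch's built-in `nn.MultiheadAttention`. This is a minimal illustration, not the article's exact code; the module name, dimensions, and the residual-plus-LayerNorm wiring are assumptions.

```python
import torch
import torch.nn as nn

class SingleDirectionCrossAttention(nn.Module):
    """Text features query image features; only the text side is updated."""

    def __init__(self, text_dim: int, image_dim: int, d_model: int = 256, n_heads: int = 8):
        super().__init__()
        # Feature alignment: project both modalities to a shared dimension
        self.q_proj = nn.Linear(text_dim, d_model)
        self.kv_proj = nn.Linear(image_dim, d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, text_feats, image_feats, image_mask=None):
        q = self.q_proj(text_feats)                       # (B, Lt, d_model)
        kv = self.kv_proj(image_feats)                    # (B, Li, d_model)
        # key_padding_mask drops padded image tokens, the "attention masking" point above
        out, _ = self.attn(q, kv, kv, key_padding_mask=image_mask)
        return self.norm(q + out)                         # residual + norm; text side updates

module = SingleDirectionCrossAttention(text_dim=300, image_dim=512)
fused = module(torch.randn(2, 12, 300), torch.randn(2, 49, 512))
print(fused.shape)  # torch.Size([2, 12, 256])
```

Co-attention would add a mirrored branch with the roles of query and key/value swapped; alternating layers would stack several such blocks.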
ICCV'25! Tsinghua's GS-Occ3D: Scalable Vision-Only Occupancy Reconstruction, a New Paradigm for Auto-Labeling
自动驾驶之心· 2025-08-22 16:04
Core Viewpoint
- The article discusses GS-Occ3D, a new paradigm for occupancy grid reconstruction from pure vision, which addresses the high cost and limited scalability of traditional LiDAR-based methods in autonomous driving [3][10]

Group 1: Research Motivation and Contributions
- Existing occupancy grid labeling methods rely heavily on LiDAR, which requires expensive specialized mapping vehicles and limits scalability [6]
- GS-Occ3D introduces a low-cost, scalable occupancy labeling framework that can exploit large amounts of crowd-sourced data from consumer vehicles [7]
- The method achieves state-of-the-art (SOTA) geometric reconstruction on the Waymo dataset and superior zero-shot generalization on the Occ3D-nuScenes dataset [10][36]

Group 2: Methodology Overview
- GS-Occ3D employs an octree-based Gaussian surface representation to optimize explicit geometry, enabling low-cost, efficient large-scale automatic labeling [10][13]
- The pipeline generates sparse point clouds and ground surface elements from panoramic street views, then a label-generation workflow densifies the point clouds and explicitly handles occlusions [13][32]
- The resulting pure visual labels can train downstream occupancy grid models that generalize to unseen scenarios and exhibit geometric reasoning [13][10]

Group 3: Quantitative Results
- The method achieved a Chamfer Distance (CD) of 0.56 and a Peak Signal-to-Noise Ratio (PSNR) of 26.89 on the Waymo dataset, outperforming several existing methods [15]
- On the Occ3D-Val (Waymo) dataset, the method reached an Intersection over Union (IoU) of 44.7 and an F1 score of 61.8, indicating competitive performance [16]
- Its zero-shot generalization ability was highlighted, with better performance in complex scenarios than LiDAR-based methods [24][32]

Group 4: Advantages of the Pure Vision Method
- The pure vision approach offers broader coverage than LiDAR, especially over large areas, and can outperform LiDAR in specific scenarios such as reconstructing tall buildings [32]
- Models trained with pure vision labels generalize across a wider range of geometries [32]
- The method provides rich semantic information at lower cost, reconstructing 3D labels with up to 66 categories versus only 16 in Occ3D [32][33]

Group 5: Challenges and Limitations
- Inherent limits of camera coverage, such as the lack of rear visibility in the Waymo dataset, cause unavoidable information loss [34]
- Performance can be significantly affected by lighting conditions, particularly at night or under exposure anomalies [34]
- The method may struggle when the vehicle is stationary in static scenes, requiring prior knowledge for effective geometric reconstruction [34]
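For reference, the Chamfer Distance (CD) metric reported above measures how closely a reconstruction matches ground-truth geometry by averaging nearest-neighbor distances between the two point sets in both directions. A minimal sketch follows; the paper's exact CD variant (squared versus unsquared distances, normalization) may differ.

```python
import torch

def chamfer_distance(p: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
    """Symmetric Chamfer Distance between point sets p (N, 3) and q (M, 3)."""
    d = torch.cdist(p, q)                              # (N, M) pairwise Euclidean distances
    # For each point, find its nearest neighbor in the other set; average both directions
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()

pred = torch.tensor([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
gt = torch.tensor([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
print(chamfer_distance(pred, gt).item())  # 0.0 for identical point sets
```

Lower is better: a CD of 0.56, as reported for GS-Occ3D on Waymo, means reconstructed points sit close to the ground-truth surface on average.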
The 自动驾驶之心 VLA Technical Exchange Group Is Now Open (Data / Models / Deployment)
自动驾驶之心· 2025-08-22 16:04
The 自动驾驶之心 large-model VLA technical exchange group has been established. Everyone is welcome to join and discuss VLA-related topics, including VLA dataset construction, one-stage VLA, hierarchical VLA, end-to-end solutions based on large models, VLM+DP-based solutions, production deployment, and job hunting. Interested students are welcome to add the assistant on WeChat to join: AIDriver005, with the note "nickname + VLA group".
Bosch Wins an Intelligent Driving Design Contract from a New Energy Heavy Truck Maker
自动驾驶之心· 2025-08-22 16:04
Core Viewpoint
- Bosch has secured a significant intelligent driving systems order from a leading new energy heavy truck manufacturer, marking a major breakthrough in the commercial vehicle sector [5][6]

Group 1: Order Acquisition
- The contract covers three major vehicle platforms and over a hundred domestic and international models, including tractors, mixers, and dump trucks [5]
- The competitive bidding process was intense, with many domestic companies participating, and Bosch won despite an initial disadvantage [5]

Group 2: Market Dynamics
- Demand for intelligent driving systems in commercial vehicles has surged, driven by policy changes mandating Advanced Emergency Braking (AEB) systems [6]
- Competition is fierce for low-level Advanced Driver Assistance Systems (ADAS), while high-level systems such as Navigation on Automated Driving (NOA) remain less contested due to higher technical barriers [6]
Not Sure How to Start a VLA Paper? Some Students Have Already Published at CCF-A Venues...
自动驾驶之心· 2025-08-22 12:00
Core Insights
- The article discusses advances in the Li Auto VLA driver model, highlighting improved semantic understanding, reasoning, and trajectory planning, all crucial for autonomous driving [1][3][5]

Group 1: VLA Model Capabilities
- The VLA model demonstrates enhanced semantic understanding through multimodal input, improved reasoning via chains of thought, and trajectory planning that more closely approximates human driving intuition [1]
- Four core abilities are showcased: spatial understanding, reasoning, communication and memory, and behavioral ability [1][3]

Group 2: Research and Development Trends
- The VLA model evolved from VLM+E2E, integrating cutting-edge techniques such as end-to-end learning, trajectory prediction, visual language models, and reinforcement learning [5]
- While industry continues to optimize traditional perception and planning tasks, academia is increasingly shifting toward large models and VLA, leaving many subfields open for exploration [5]

Group 3: VLA Research Guidance Program
- A VLA research paper guidance program has been launched to positive feedback, aiming to help participants systematically grasp key theory and develop their own research ideas [6]
- The structured 14-week curriculum spans topics from traditional end-to-end autonomous driving to methodologies for writing research papers [9][11][30]

Group 4: Course Structure and Requirements
- Each session is capped at 8 participants and targets individuals at various academic levels with a background in VLA and autonomous driving [12][15]
- Participants should have foundational knowledge of deep learning, Python programming skills, and familiarity with PyTorch; specific hardware is recommended for optimal performance [21][22]

Group 5: Expected Outcomes
- Participants will study classic and cutting-edge research papers, build coding skills, and learn methodologies for writing and submitting papers, culminating in a draft paper [20][34]
- The program aims to deepen participants' understanding of algorithms and their trade-offs and to stimulate research ideas through structured guidance [20][34]
University PhD Enrollment Up by 600,000 as Settling-In Allowances Are Phased Out...
自动驾驶之心· 2025-08-22 12:00
Core Viewpoint
- The article discusses a 600,000 increase in the number of PhD students while job opportunities have not grown correspondingly, and notes that the settling-in allowance for new PhD hires at universities will be gradually phased out by 2025 in favor of more targeted "achievement subsidies" and "high-level talent subsidies" [1]

Group 1: PhD Employment and Funding Trends
- The settling-in allowance is not being eliminated entirely but will be disbursed in a more phased, targeted, "high-quality" manner [1]
- The article stresses improving research output and competitiveness, suggesting students aim to publish at top conferences to secure funding [1]

Group 2: Research Guidance Services
- The company 自动驾驶之心 offers comprehensive one-on-one guidance for students, from topic selection through submission and revision, reporting a 96% acceptance rate for its students [1]
- The service addresses common problems such as lack of guidance, fragmented knowledge, and difficulties in the research process [5]

Group 3: Target Audience for Services
- The services are tailored to computer science students seeking to improve research ability, gain experience, and strengthen their academic profiles for career advancement [10]
- The program also suits those studying or working in artificial intelligence who want to boost their competitiveness in the job market [10]

Group 4: Service Features and Benefits
- Guidance includes personalized mentorship, real-time interaction with tutors, and unlimited access to recorded sessions, accommodating various publication goals [9]
- Based on performance, students can receive recommendations from prestigious institutions and potential job placements at leading companies [16]
From Early 2D Approaches to Today's VLA Framework: How Have Generations of Autonomous Driving Pipelines Evolved?
自动驾驶之心· 2025-08-22 04:00
Core Viewpoint
- The article emphasizes creating an engaging learning environment in autonomous driving and AI, bridging the gap between industry and academia while providing resources for career development and technical knowledge sharing [1][3]

Group 1: Community and Resources
- The "Autonomous Driving Heart Knowledge Planet" has evolved through multiple iterations into a comprehensive platform for academic and industry exchange, including job opportunities and technical discussion [1]
- The community has compiled over 40 technical routes and resources, significantly reducing the time needed to find information in the autonomous driving sector [1]
- Members include people from renowned universities and leading autonomous driving companies, fostering a rich environment for knowledge sharing [12]

Group 2: Technical Learning and Development
- A structured learning path for newcomers covers foundational mathematics, computer vision, and deep learning, along with practical programming skills [12][20]
- Learning routes such as end-to-end learning, multimodal large models, and simulation frameworks cater to different levels of expertise [12][34]
- The platform provides access to numerous open-source projects and datasets relevant to autonomous driving, supporting practical learning and application [30][32]

Group 3: Job Opportunities and Networking
- A job referral mechanism with multiple autonomous driving companies connects job seekers directly with employers [6]
- Regular job postings and internship sharing keep members informed about the latest openings in the industry [11][22]
- Members can discuss career choices and research directions with experienced professionals in the field [89]

Group 4: Technical Discussions and Innovations
- The community hosts discussions on cutting-edge topics such as VLA (vision-language-action) models, world models, and diffusion models, keeping members up to date on the latest advances [44][48]
- Regular live sessions with industry experts cover new technologies and methodologies in autonomous driving [85]
- The platform encourages collaboration and knowledge exchange, aiming to cultivate future leaders in the autonomous driving industry [3]