自动驾驶之心
Minute-long video generation! Horizon Robotics' Epona: an autoregressive diffusion world model for end-to-end autonomous driving (ICCV'25)
自动驾驶之心· 2025-07-07 12:17
Core Insights
- The article presents Epona, an autoregressive diffusion world model for autonomous driving that combines the strengths of diffusion models and autoregressive models to support long video generation, trajectory control, and real-time motion planning within a single framework [2][33].
Group 1: Research Motivation
- World models are attracting growing interest as a key technology for simulating physical environments and helping agents plan and make decisions, particularly in highly dynamic, complex tasks such as autonomous driving [6].
- Current world model architectures have significant limitations, particularly in delivering high-quality long-horizon prediction and real-time motion planning [7].
Group 2: Innovations of Epona
- Epona introduces two key innovations: decoupled spatiotemporal modeling, which separates temporal dynamics from fine-grained generation of the future world, and modular trajectory and video prediction, which lets motion planning and visual modeling be integrated seamlessly [2][19].
- The model uses a new "chain-of-forward training strategy" to counter error accumulation across autoregressive steps while still achieving high-resolution, long-duration generation [2][23].
Group 3: Performance Metrics
- Epona improves FVD by 7.4% over existing methods and can predict over horizons of several minutes [2][26].
- In experiments, Epona generates high-quality driving videos longer than 2 minutes (600 frames), significantly outperforming other state-of-the-art models [26].
Group 4: Comparison with Existing Models
- Epona's design contrasts with existing models that either lack a planning module or are limited to low resolution and short-horizon generation [9][31].
- The article compares Epona's metrics with those of other models and reports significant advantages in both video length and quality [29][30].
Group 5: Future Implications
- The advances in Epona could pave the way for the next generation of end-to-end autonomous driving systems, reducing reliance on complex perception modules and expensive labeled data [6][33].
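Why error accumulation matters for long autoregressive generation can be seen in a toy rollout. The sketch below is not Epona's model and does not implement its chain-of-forward strategy; `true_step` and `model_step` are hypothetical scalar dynamics, chosen only to show how a small per-step error compounds once predictions are fed back as inputs.

```python
from typing import Callable, List


def rollout(step: Callable[[float], float], x0: float, n_steps: int) -> List[float]:
    """Autoregressive rollout: each output becomes the next input."""
    xs = [x0]
    for _ in range(n_steps):
        xs.append(step(xs[-1]))
    return xs


def true_step(x: float) -> float:
    # Hypothetical ground-truth dynamics (assumption for illustration).
    return 1.01 * x


def model_step(x: float) -> float:
    # Hypothetical learned model with a small per-step error (assumption).
    return 1.012 * x


if __name__ == "__main__":
    truth = rollout(true_step, 1.0, 100)
    pred = rollout(model_step, 1.0, 100)
    # Error when the model is fed its own predictions for 100 steps.
    rollout_err = abs(pred[-1] - truth[-1])
    # Error of a single step taken from the ground-truth state.
    one_step_err = abs(model_step(truth[-2]) - truth[-1])
    print(f"one-step error: {one_step_err:.4f}, rollout error: {rollout_err:.4f}")
```

In this toy setting the self-fed rollout drifts far more than any single step does, which is the gap that training strategies for long-horizon autoregressive generation try to close.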
Didi autonomous driving perception algorithm: first-round interview notes
自动驾驶之心· 2025-07-07 12:17
Core Viewpoint
- Didi has a strong technical foundation in autonomous driving, particularly in perception algorithms, and is a key employer for anyone pursuing a career in this field [2].
Group 1: Interview Process
- The interview process for the perception algorithm position at Didi consists of three technical rounds that focus on project details and technical principles [2].
- Candidates should know every detail on their resume thoroughly, because interviewers may probe any of it in depth [2].
Group 2: Technical Questions
- The first round opens with a self-introduction and targeted questions about the candidate's research output and direction [3].
- Candidates are asked to explain the core innovations of their papers, which leads into a discussion of 2D object detection [4].
- The evolution of 2D object detection from traditional methods to deep learning is a key topic [5].
- Understanding of anchor-free detection is assessed, specifically the core pipeline of the FCOS algorithm [6].
- Candidates are asked how familiar they are with end-to-end detection algorithms, reflecting recent developments in the field [7].
- How DETR achieves end-to-end object detection is explored in depth [8].
Group 3: Project Experience
- Candidates are expected to present project experience, such as a perception project based on the BEVDet model, and to detail the algorithm architecture and detection pipeline [9].
- Interviewers ask about specific challenges encountered when deploying algorithms in real-world applications and how they were solved [10].
Group 4: Coding Assessment
- The interview includes a coding challenge in which candidates write NMS (Non-Maximum Suppression) post-processing code on the spot [11].
Group 5: Community and Networking
- A community has been set up for job seekers in autonomous driving and related fields; it has nearly 1,000 members from various companies and serves as a platform for networking and support [12].
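The NMS coding task mentioned in the coding assessment can be sketched as follows. This is a minimal greedy NMS in plain Python, assuming axis-aligned boxes in `(x1, y1, x2, y2)` form and a single class; the function names and the 0.5 default threshold are illustrative choices, not something specified in the interview notes.

```python
from typing import List, Sequence

Box = Sequence[float]  # (x1, y1, x2, y2) with x1 <= x2 and y1 <= y2


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes: List[Box], scores: List[float], iou_threshold: float = 0.5) -> List[int]:
    """Greedy NMS: return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Drop every remaining box that overlaps the kept box too much.
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep


if __name__ == "__main__":
    boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
    scores = [0.9, 0.8, 0.7]
    print(nms(boxes, scores))
```

In the example, the second box overlaps the first with an IoU of about 0.68, so it is suppressed at the 0.5 threshold while the distant third box is kept. Production code typically vectorizes the IoU computation and runs NMS per class.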
How is the autonomous driving job market right now, and what technical approaches are there?
自动驾驶之心· 2025-07-07 06:47
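Many readers have recently asked us how the autonomous driving industry is really doing and which roles and technical approaches exist, so here is a rundown of the current situation. All content comes from the AutoRobo job-seeking Knowledge Planet, a hub for job seekers in autonomous driving, embodied intelligence, and AI, with professional interview write-ups and job postings.
Autonomous driving levels and applications
Main functions: driving, parking, cockpit, V2X
Core system components: chips, software, sensors
Technology trends at a glance
1) Traditional autonomous driving pipeline
2) End-to-end autonomous driving
3) VLM approaches
4) VLA approaches
OEMs and autonomous driving companies
1) OEMs
New entrants: XPeng, Li Auto, NIO, Huawei, Zeekr, Xiaomi, Leapmotor, Voyah, Deepal (Changan), etc.
Established automakers: BYD, Geely, Changan, Chery (Exeed), Great Wall, SAIC (IM Motors), GAC (Aion)
Foreign automakers: Mercedes-Benz, Volkswagen, Hyundai, etc.
2) Suppliers
Publicly listed: Horizon Robotics, Pony.ai, Black Sesame Technologies, WeRide, iMotion, etc.
Not listed: Momenta, QCraft, DeepRoute.ai, Zhuoyu, DJI
Large tech companies: Baidu, Didi, JD.com, etc.
Others: SenseTime SenseAuto, Haomo.AI, NavInfo, Jingwei Hirain, etc.
Roles and directions at a glance
1) Traditional stack
Localization and mapping: 1. Localization and matching 2. Mapping (NeRF, splatting)
Perception: 1. Obstacles, traffic lights, ground elements 2. BEV algorithms, OCC, map-free
Late fusion: static late fusion, ...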
The "Whampoa Academy" of autonomous driving: a place that grinds away at technology
自动驾驶之心· 2025-07-06 12:30
Core Viewpoint
- The article discusses the transition of autonomous driving technology from Level 2/3 (assisted driving) to Level 4/5 (fully autonomous driving), highlighting the challenges and opportunities in the industry and the changing skill requirements for professionals in the field [2].
Industry Trends
- The shift toward high-level autonomous driving is creating a competitive landscape in which traditional sensor-based approaches, such as LiDAR, are challenged by cost-effective vision-based solutions like Tesla's [2].
- Demand for skills in reinforcement learning and advanced perception algorithms is rising, creating a sense of urgency among professionals to upgrade their capabilities [2].
Talent Market Dynamics
- Seasoned professionals are increasingly anxious about having to adapt to new technologies and methods, while newcomers struggle with the overwhelming number of career paths in the autonomous driving sector [2].
- Falling LiDAR costs, exemplified by Hesai Technology's price drop to $200 and BYD's 70% price reduction, signal a market shift that demands continuous learning and adaptation from industry professionals [2].
Community and Learning Resources
- The "Autonomous Driving Heart Knowledge Planet" aims to be a comprehensive learning community for professionals, offering resources and networking opportunities to help individuals navigate the rapidly changing autonomous driving landscape [7].
- The community has attracted nearly 4,000 members and over 100 industry experts, providing a platform for knowledge sharing and career advancement [7].
Technical Focus Areas
- The article outlines several key technical areas within autonomous driving, including end-to-end driving systems, perception algorithms, and the integration of AI models for improved performance [10][11].
- It emphasizes the importance of understanding subfields such as multi-sensor fusion, high-definition mapping, and AI model deployment, which are critical to developing autonomous driving technologies [7].
自动驾驶之心 job coaching has launched! Customized 1-on-1 job-search coaching
自动驾驶之心· 2025-07-06 12:11
Core Viewpoint
- The article introduces a new job coaching service focused on helping individuals transition into the intelligent driving sector, particularly targeting recent graduates and professionals without specific job experience in this field [2].
Summary by Sections
Coaching Scope
- Basic services include personalized 1-on-1 coaching sessions, analysis of the learner's profile, and the creation of a detailed learning plan to bridge the gap between current skills and job requirements [5][6].
- The service also offers resume optimization and job referral opportunities based on the learner's situation [6].
Pricing Structure
- The coaching service is priced at 8000 per person, which includes a minimum of 10 online meetings, each lasting at least one hour [4].
Advanced Services
- Additional services include project practice opportunities that can be added to resumes and simulated interviews that encompass both HR and business interviews, available for an extra fee [5][6].
Instructor Background
- Instructors are industry experts with over 8 years of experience in intelligent driving, having worked with leading companies in the sector [7][9].
What do the topics of 2025 top-conference papers tell us about upcoming research hotspots?
自动驾驶之心· 2025-07-06 08:44
Core Insights
- The article highlights the key research directions in computer vision and autonomous driving presented at the major conferences CVPR and ICCV, grouped into four areas: general computer vision, autonomous driving, embodied intelligence, and 3D vision [2][3].
Group 1: Research Directions
- In computer vision and image processing, the main research topics include diffusion models, image quality assessment, semi-supervised learning, zero-shot learning, and open-world detection [3].
- Autonomous driving research concentrates on end-to-end systems, closed-loop simulation, 3D Gaussian Splatting (3DGS), multimodal large models, diffusion models, world models, and trajectory prediction [3].
- Embodied intelligence focuses on vision-language-action (VLA) models, zero-shot learning, robotic manipulation, end-to-end systems, sim-to-real transfer, and dexterous grasping [3].
- The 3D vision domain emphasizes point cloud completion, single-view reconstruction, 3D Gaussian Splatting (3DGS), 3D matching, video compression, and Neural Radiance Fields (NeRF) [3].
Group 2: Research Support and Collaboration
- The article offers support for various research needs in autonomous driving, including large models, VLA, end-to-end autonomous driving, 3DGS, BEV perception, object tracking, and multi-sensor fusion [4].
- In embodied intelligence, support covers VLA, vision-language navigation, end-to-end systems, reinforcement learning, diffusion policy, sim-to-real, embodied interaction, and robotic decision-making [4].
- For 3D vision, the focus is on point cloud processing, 3DGS, and SLAM [4].
- General computer vision support includes diffusion models, image quality assessment, semi-supervised learning, and zero-shot learning [4].
Resource roundup | VLM, world models, end-to-end
自动驾驶之心· 2025-07-06 08:44
Core Insights
- The article discusses the advancements and applications of visual language models (VLMs) and large language models (LLMs) in the field of autonomous driving and intelligent transportation systems [1][4][19].
Summary by Sections
Overview of Visual Language Models
- Visual language models are becoming increasingly important in the context of autonomous driving, enabling better understanding and interaction between visual data and language [4][10].
Recent Research and Developments
- Several recent papers presented at conferences like CVPR and NeurIPS focus on enhancing the capabilities of VLMs and LLMs, including methods for improving object detection, scene understanding, and generative capabilities in driving scenarios [5][7][10][12].
Applications in Autonomous Driving
- The integration of world models with VLMs is highlighted as a significant advancement, allowing for improved scene representation and predictive capabilities in autonomous driving systems [10][13][19].
Knowledge Distillation and Transfer Learning
- Knowledge distillation techniques are being explored to enhance the performance of vision-language models, particularly in tasks related to detection and segmentation [8][9].
Future Directions
- The article emphasizes the potential of foundation models in advancing autonomous vehicle technologies, suggesting a trend towards more scalable and efficient models that can handle complex driving environments [10][19].
DeepSeek technical deep dive (3): the evolution of MoE
自动驾驶之心· 2025-07-06 08:44
Core Viewpoint
- The article traces the evolution of DeepSeek's Mixture-of-Experts (MoE) design from DeepSeekMoE (V1) to DeepSeek V3, highlighting the innovations and improvements at each step while the MoE technical route itself stayed constant [1].
Summary by Sections
1. Development History of MoE
- MoE was first introduced in 1991 in the paper "Adaptive Mixtures of Local Experts", and its framework has remained consistent over the years [2].
- Google has been a key player in MoE development, particularly with the release of "GShard" in 2020, which scaled models to 600 billion parameters [5].
2. DeepSeek's Work
2.1. DeepSeek-MoE (V1)
- DeepSeekMoE (V1) was released in January 2024 and addressed two main issues: knowledge mixing and redundancy among experts [15].
- The architecture introduced fine-grained expert segmentation and shared expert isolation to increase specialization and reduce redundancy [16].
2.2. DeepSeek V2 MoE Upgrade
- V2 introduced a device-limited routing mechanism that controls communication cost by ensuring that the activated experts are spread across a limited number of devices [28].
- A communication balance loss was added to address potential congestion at the receiving end of the communication [29].
2.3. DeepSeek V3 MoE Upgrade
- V3 kept the fine-grained and shared expert designs while switching the gating function from softmax to sigmoid to better differentiate the scores of different experts [36][38].
- The auxiliary loss for load balancing was removed to avoid its negative impact on the main model and replaced by a dynamic bias for load balancing [40].
- A sequence-wise auxiliary loss was introduced to balance token distribution among experts at the sequence level [42].
3. Summary of DeepSeek's Innovations
- The evolution of DeepSeek MoE has focused on balancing general and specialized knowledge through shared and fine-grained experts, while addressing load balancing through various auxiliary losses [44].
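The V3 routing changes described above (sigmoid gating plus a dynamic bias in place of an auxiliary load-balancing loss) can be sketched as follows. This is a minimal single-token illustration under stated assumptions, not DeepSeek's implementation: `route_token`, `update_bias`, and the `step` size are hypothetical names and values, and the sketch assumes the bias affects only which experts are selected while the gate weights come from the unbiased sigmoid scores.

```python
import math
from typing import List, Tuple


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def route_token(logits: List[float], bias: List[float], k: int) -> Tuple[List[int], List[float]]:
    """Pick the top-k experts for one token.

    Assumption: the bias only influences selection; gate weights are the
    unbiased sigmoid scores, normalized over the selected experts.
    """
    scores = [sigmoid(z) for z in logits]
    ranked = sorted(range(len(scores)), key=lambda e: scores[e] + bias[e], reverse=True)
    chosen = ranked[:k]
    total = sum(scores[e] for e in chosen)
    gates = [scores[e] / total for e in chosen]
    return chosen, gates


def update_bias(bias: List[float], load: List[int], step: float = 0.01) -> List[float]:
    """Lower the bias of overloaded experts and raise it for underloaded ones."""
    mean_load = sum(load) / len(load)
    new_bias = []
    for b, n in zip(bias, load):
        if n > mean_load:
            new_bias.append(b - step)
        elif n < mean_load:
            new_bias.append(b + step)
        else:
            new_bias.append(b)
    return new_bias


if __name__ == "__main__":
    logits = [2.0, 1.0, 0.0, -1.0]
    print(route_token(logits, [0.0, 0.0, 0.0, 0.0], k=2))
    print(update_bias([0.0, 0.0, 0.0, 0.0], load=[5, 1, 2, 0], step=0.1))
```

Because the bias is adjusted outside the loss, balancing pressure does not add a gradient term that competes with the language-modeling objective, which matches the motivation given above for dropping the auxiliary loss.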
Embodied intelligence: time to hand in the exam paper...
自动驾驶之心· 2025-07-06 03:10
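Embodied intelligence has without doubt been the hottest technology keyword of the past two years, moving from dormancy to frenzy and then to sobriety. In the first half of this year, many companies tried to deliver on mass production of embodied systems. The industry is past the point where any company's demo or PR release can cause a stir: technical practitioners can quickly see through them, and real reliability matters more than a good story. Recently, D-Robotics demonstrated the Unitree Go2 quadruped robot dog with commendable results, and more mass-produced products are likely to follow. Upgraded perception and multimodal fusion are an important part of the embodied technology roadmap. Beyond visual perception, tactile perception has been a focus of the past two years, especially for dexterous hands, where force control greatly improves manipulation precision and feedback on outcomes. Multimodal sensor fusion lets a robot process visual, auditory, tactile, and other information at the same time; the fusion happens not only in hardware but, more importantly, through deep integration at the algorithm level, which greatly improves the accuracy and completeness of environment perception. "Brain" algorithms driven by large models keep improving robots' experiential knowledge and understanding of the world. In humanoid robots in particular, large models use multimodal data to improve perception and to advance autonomous learning, decision-making, and planning; combined with motion training and behavioral-interaction training, they are expected to improve the generalization of actions. Lightweight model design has also become an urgent need for real-world deployment: what is needed are low-compute, multimodal, cross-platform lightweight mod ...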
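New breakthrough from Google and Berkeley: reconstructing 4D dynamic scenes from a single video, with trajectory tracking accuracy up 73%!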
自动驾驶之心· 2025-07-05 13:41
Core Viewpoint
- The research introduces "Shape of Motion", a method that combines 3D Gaussians with an SE(3) motion representation and achieves a 73% improvement in 3D tracking accuracy over existing methods, with significant applications in AR/VR and autonomous driving [2][4].
Summary by Sections
Introduction
- Reconstructing a dynamic scene from monocular video is likened to feeling an elephant in the dark, because so much information is missing [7].
- Traditional methods rely on multi-view video or depth sensors, which makes them less effective for dynamic scenes captured with a single camera [7].
Core Contribution
- The "Shape of Motion" technique reconstructs a complete 4D scene (3D space + time) from a single video, which allows object motion to be tracked and the scene to be rendered from any viewpoint [9][10].
- Its two main innovations are a low-dimensional motion representation built on SE(3) motion bases and the integration of data-driven priors into a globally consistent dynamic scene representation [9][12].
Technical Analysis
- The method uses 3D Gaussians as the basic unit of scene representation, which allows real-time rendering [10].
- Data-driven priors, such as monocular depth estimation and long-range 2D trajectories, are used to overcome the under-constrained nature of monocular video reconstruction [11][12].
Experimental Results
- The method outperforms existing techniques on the iPhone dataset, reaching 73.3% accuracy in 3D tracking and a PSNR of 16.72 for novel view synthesis [17][18].
- The 3D tracking error (EPE) is as low as 0.16 on the Kubric synthetic dataset, a 21% improvement over baseline methods [20].
Discussion and Future Outlook
- The current method is limited by its training time and its reliance on accurate camera pose estimation [25].
- Future directions include reducing training time, strengthening view generation, and developing fully automated segmentation [25].
Conclusion
- The "Shape of Motion" research marks a significant advance in monocular dynamic reconstruction, with potential applications in real-time tracking for AR glasses and autonomous systems [26].
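The idea of a low-dimensional motion representation built from SE(3) motion bases can be illustrated with a small sketch. It assumes each scene point carries blend weights over a few shared rigid transforms, and it blends the transformed positions linearly, which is a simplification and not necessarily how the paper combines its bases. `rot_z`, `move_point`, and `epe` are hypothetical helper names; `epe` is taken here to be the mean Euclidean distance between predicted and reference 3D points.

```python
import math
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]
Rigid = Tuple[List[List[float]], Vec3]  # (3x3 rotation matrix, translation)


def rot_z(angle: float) -> List[List[float]]:
    """Rotation about the z axis by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def apply_rigid(T: Rigid, p: Vec3) -> Vec3:
    """Apply one SE(3) transform: rotate, then translate."""
    R, t = T
    return tuple(sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3))


def move_point(p: Vec3, weights: Sequence[float], bases: Sequence[Rigid]) -> Vec3:
    """Blend the positions produced by each shared motion basis (simplified)."""
    out = [0.0, 0.0, 0.0]
    for w, T in zip(weights, bases):
        q = apply_rigid(T, p)
        for i in range(3):
            out[i] += w * q[i]
    return tuple(out)


def epe(pred: Sequence[Vec3], ref: Sequence[Vec3]) -> float:
    """Mean end-point error between predicted and reference 3D positions."""
    return sum(math.dist(p, r) for p, r in zip(pred, ref)) / len(pred)


if __name__ == "__main__":
    identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    bases = [(identity, (1.0, 0.0, 0.0)), (rot_z(math.pi / 2), (0.0, 0.0, 0.0))]
    moved = move_point((1.0, 0.0, 0.0), (0.5, 0.5), bases)
    print(moved, epe([moved], [(1.0, 0.5, 0.0)]))
```

The point of the low-dimensional design is that thousands of points share a handful of bases, so only the per-point weights and the per-frame basis transforms need to be estimated from the under-constrained monocular input.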