自动驾驶之心

NVIDIA's "New Brain" for Embodied Robots Is About to Be Unveiled
自动驾驶之心· 2025-08-25 23:34
Author | Cailian Press

Today, NVIDIA's robotics account posted a teaser on social media: "Have you cleared your schedule? August 25, 2025." The accompanying image shows a black gift box topped with a card reading "Enjoy!" and signed by Jensen Huang.

As for what is inside the box, NVIDIA released a preview video two days earlier. In it, Jensen Huang leans over the card and writes, "To the robots: enjoy your new brain!" The camera then cuts to a humanoid robot standing in front of the box, picking up the card and "reading" it.

Brokerages believe that the emergence of AI companies such as DeepSeek is driving the development of general-purpose robot foundation models and helping humanoid robots achieve embodied intelligence; the humanoid robot industry chain has entered a "hundred flowers blooming" stage. Humanoid robots entering industrial settings has become a high-certainty application trend at home and abroad, and commercialization is within reach. They recommend watching the domestic component suppliers that stand to benefit, as well as upcoming catalysts along the industry chain, such as product announcements from humanoid robot makers in China and abroad.

Notably, at SIGGRAPH, the industry's top conference, on August 12, NVIDIA released the open-source physical AI ...
It's 2025: How Far Have Multimodal Large Models for Generation and Understanding Come?
自动驾驶之心· 2025-08-25 23:34
Core Viewpoint
- The article surveys the development of unified multimodal large models through mid-2025, focusing on image understanding and generation, and highlights the significant advances and open challenges in this field [1][2].

Group 1: Overview of Multimodal Large Models
- Here, "unified multimodal large models" refers primarily to models that integrate both image understanding and generation; other modalities (e.g., Omni-LLM) are excluded because of the smaller body of academic work in that area [3].
- Notable early works include Google's Unified-IO, Alibaba's OFA, and Fudan's AnyGPT, all of which significantly influenced subsequent research [3].

Group 2: Key Research Directions
- Research on "integrated generation and understanding" centers on two questions: how to design the visual tokenizer and how to construct a suitable model architecture [14].
- ByteDance's TokenFlow employs different visual encoders for understanding and generation, using high-level semantic features for understanding and low-level features for generation [16][17].

Group 3: Model Architectures and Techniques
- The Semantic-Priority Codebook (SPC) approach was introduced to improve image reconstruction quality, underscoring the importance of semantic features in the quantization process [19][23].
- QLIP, from UT Austin and NVIDIA, optimizes the visual tokenizer by aligning generation-oriented visual features with semantic information, using a unified visual encoder for both tasks [28][30].

Group 4: Training Strategies
- QLIP is trained in two phases: the first learns semantically rich feature representations, while the second improves image reconstruction quality [30][32].
- UniTok employs multi-codebook quantization to raise codebook utilization, integrating visual features for both understanding and generation [35][36].

Group 5: Recent Innovations
- DualToken uses a single visual encoder to extract features for both tasks, with separate visual codebooks for semantic and pixel features (a minimal sketch of this dual-codebook idea follows below) [39][41].
- Tencent's TokLIP likewise adopts a single-encoder approach, aligning visual features with text features through multiple loss functions [42][44].
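To make the dual-codebook idea concrete, here is a minimal PyTorch sketch in the spirit of DualToken: one shared encoder whose features are quantized against two codebooks, one for semantic tokens (understanding) and one for pixel tokens (generation). All module names, codebook sizes, and the toy patch encoder are illustrative assumptions, not any paper's released code.

```python
import torch
import torch.nn as nn


class VectorQuantizer(nn.Module):
    """Nearest-neighbor codebook lookup with a straight-through gradient."""

    def __init__(self, num_codes: int, dim: int):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (batch, tokens, dim); squared distance to every code, then argmin.
        w = self.codebook.weight                                  # (num_codes, dim)
        d = z.pow(2).sum(-1, keepdim=True) - 2 * z @ w.t() + w.pow(2).sum(-1)
        z_q = self.codebook(d.argmin(dim=-1))                     # quantized tokens
        return z + (z_q - z).detach()                             # straight-through


class DualCodebookTokenizer(nn.Module):
    """One shared encoder, two codebooks (hypothetical sizes)."""

    def __init__(self, dim: int = 256):
        super().__init__()
        # Toy patch encoder: 16x16 patches -> dim-dimensional tokens.
        self.encoder = nn.Conv2d(3, dim, kernel_size=16, stride=16)
        self.semantic_vq = VectorQuantizer(num_codes=8192, dim=dim)  # understanding
        self.pixel_vq = VectorQuantizer(num_codes=8192, dim=dim)     # generation

    def forward(self, image: torch.Tensor):
        feats = self.encoder(image).flatten(2).transpose(1, 2)    # (B, T, dim)
        return self.semantic_vq(feats), self.pixel_vq(feats)


tok = DualCodebookTokenizer()
sem, pix = tok(torch.randn(1, 3, 224, 224))
print(sem.shape, pix.shape)  # torch.Size([1, 196, 256]) twice
```

In a real system the two token streams would feed different heads (an LLM for understanding, a decoder for generation); the sketch only shows the quantization split itself.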
How Is XPeng's Beyond-Visual-Range Autonomous Driving VLA Implemented?
自动驾驶之心· 2025-08-25 23:34
Core Viewpoint
- The article presents NavigScene, a novel dataset and methodology from XPeng Motors and the University of Central Florida that bridges the gap between local perception and global navigation in autonomous driving, enhancing reasoning and planning in complex environments [3][9][10].

Group 1: Overview of NavigScene
- NavigScene integrates local sensor data with global navigation context, addressing the limitation that existing autonomous driving systems rely mainly on immediate visual information [3][5].
- The dataset comprises two subsets, NavigScene-nuScenes and NavigScene-NAVSIM, which pair multi-view sensor inputs with natural-language navigation instructions [9][14].

Group 2: Methodologies
- Three complementary methodologies are proposed for exploiting NavigScene (a minimal fusion sketch follows after this summary):
  1. Navigation-guided reasoning (NSFT) enhances vision-language models by incorporating navigation context [10][20].
  2. Navigation-guided preference optimization (NPO) improves the generalization of vision-language models to new navigation scenarios [24][26].
  3. A navigation-guided vision-language-action (NVLA) model integrates navigation guidance with traditional driving models for better perception, prediction, and planning [27][29].

Group 3: Experimental Results
- Experiments show that integrating NavigScene significantly improves vision-language models on a range of driving tasks, including reasoning and planning [31][35].
- Combining NSFT and NPO yields notable gains on complex driving scenarios, reducing collision rates and improving trajectory accuracy [43][47].
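As a rough illustration of the navigation-guided idea, the sketch below fuses an embedded navigation instruction with local sensor tokens via cross-attention before a trajectory head. This is a hypothetical sketch, not NavigScene's released code; every module name, dimension, and the mean-pooled head are assumptions.

```python
import torch
import torch.nn as nn


class NavigationGuidedPlanner(nn.Module):
    """Hypothetical planner: sensor tokens attend to navigation-text tokens."""

    def __init__(self, sensor_dim: int = 256, text_dim: int = 256, horizon: int = 6):
        super().__init__()
        self.horizon = horizon
        # Cross-attention: queries come from perception, keys/values from language.
        self.cross_attn = nn.MultiheadAttention(
            sensor_dim, num_heads=8, kdim=text_dim, vdim=text_dim, batch_first=True)
        self.head = nn.Linear(sensor_dim, horizon * 2)  # (x, y) per future step

    def forward(self, sensor_tokens: torch.Tensor, nav_tokens: torch.Tensor):
        # sensor_tokens: (B, S, sensor_dim) from camera/LiDAR encoders
        # nav_tokens:    (B, N, text_dim) from a language model over the instruction
        fused, _ = self.cross_attn(sensor_tokens, nav_tokens, nav_tokens)
        return self.head(fused.mean(dim=1)).view(-1, self.horizon, 2)


planner = NavigationGuidedPlanner()
traj = planner(torch.randn(2, 100, 256), torch.randn(2, 12, 256))
print(traj.shape)  # torch.Size([2, 6, 2]) -> six future (x, y) waypoints
```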
Master's at a bottom-tier C9 with a bachelor's from a non-985/211 school, and feeling a bit lost now...
自动驾驶之心· 2025-08-25 23:34
Core Viewpoint
- The article emphasizes the importance of choosing a promising direction in autonomous driving and robotics, and of continuously learning and adapting to industry trends [1][2].

Group 1: Industry Trends and Opportunities
- The autonomous driving industry remains vibrant and offers numerous opportunities, despite concerns about job saturation in traditional control systems [2][3].
- The "Autonomous Driving Heart" community aims to be a comprehensive platform for knowledge sharing, technical discussion, and job hunting in the autonomous driving sector, targeting nearly 10,000 members within two years [2][3][19].
- The community provides access to over 40 technical roadmaps and invites industry experts to answer questions, facilitating knowledge transfer and networking [3][19].

Group 2: Learning and Development Resources
- Resources include video content, learning paths, and practical problem-solving discussions for both beginners and advanced learners [2][3][19].
- A detailed compilation of over 60 autonomous-driving datasets is available, covering areas such as perception and trajectory prediction [29].
- Numerous live sessions with industry leaders provide insight into the latest technologies and methodologies in autonomous driving [55].

Group 3: Job Opportunities and Networking
- A job-referral mechanism with multiple autonomous driving companies connects job seekers directly with potential employers [10][18].
- Regular job postings and internship listings keep members informed about the latest openings in the industry [26][18].
- Members can freely ask about career choices and research directions and receive guidance from experienced professionals in the field [58][59].
Reading the Evolution of Autonomous Driving Technology Through Li Auto's VLA...
自动驾驶之心· 2025-08-25 11:29
Core Insights
- The article examines advances in Li Auto's VLA driver model, highlighting its improved semantic understanding, reasoning, and trajectory planning, all essential for autonomous driving [1][3].
- The VLA model evolved from VLM+E2E and integrates several cutting-edge techniques: end-to-end learning, trajectory prediction, vision-language models, and reinforcement learning [3].

Summary by Sections

VLA Model Capabilities
- The VLA model enhances semantic understanding through multimodal input, reasons via a chain-of-thought approach, and closely mimics human driving intuition through trajectory planning [1].
- It possesses four core abilities: spatial understanding, reasoning, communication and memory, and behavior [1].

Research and Development Focus
- Academia is increasingly shifting toward large models and VLA, while industry continues to optimize traditional perception and planning tasks [3].
- Interest in VLA is growing, with many students seeking guidance on related research papers, indicating a significant opportunity for academic contributions [3].

Course Structure and Offerings
- A structured course is offered to help students systematically master the key theory and build practical VLA research skills [5][12].
- The course spans 12 weeks of online group research followed by 2 weeks of paper guidance, culminating in a maintenance period of ongoing support [13][33].

Enrollment and Requirements
- Each session is limited to 6-8 participants and targets individuals with a background in deep learning and basic knowledge of autonomous driving algorithms [11][20].
- Participants should have a foundational grasp of Python and PyTorch; access to high-performance computing resources is recommended [20].

Learning Outcomes
- Students study classic and cutting-edge research papers, develop coding skills, and receive guidance on writing and submitting academic papers [19][33].
- The course aims to produce a draft research paper as a tangible outcome of the learning experience [19][33].
Course Officially Wrapped! Dynamic/Static, OCC, and End-to-End Auto-Labeling, All Covered
自动驾驶之心· 2025-08-25 03:15
Core Viewpoint
- The article highlights autonomous driving companies' growing investment in automatic labeling and the challenges of 4D auto-labeling, which couples 3D spatial data with the temporal dimension [1][2].

Group 1: Challenges in Automatic Labeling
- The main difficulties of 4D auto-labeling are strict temporal-consistency requirements, complex multi-modal data fusion, hard-to-generalize dynamic scenes, the tension between labeling efficiency and cost, and the high scene-generalization bar for mass production [2][3].

Group 2: Course Overview
- The course is a comprehensive tutorial on the full 4D auto-labeling pipeline, covering core algorithms and hands-on practice grounded in real-world examples [2][3][4].
- Key topics include dynamic obstacle detection, SLAM reconstruction principles, static-element labeling on reconstruction maps, and the mainstream paradigms of end-to-end labeling [3][4][5][6].

Group 3: Detailed Course Structure
- Chapter 1 introduces the basics of 4D auto-labeling, its applications, required data, and the algorithms involved, focusing on time-space synchronization and sensor calibration [4].
- Chapter 2 covers the dynamic-obstacle labeling pipeline, offline 3D object detection algorithms, and practical fixes for common engineering problems [6].
- Chapter 3 covers LiDAR and visual SLAM reconstruction, why it matters, and the basic modules of reconstruction algorithms [7].
- Chapter 4 addresses automating static-element labeling, emphasizing accurate detection and tracking [9].
- Chapter 5 centers on OCC labeling of general obstacles, detailing input-output requirements and ground-truth generation (a minimal voxelization sketch follows below) [10].
- Chapter 6 is dedicated to end-to-end ground-truth generation, integrating the preceding elements into one cohesive pipeline [12].
- Chapter 7 discusses the data closed loop, industry pain points, and interview preparation for related positions [14].

Group 4: Target Audience and Course Benefits
- The course suits researchers, students, and professionals who want to deepen their understanding of 4D auto-labeling and strengthen their algorithm-development capabilities [19][23].
- Participants gain practical 4D auto-labeling skills, including knowledge of cutting-edge algorithms and the ability to solve real-world problems [19].
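To ground the OCC chapter, here is a minimal sketch of one typical step in occupancy ground-truth generation: voxelizing aggregated, ego-aligned LiDAR points into a fixed grid. The range and voxel size are illustrative assumptions, not the course's exact pipeline, and a production pipeline would also assign semantic labels and handle free-space ray casting.

```python
import numpy as np


def voxelize_occupancy(points: np.ndarray,
                       pc_range=(-40.0, -40.0, -3.0, 40.0, 40.0, 5.0),
                       voxel_size: float = 0.4) -> np.ndarray:
    """points: (N, 3) aggregated, ego-aligned LiDAR points -> boolean grid."""
    lo, hi = np.array(pc_range[:3]), np.array(pc_range[3:])
    dims = np.round((hi - lo) / voxel_size).astype(int)  # here (200, 200, 20)
    grid = np.zeros(dims, dtype=bool)

    # Map each point to a voxel index; keep only points inside the grid.
    idx = np.floor((points - lo) / voxel_size).astype(int)
    valid = np.all((idx >= 0) & (idx < dims), axis=1)
    grid[tuple(idx[valid].T)] = True
    return grid


pts = np.random.uniform(-50.0, 50.0, size=(10000, 3))
occ = voxelize_occupancy(pts)
print(occ.shape, int(occ.sum()))  # (200, 200, 20) and the occupied-voxel count
```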
Controlling Investment in a Leading Tier 1 by a Central SOE Automaker Is Confirmed~
自动驾驶之心· 2025-08-24 23:32
Core Viewpoint
- The article reports the strategic investment in, and takeover of control of, a leading autonomous driving algorithm provider, referred to as Company Z, by a central state-owned enterprise, signaling a significant shift in the industry's competitive landscape.

Group 1: Investment and Control
- Company Z has confirmed the strategic investment and change of control; approval from the relevant authorities is pending official announcement [4].
- The deal means Company Z officially joins the "national team," gaining access to a broader customer base and substantial financial resources [5].

Group 2: Competitive Landscape
- Company Z's entry into the Horizon ecosystem is expected to be a major boon for Horizon, given Z's strong engineering record on mid-to-low compute chip platforms [6].
- Competition is intensifying: Company Z is emerging as a formidable rival to existing algorithm providers, particularly clouding the IPO prospects of another key player, Company QZ [6].

Group 3: Industry Trends
- Autonomous driving is evolving into a large-scale industrial undertaking that demands substantial personnel, data, and technology, moving beyond small entrepreneurial teams [6].
- Partnerships between major players and algorithm providers are becoming essential for competitiveness, as seen in various collaborations across the industry [6].
From Zero! A Learning Roadmap for End-to-End Autonomous Driving and VLA~
自动驾驶之心· 2025-08-24 23:32
Core Viewpoint
- The article emphasizes the importance of understanding end-to-end (E2E) algorithms and vision-language-action (VLA) models in autonomous driving, given the rapid development and complexity of the technology stack involved [2][32].

Summary by Sections

Introduction to End-to-End and VLA
- The article traces the evolution of large language models over the past five years, a period of significant technological advancement in the field [2].

Technical Foundations
- The Transformer architecture is introduced as the fundamental building block for understanding large models, with a focus on attention mechanisms and multi-head attention (a minimal attention sketch follows below) [8][12].
- Tokenization via BPE (byte-pair encoding) and positional encoding are explained as essential for processing sequences in these models [13][9].

Course Overview
- A new course, "End-to-End and VLA Autonomous Driving," offers a comprehensive view of the technology stack and its practical applications in autonomous driving [21][33].
- The course is structured into five chapters, from basic E2E algorithms to advanced VLA methods, including practical assignments [36][48].

Key Learning Objectives
- Participants learn to classify research papers, extract their innovations, and develop their own research frameworks [34].
- Emphasis is placed on integrating theory and practice so that learners can apply their knowledge effectively [35].

Industry Demand and Career Opportunities
- Demand for VLA/VLM algorithm experts is strong, with salary ranges between 40K and 70K for positions requiring 3-5 years of experience [29].
- The course is positioned as a pathway into autonomous driving algorithm roles, particularly those tied to emerging technologies [28].
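Since the roadmap starts from attention, here is a minimal sketch of scaled dot-product attention, the core operation inside every Transformer layer; the shapes and toy inputs are illustrative.

```python
import math

import torch


def scaled_dot_product_attention(q, k, v, mask=None):
    # q, k, v: (batch, heads, seq_len, head_dim)
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = torch.softmax(scores, dim=-1)  # one distribution per query token
    return weights @ v


q = k = v = torch.randn(1, 8, 16, 64)  # 8 heads, 16 tokens, 64 dims per head
out = scaled_dot_product_attention(q, k, v)
print(out.shape)  # torch.Size([1, 8, 16, 64])
```

Multi-head attention is this same operation run over several head dimensions in parallel, with learned projections before and after; that split is what the course's Transformer chapter builds on.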
What Are the Entry Points for Moving from Autonomous Driving to Embodied Intelligence?
自动驾驶之心· 2025-08-24 23:32
Core Viewpoint
- The article discusses the transition from autonomous driving to embodied intelligence, highlighting the similarities and differences in algorithms and tasks between the two fields [1].

Group 1: Algorithm and Task Comparison
- Embodied intelligence largely inherits the algorithms used in robotics and autonomous driving, such as training and fine-tuning methods, as well as large models [1].
- Notable differences lie in specific tasks, including data collection methods and the greater emphasis on execution hardware and structure [1].

Group 2: Community and Learning Resources
- A full-stack learning community, "Embodied Intelligence Heart," has been established to share knowledge on algorithms, data collection, and hardware solutions in embodied intelligence [1].
- Key focus areas include VLA, VLN, Diffusion Policy, reinforcement learning, robotic-arm grasping, pose estimation, robot simulation, multimodal large models, chip deployment, sim2real, and robot hardware structure [1].
Beating a Host of SOTA Methods! Huawei's MoVieDrive: The Latest World Model for Multimodal Surround-View Scene Generation in Autonomous Driving~
自动驾驶之心· 2025-08-24 23:32
Paper authors | Guile Wu et al.
Editor | 自动驾驶之心

Foreword & the Author's Take

Huawei Noah's Ark Lab and the University of Toronto present MoVieDrive, a new multimodal surround-view scene generation algorithm for autonomous driving that outperforms CogVideoX and a host of other SOTA methods.

In recent years, video generation has shown clear strengths for urban scene synthesis in autonomous driving. Existing video generation methods for autonomous driving, however, focus mainly on RGB video and lack the ability to generate multiple modalities, even though multimodal data such as depth maps and semantic maps are essential for holistic urban scene understanding. Separate models could be used to generate the different modalities, but this complicates deployment and forgoes the complementary cues across modalities. To address this, the paper proposes a new multimodal surround-view video generation method for autonomous driving. Specifically, the authors build a unified diffusion T... composed of modality-shared components and modality-specific components (a toy sketch of this split follows below) ...
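To illustrate the modality-shared/modality-specific split described above, here is a minimal PyTorch sketch: per-modality projections around one shared denoising backbone, so RGB, depth, and semantic streams can exchange complementary cues through attention. This is an assumption-laden toy, not MoVieDrive's actual architecture; timestep conditioning and the surround-view camera structure are omitted.

```python
import torch
import torch.nn as nn

MODALITIES = ["rgb", "depth", "semantic"]


class MultiModalDenoiser(nn.Module):
    """Toy denoiser: modality-specific projections around a shared backbone."""

    def __init__(self, dim: int = 256, depth: int = 4):
        super().__init__()
        # Modality-specific components: project each modality's latents in and out.
        self.enc = nn.ModuleDict({m: nn.Linear(dim, dim) for m in MODALITIES})
        self.dec = nn.ModuleDict({m: nn.Linear(dim, dim) for m in MODALITIES})
        # Modality-shared component: one transformer denoises all modalities
        # jointly, letting them exchange complementary cues via self-attention.
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.shared = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, latents):
        # latents[m]: (B, T, dim) noisy tokens for modality m at one diffusion step
        tokens = torch.cat([self.enc[m](latents[m]) for m in MODALITIES], dim=1)
        joint = self.shared(tokens)                     # (B, 3*T, dim)
        chunks = joint.chunk(len(MODALITIES), dim=1)    # split back per modality
        return {m: self.dec[m](c) for m, c in zip(MODALITIES, chunks)}


model = MultiModalDenoiser()
x = {m: torch.randn(2, 50, 256) for m in MODALITIES}
print({m: v.shape for m, v in model(x).items()})
```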