自动驾驶之心
Fei-Fei Li Releases a World Model That Runs Inference on a Single GPU. Can Autonomous Driving Applications Be Far Behind?
自动驾驶之心· 2025-10-21 00:06
Fei-Fei Li's latest world model has been flooding everyone's feeds. Editor | 量子位 (QbitAI). Source | "Fei-Fei Li releases a new world model that runs on a single GPU!" Just now, the "godmother of AI" personally announced a new model, RTFM (A Real-Time Frame Model). It offers real-time operation, persistence, and 3D consistency, and, most importantly, it runs on a single H100 GPU. RTFM is designed around three core principles: Efficiency: with just one H100 GPU, RTFM performs inference in real time at interactive frame rates. Scalability: the architecture keeps scaling as data and compute grow; it learns end to end from massive video data through a general-purpose architecture, building a model of the 3D world without relying on explicit 3D representations. Persistence: users can interact with RTFM for unlimited time, and every scene is preserved; the persistent 3D worlds it builds do not vanish when the viewpoint changes. Details below. World models need massive compute: a powerful world model can reconstruct, generate, and simulate persistent, interactive, physically accurate worlds in real time. Such models will transform industries from media to robotics. In the past ...
Experience from Switching Careers into Multiple Major Autonomous Driving Companies
自动驾驶之心· 2025-10-21 00:06
Core Insights - The article emphasizes the importance of seizing opportunities and continuous learning in the rapidly evolving field of autonomous driving, as illustrated by the experiences of a professional who transitioned from banking to the autonomous driving industry [1][2]. Group 1: Career Development in Autonomous Driving - The transition from a traditional banking career to the autonomous driving sector was facilitated by the growing demand for talent in the industry, particularly in 2020 [1]. - The individual initially started in algorithm evaluation, gradually moving to more advanced roles in perception and safety algorithms, highlighting the significance of building foundational skills and adapting to industry trends [1]. Group 2: Community and Learning Resources - The "Autonomous Driving Heart Knowledge Planet" community has over 4,000 members and aims to grow to nearly 10,000 in the next two years, providing a platform for knowledge sharing and technical discussions [4][5]. - The community offers a comprehensive learning environment, including video content, written materials, learning pathways, and job exchange opportunities, catering to both beginners and advanced learners [7][11]. Group 3: Technical Learning and Support - The community has organized resources covering over 40 technical pathways in autonomous driving, addressing various topics such as end-to-end learning, multi-modal models, and data annotation practices [19][21]. - Members can access practical guidance on entering the field, including specific learning routes for different aspects of autonomous driving technology [8][13]. Group 4: Industry Engagement and Networking - The community collaborates with industry leaders and academic experts to provide insights into the latest trends and challenges in autonomous driving, fostering a network for professional growth [9][18]. 
- Members are encouraged to engage with industry professionals for job referrals and to stay updated on academic advancements and industrial applications [21][23].
World Models Made Simple | A Survey of the VQ Family of Papers (VQ-VAE / VQ-GAN / RQ-VAE, etc.)
自动驾驶之心· 2025-10-21 00:06
Editor | 自动驾驶之心. We invited Zhihu expert @论文推土机 to organize the key papers of the VQ family within the world-model technology stack and share them here. Author | 论文推土机. Why discretize? Applying autoregression directly at the pixel level: The pixel-level AR dilemma: autoregression directly in pixel space takes far too many steps (a 256×256 image needs roughly 200,000 steps), which is impractical. The mainstream "compress first, then generate" approach and its pitfall: image tokenizers such as VQ-VAE, VQ-GAN, and FSQ generate on a 32×32 or 16×16 grid and then decode back to pixels; but this is strong compression and introduces information loss (see SEED's reconstruction visualizations: the semantics are right, but the details drift). An information-theoretic lower bound: estimating from the average entropy of ImageNet-64, a vocabulary of size V carries log2(V) bits of information per token. To "losslessly" carry an image's information in a sequence of length L = 32×32 or 16×16, the vocabulary would have to grow to an absurd scale, far beyond what existing codebooks can offer: strong compression is necessarily lossy. Still, the biggest problem with operating directly in pixel space remains: the sequences are too long and generation is too slow. In most application scenarios, images ...
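The "image tokenizer" step described above, mapping continuous encoder latents to discrete codebook indices, can be sketched as follows. This is a minimal VQ-VAE-style nearest-neighbor lookup; the function name `vq_quantize` and its interface are assumptions for illustration, not any specific paper's implementation.

```python
import torch

def vq_quantize(z, codebook):
    """Nearest-neighbor vector quantization, VQ-VAE style (illustrative sketch).

    z:        (N, D) continuous encoder outputs
    codebook: (V, D) embedding table of V discrete codes
    Returns the quantized vectors and their discrete token indices.
    """
    # Euclidean distance from every latent to every codebook entry
    dists = torch.cdist(z, codebook)   # (N, V)
    indices = dists.argmin(dim=1)      # (N,) discrete tokens, each in [0, V)
    z_q = codebook[indices]            # (N, D) quantized latents
    # Straight-through estimator: the forward pass uses z_q, but gradients
    # flow back to the encoder as if quantization were the identity
    z_q = z + (z_q - z).detach()
    return z_q, indices
```

Each index is one "token" carrying at most log2(V) bits, which is exactly why the lower-bound argument above caps how much image information a short token grid can hold.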
See You in Hangzhou! 具身智能之心 Sponsors IROS for the First Time and Presents Awards On-Site
自动驾驶之心· 2025-10-20 06:30
As robotic systems move ever closer to the real world, the stability, robustness, and generalization of perception systems are becoming key factors limiting deployment. Under complex conditions such as dynamic crowds, severe weather, sensor failures, and cross-platform deployment, traditional perception algorithms often suffer sharp performance drops. The RoboSense Challenge 2025 was created in response. The challenge aims to systematically evaluate robots' perception and understanding in real-world scenarios, promote research on the robustness of multimodal perception models, and encourage innovation in cross-modal fusion and task generalization.

| Important Dates | |
| --- | --- |
| Registration | From June 2025 |
| Competition Server Online | June 15th, 2025 |
| Phase One Deadline | August 15th, 2025 |
| Phase Two Deadline | September 15th, 2025 |
| Award Decision @ IROS 2025 | October 19th, 2025 |

The challenge is organized by the National University of Singapore, Nanyang Technological University, the Hong Kong University of Science and Technology, HKUST (Guangzhou), the University of Michigan ...
Hand-Rolling Large Models: KV Cache Principles and Code Walkthrough
自动驾驶之心· 2025-10-20 06:30
Core Insights - The article discusses the importance of KV Cache in enhancing the efficiency of large language models (LLMs) during autoregressive inference, particularly in the context of the Transformer architecture [1][20]. Group 1: Need for KV Cache - KV Cache is essential for storing intermediate computation results, which significantly improves the model's operational efficiency during text generation tasks [1][20]. - In standard Transformer decoding, each new token generation requires attention calculations that involve all previous tokens, leading to high computational complexity [2][6]. Group 2: Working Principle of KV Cache - The core idea of KV Cache is to cache the historical Key (K) and Value (V) matrices, thus avoiding redundant calculations and reducing time complexity from O(n²) to O(n) [4][7]. - The process involves calculating the new Query (Q) matrix and performing attention calculations with the cached K and V matrices, allowing for efficient token generation [4][10]. Group 3: Technical Details of KV Cache - KV Cache typically maintains independent caches for each attention head, with the cache structure dynamically growing until it reaches the model's maximum sequence length [11]. - While KV Cache improves speed, it requires additional memory, with models like GPT-3 consuming approximately 20KB of memory per token, leading to significant memory usage during batch processing [12]. Group 4: Optimization Strategies for KV Cache - Strategies such as Paged KV Cache, dynamic cache management, quantization, and selective caching are employed to enhance the efficiency of KV Cache while managing memory usage [22][18]. Group 5: Code Implementation - The article provides a code example demonstrating the implementation of KV Cache in self-attention mechanisms using PyTorch, highlighting the modifications needed to incorporate caching [14][17]. 
Group 6: Conclusion - Understanding the workings of KV Cache is crucial for optimizing inference performance in large models and addressing challenges in practical deployment [20].
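The caching mechanism summarized above can be sketched in a few lines of PyTorch. This is a minimal single-head illustration under assumed names (`KVCacheAttention`, `wq`/`wk`/`wv`), not the article's actual code listing: at each decoding step, only the newest token's Q/K/V are computed, and the cached K/V rows from earlier steps are reused instead of recomputed.

```python
import torch

class KVCacheAttention(torch.nn.Module):
    """Minimal single-head self-attention with a KV cache (illustrative sketch)."""

    def __init__(self, d_model):
        super().__init__()
        self.wq = torch.nn.Linear(d_model, d_model, bias=False)
        self.wk = torch.nn.Linear(d_model, d_model, bias=False)
        self.wv = torch.nn.Linear(d_model, d_model, bias=False)
        self.cache_k = None  # grows one row per generated token
        self.cache_v = None

    def forward(self, x_new):
        # x_new: (1, d_model), the hidden state of the single newest token
        q = self.wq(x_new)  # only the new token's Q/K/V are computed
        k = self.wk(x_new)
        v = self.wv(x_new)
        if self.cache_k is None:
            self.cache_k, self.cache_v = k, v
        else:
            # Append to the cache instead of recomputing K/V for the prefix
            self.cache_k = torch.cat([self.cache_k, k], dim=0)
            self.cache_v = torch.cat([self.cache_v, v], dim=0)
        scale = q.shape[-1] ** 0.5
        attn = torch.softmax(q @ self.cache_k.T / scale, dim=-1)  # (1, t)
        return attn @ self.cache_v  # (1, d_model)
```

Per step, the attention now touches t cached rows instead of rebuilding all t×t interactions, which is the O(n²)-to-O(n) per-token saving described above, paid for with the extra memory the article quantifies.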
Class Starts Today! A Tsinghua Team Leads a Structured Learning Path for Autonomous Driving VLA: Algorithms + Practice
自动驾驶之心· 2025-10-19 23:32
Core Viewpoint - The focus of academia and industry is shifting towards VLA (Visual Language Action), which provides human-like reasoning capabilities for more reliable and safer autonomous driving [1][4]. Summary by Sections Overview of Autonomous Driving VLA - Autonomous driving VLA can be categorized into modular VLA, integrated VLA, and reasoning-enhanced VLA [1]. - Traditional perception methods like BEV (Bird's Eye View) and lane detection are becoming mature, leading to decreased attention from both academia and industry [4]. Key Content of Autonomous Driving VLA - Core components of autonomous driving VLA include visual perception, large language models, action modeling, large model deployment, and dataset creation [7]. - Cutting-edge algorithms such as Chain-of-Thought (CoT), Mixture of Experts (MoE), Retrieval-Augmented Generation (RAG), and reinforcement learning are at the forefront of this field [7]. Course Structure - The course titled "Autonomous Driving VLA and Large Model Practical Course" includes detailed explanations of cutting-edge algorithms in the three subfields of autonomous driving VLA, along with practical assignments [8]. Chapter Summaries 1. **Introduction to VLA Algorithms** - This chapter provides a comprehensive overview of VLA algorithms, their concepts, and development history, along with open-source benchmarks and evaluation metrics [14]. 2. **Algorithm Fundamentals of VLA** - Focuses on foundational knowledge of Vision, Language, and Action modules, and includes a section on deploying and using popular large models [15]. 3. **VLM as an Autonomous Driving Interpreter** - Discusses the role of VLM (Visual Language Model) in scene understanding and covers classic and recent algorithms like DriveGPT4 and TS-VLM [16]. 4. **Modular & Integrated VLA** - Explores the evolution of language models from passive descriptions to active planning components, emphasizing the direct mapping from perception to control [17]. 5. 
**Reasoning-Enhanced VLA** - Focuses on the trend of integrating reasoning modules into autonomous driving models, highlighting the parallel output of control signals and natural language explanations [18]. 6. **Capstone Project** - Involves practical tasks starting from network construction, allowing participants to customize datasets and fine-tune models, emphasizing hands-on experience [21]. Learning Outcomes - The course aims to advance the understanding of autonomous driving VLA in both academic and industrial contexts, equipping participants with the ability to apply VLA concepts in real-world projects [23]. Course Schedule - The course is set to begin on October 20, with a duration of approximately two and a half months, featuring offline video lectures and online Q&A sessions [24]. Prerequisites - Participants are expected to have a foundational knowledge of autonomous driving, familiarity with transformer models, reinforcement learning, and basic mathematical concepts [25].
From 9 NeurIPS Papers, We Read Three Clear Directions in "3D Rendering & Reconstruction"
自动驾驶之心· 2025-10-19 23:32
Core Insights - The article discusses the advancements in 3D Rendering & Reconstruction, particularly focusing on dynamic scene reconstruction and the integration of generative and editable 3D assets. It highlights the shift from merely rendering to creating and manipulating 3D environments, emphasizing the importance of efficiency, stability, and usability in real-world applications [2][60]. Group 1: Dynamic Scene and Temporal Reconstruction - Research in dynamic scene reconstruction aims to not only rebuild static geometries but also to express, compress, and render changes over time, effectively creating a 4D representation [2][4]. - The ReCon-GS framework improves training efficiency by approximately 15%, reduces memory usage by half while maintaining the same visual quality, and enhances the stability and robustness of free-viewpoint video (FVV) synthesis [5][6]. - ProDyG introduces a closed-loop system for tracking, mapping, and rendering, achieving dynamic SLAM-level camera tracking and improved stability for long sequences [10][12]. Group 2: Structural Innovations in Gaussian Splatting - The research focuses on making 3D Gaussian Splatting (3DGS) deployable and maintainable, ensuring that large scenes do not exceed memory limits and can run on mobile devices [20][21]. - The LODGE framework enhances the usability of large-scale 3DGS rendering by integrating Level-of-Detail (LOD) techniques, resulting in lower latency and memory usage [23][24]. - The Gaussian Herding across Pens method achieves near-lossless quality while retaining only about 10% of the original Gaussian data, providing a mathematically grounded approach to global compression [28][29]. Group 3: Generative and Editable 3D - The focus of generative and editable 3D research is to not only recreate real-world scenes but also to generate new assets, allowing for component splitting, rigging, animation, and material modification [42][44]. 
- The PhysX-3D framework emphasizes the generation of 3D assets that are not only visually appealing but also functional for physical simulations and robotics applications [46][47]. - The PartCrafter model enables the generation of modular 3D meshes that can be easily edited and rearranged, improving the efficiency of asset creation [48][50]. Group 4: Current Trends and Future Directions - The current research trends indicate a clear direction towards making dynamic reconstruction more efficient and stable, refining Gaussian methods for practical deployment, and enhancing the capabilities of 3D asset generation and editing [60]. - The evaluation criteria for these technologies are evolving to include not just clarity or scores but also latency, bandwidth, energy consumption, stability, and editability, which are crucial for real-world applications [60].
A 4,000-Member Autonomous Driving Tech Community, and the Consulting It Provides Every Day......
自动驾驶之心· 2025-10-19 23:32
Core Insights - The article emphasizes the importance of making learning engaging and serving as a bridge between industries and educational institutions, particularly in the fields of AI and autonomous driving [1] Group 1: Community and Resources - The community has created a comprehensive platform for academic and industrial exchanges, providing access to cutting-edge content, industry insights, and job opportunities [2][12] - The platform has compiled over 40 technical routes and invited numerous industry experts to answer questions and provide guidance [2][15] - Members can access a variety of resources, including open-source projects, datasets, and learning paths tailored for different levels of expertise [15][30][32] Group 2: Learning Pathways - The community offers structured learning pathways for beginners, intermediate, and advanced learners in autonomous driving technologies [8][10][16] - Specific learning routes include areas such as perception, simulation, and planning control, catering to both academic and practical applications [15][34] - The platform also provides a detailed overview of the latest trends and technologies in autonomous driving, including VLA (Vision-Language-Action) and world models [42][38] Group 3: Networking and Collaboration - The community facilitates networking among members from prestigious universities and leading companies in the autonomous driving sector [15][26] - Regular live sessions and discussions with industry leaders are organized to enhance knowledge sharing and collaboration [79][80] - Members are encouraged to engage in discussions about career choices and research directions, fostering a supportive environment for professional growth [80][82]
Li Xiang: Tesla V14 Also Uses the Same Technology as VLA
自动驾驶之心· 2025-10-19 23:32
Editor | 理想TOP2. Reposted from | "Li Xiang: Tesla V14 also uses the same technology as VLA", a compressed text version of his October 18, 2025 Bilibili post. Compressed version: the video runs 21min24s, of which 10min51s is spent explaining his understanding of OpenAI's five-stage definition, with many analogies. He believes OpenAI has done an excellent job defining AI applications, models, and specifications. Chatbots: backed by base models, whose function is to compress humanity's known digital knowledge. Like a person studying through college graduation, building a knowledge foundation. Reasoners: equipped with chains of thought, able to carry out continuous thinking and tasks, trained mainly with SFT and RLHF. Like a person going to graduate school or learning under a mentor, gaining experience. Agents: AI truly starts working, using tools to complete long tasks. This places extremely high demands on the AI's professionalism and reliability (it must score 80-90 out of 100 to pass), like a person holding down a professional job. Innovators: to crack the agents' professionalism problem, reinforcement training is carried out by posing and solving problems. This requires world models and RLAIF (reinforcement learning from AI feedback) to simulate real-environment training ...
A Month of Intensive RL Practice and Reflection: How to Gain Points?
自动驾驶之心· 2025-10-19 23:32
Core Insights - The article discusses the recent advancements and challenges in Reinforcement Learning (RL) for Visual Language Models (VLM), emphasizing the importance of foundational work and iterative improvements in achieving performance gains [2][4]. RL Goals - The primary objectives for RL in VLM include achieving a 1-2 point increase in overall performance on SFT model versions and exceeding 1-2 points in specific benchmarks such as mathematics and instruction adherence [5]. RL Overall Approach - The essence of RL is to enhance sampling efficiency rather than enabling the base model to learn new knowledge. It is noted that the base model can outperform RL models in terms of correct response probability when given unlimited attempts [7][8]. Challenges in VLM RL - Key challenges include the selection of efficient RL algorithms, the need for high infrastructure requirements, and the sensitivity of RL to data quality and organization [10][12]. Data Organization - Effective data organization is crucial, requiring a balanced mix of tasks and high-quality input data. The output length is also significantly related to the RL algorithm used, necessitating careful consideration of training data characteristics [13][14]. Key Findings and Conclusions - Short responses negatively impact training effectiveness, and it is essential to construct pairs of responses with clear distinctions between acceptable and rejectable outputs. The importance of meticulous data checking and the absence of a "silver bullet" solution are emphasized [19][24].
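The pairing guidance in the findings above can be sketched as a small data-construction helper. The function name, field names, and length threshold below are assumptions for illustration, not the author's actual pipeline; they encode only the two stated findings: filter out overly short responses, and require a clear distinction between the accepted and rejected outputs.

```python
def make_preference_pair(prompt, chosen, rejected, min_len=20):
    """Build one preference pair for RL training, filtering degenerate cases.

    Returns a dict suitable for pairwise training, or None when the pair
    would hurt training (too-short response, or no distinction to learn).
    """
    if len(chosen) < min_len:
        # Short responses were found to harm training effectiveness
        return None
    if chosen.strip() == rejected.strip():
        # Without a clear distinction there is no learning signal
        return None
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}
```

In practice such checks would sit inside the data-organization stage described above, where task mix and input quality are balanced before any RL algorithm sees the data.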