自动驾驶之心
The essence of SFT is optimizing a lower bound of the RL objective...
自动驾驶之心· 2025-10-22 00:03
Core Insights
- The article establishes that, under sparse rewards, the training objective of Supervised Fine-Tuning (SFT) is a loose lower bound of the Reinforcement Learning (RL) objective, and introduces a bridge distribution to tighten this lower bound while maintaining training stability [1][9][23].

Group 1: Relationship Between SFT and RL
- The training objective of RL policy gradient algorithms is defined, and SFT and RL are linked through its derivation [4][3].
- SFT operates on a fixed set of labeled data, in contrast to RL's online sampling, which optimizes the policy model based on reward values [5][9].
- The article shows that SFT's optimization goal can be viewed as a lower bound of the RL objective, which explains why SFT training is effective at all [9][23].

Group 2: Importance Sampling and Adjustments
- Importance sampling is applied to move the RL training objective from online to offline sampling [6][11].
- A key finding is that the SFT lower bound may become looser as training progresses, necessitating adjustments to tighten it [9][11].
- An auxiliary distribution is introduced to adjust the SFT training objective, allowing a tighter lower bound while preserving training stability [11][12].

Group 3: Properties of iw-SFT
- The iw-SFT formulation incorporates a freely adjustable weight coefficient that allows the lower bound to be tightened [11][13].
- The choice of the auxiliary distribution is critical; it should stay close to the reference distribution to keep the bound tight while maintaining stability [13][14].
- Two methods for constraining the importance weights are proposed: clipping them, and smoothing them to reduce variance [14][15].
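The chain of reasoning in Groups 1 and 2 can be written out in two lines. This is a reconstruction in our own notation, not necessarily the article's: assume a sparse binary reward $r(x) \in \{0,1\}$ and a behavior distribution $\mu$ supported only on successful samples, i.e. $r(x) = 1$ wherever $\mu(x) > 0$.

```latex
% RL objective rewritten via importance sampling over the data distribution \mu:
J(\theta) \;=\; \mathbb{E}_{x \sim \pi_\theta}[\, r(x) \,]
          \;=\; \mathbb{E}_{x \sim \mu}\!\left[ \frac{\pi_\theta(x)}{\mu(x)}\, r(x) \right].
% Jensen's inequality (\log is concave) then gives the SFT-style lower bound:
\log J(\theta) \;\ge\; \mathbb{E}_{x \sim \mu}\!\left[ \log \frac{\pi_\theta(x)}{\mu(x)} \right]
              \;=\; \underbrace{\mathbb{E}_{x \sim \mu}[\log \pi_\theta(x)]}_{\text{SFT objective}}
                 \;+\; H(\mu).
```

Since $H(\mu)$ is constant in $\theta$, maximizing the SFT objective maximizes this lower bound; the Jensen gap widens as $\pi_\theta$ drifts away from $\mu$, which is why importance weights are introduced to tighten it.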
Group 4: Practical Implications
- The article illustrates the advantages of iw-SFT through a multi-armed bandit example, showing how it can exploit negative-sample information to improve policy convergence [18][19][20].
- The overall conclusion emphasizes the importance of understanding the relationship between SFT and RL, and how these adjustments can improve training outcomes [23].
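As a concrete illustration of Group 3's weighting-and-clipping scheme, here is a minimal per-sample loss sketch. This is our reconstruction under stated assumptions (the function name, the constant treatment of the weight, and the clip value are illustrative, not from the article): the sample's SFT log-likelihood term is scaled by the importance weight π_θ/π_ref, and the weight is clipped to bound its variance.

```python
import math

def iw_sft_loss(logp_theta, logp_ref, clip_max=2.0):
    """Importance-weighted SFT loss for one successful sample (a sketch).

    logp_theta: log-probability of the sample under the policy being trained.
    logp_ref:   log-probability under the auxiliary/reference distribution.
    """
    w = math.exp(logp_theta - logp_ref)  # importance weight pi_theta / pi_ref
    w = min(w, clip_max)                 # clipping: one of the two variance controls
    # The weight is treated as a constant scaling factor on the usual
    # negative-log-likelihood SFT term (no gradient flows through w).
    return -w * logp_theta

# Plain SFT is recovered as the special case w = 1 (auxiliary == policy):
assert iw_sft_loss(-1.0, -1.0) == 1.0
```

When the policy assigns a sample much higher probability than the reference, the raw weight explodes; the clip keeps the gradient bounded, mirroring the stability argument in Group 3.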
We are looking for partners in the autonomous driving field...
自动驾驶之心· 2025-10-22 00:03
Group 1
- The article announces the recruitment of 10 outstanding partners for the autonomous driving sector, focusing on course development, paper guidance, and hardware research [2].
- The main areas of expertise sought include large models, multimodal models, diffusion models, end-to-end systems, embodied interaction, joint prediction, SLAM, 3D object detection, world models, closed-loop simulation, and model deployment and quantization [3].
- Candidates are preferred from QS200 universities with a master's degree or higher, especially those with significant contributions to top conferences [4].

Group 2
- The compensation package includes resource sharing for job seeking, doctoral recommendations, and study-abroad opportunities, along with substantial cash incentives and collaboration on entrepreneurial projects [5].
- Interested parties are encouraged to add WeChat for consultation, specifying "organization/company + autonomous driving cooperation inquiry" [6].
Our embodied intelligence community has recently added many new modules~
自动驾驶之心· 2025-10-22 00:03
Core Viewpoint
- The article highlights the ongoing development of a community focused on embodied intelligence, emphasizing the addition of new modules and resources to support members in their projects and learning [1][14].

Group 1: Community Development
- The community has expanded its sections to include VLA, real2sim2real, mobile manipulation, world models, and domain adaptation, along with high-quality live broadcasts [1].
- It aims to create a closed-loop exchange spanning industry, academia, job seeking, and Q&A [1].

Group 2: Live Sharing and Technical Resources
- Continuous live sharing sessions, including roundtable forums, discuss the current state and challenges of the embodied intelligence industry [3].
- A comprehensive technical roadmap gives beginners a structured learning path [5].

Group 3: Industry and Project Solutions
- Valuable industry frameworks and project solutions are provided for members already engaged in related research [9].
- A job-referral mechanism with several embodied intelligence companies facilitates placements for members [11].

Group 4: Educational Resources and Networking
- The community offers a compilation of over 40 open-source projects and nearly 60 datasets related to embodied intelligence, along with mainstream simulation platforms and various learning routes [14].
- Members can access exclusive learning videos and documents, fostering a conducive learning environment and networking opportunities [19].

Group 5: Comprehensive Resource Compilation
- Resources span many aspects of embodied intelligence, including research reports, books, component manufacturers, and simulation platforms [22][25][27][37].
- Specific learning paths for embodied perception and interaction, as well as reinforcement learning, are outlined to assist members in their studies [43][45][59].
We offer the most professional platform and operations team! We are recruiting operations staff~
自动驾驶之心· 2025-10-21 00:06
Core Viewpoint
- The autonomous driving industry is evolving rapidly, with growing demand and an increasing number of business lines, indicating healthy development in the sector [1].

Group 1: Team Overview
- Over two years the team has built four key IPs: embodied intelligence, autonomous driving, 3D vision, and large-model technology, with a total audience of nearly 360,000 across platforms [1].

Group 2: Recruitment
- The company is hiring full-time and part-time operations and sales roles to support its expanding business [2].

Group 3: Job Responsibilities and Requirements
- The operations role covers managing course progress, enhancing platform engagement, and developing content on the autonomous driving and AI industries [4].
- The sales role involves creating promotional content for online and hardware products and liaising with hardware manufacturers and academic/enterprise clients [5][6].
- Candidates are expected to have strong execution skills, a relevant educational background, and familiarity with social media platforms [12].

Group 4: Growth Opportunities
- The company offers exposure to top-tier operational teams, with opportunities to learn operational techniques and sales strategies that can drive rapid personal growth [7].
- There are also opportunities for further academic pursuits, such as research and doctoral studies, which can enhance personal development [9].
Fei-Fei Li releases a world model that runs inference on a single GPU; can autonomous driving applications be far behind?
自动驾驶之心· 2025-10-21 00:06
Core Insights
- The article discusses the launch of RTFM (A Real-Time Frame Model) by Fei-Fei Li's team, which is capable of real-time operation, persistence, and 3D consistency, and can run on a single H100 GPU [3][5][15].

Group 1: Model Features
- RTFM operates efficiently, requiring only one H100 GPU to perform inference at interactive frame rates [5].
- The model is designed for scalability, expanding with increasing data and computational power without relying on explicit 3D representations [5][14].
- RTFM lets users interact indefinitely, with all scenes permanently retained, so the constructed 3D world does not disappear when the viewpoint changes [6].

Group 2: Computational Demands
- The computational demands of generative world modeling significantly exceed those of current large language models [10].
- Generating a 60 fps 4K interactive video stream requires over 100,000 tokens per second, and sustaining over an hour of continuous interaction could exceed 100 million tokens of context [11][12].
- The team believes that methods which scale elegantly with computational growth will dominate the AI field, benefiting from falling compute costs [14].

Group 3: Learning and Rendering
- RTFM takes a novel approach: a single neural network trained to generate 2D images from 2D inputs, without constructing explicit 3D representations [17][19].
- The model blurs the line between "reconstruction" and "generation," learning complex effects such as reflections and shadows end-to-end from data [21].
- RTFM employs a spatial memory structure, using posed frames to maintain persistence and context during interactions [26][27].

Group 4: Availability
- RTFM is now available in a preview version for users to try and provide feedback [28].
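The throughput figures in Group 2 are internally consistent, as a quick back-of-envelope check shows. The tokens-per-frame value below is a hypothetical number chosen to match the article's ~100k tokens/sec claim, not a published RTFM parameter:

```python
# Back-of-envelope check of the Group 2 throughput figures.
fps = 60                      # interactive frame rate
tokens_per_frame = 1_700      # hypothetical; chosen to match ~100k tokens/sec
tokens_per_second = fps * tokens_per_frame
context_after_one_hour = tokens_per_second * 3600

print(tokens_per_second)        # 102,000 tokens/sec, above the 100k figure
print(context_after_one_hour)   # 367,200,000 tokens, far above 100 million
```

At any plausible tokens-per-frame budget, an hour of interaction accumulates hundreds of millions of tokens of context, which is why the article stresses that compute demands dwarf those of today's LLMs.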
Experience sharing: switching careers into several major autonomous driving companies
自动驾驶之心· 2025-10-21 00:06
Core Insights
- The article emphasizes the importance of seizing opportunities and continuous learning in the rapidly evolving field of autonomous driving, as illustrated by the experiences of a professional who transitioned from banking to the autonomous driving industry [1][2].

Group 1: Career Development in Autonomous Driving
- The transition from a traditional banking career to autonomous driving was made possible by the industry's growing demand for talent, particularly in 2020 [1].
- The individual started in algorithm evaluation and gradually moved into more advanced perception and safety-algorithm roles, underscoring the value of building foundational skills and adapting to industry trends [1].

Group 2: Community and Learning Resources
- The "Autonomous Driving Heart Knowledge Planet" community has over 4,000 members and aims to grow to nearly 10,000 in the next two years, providing a platform for knowledge sharing and technical discussion [4][5].
- The community offers a comprehensive learning environment, including video content, written materials, learning pathways, and job-exchange opportunities, catering to both beginners and advanced learners [7][11].

Group 3: Technical Learning and Support
- The community has organized resources covering over 40 technical pathways in autonomous driving, addressing topics such as end-to-end learning, multimodal models, and data-annotation practices [19][21].
- Members can access practical guidance on entering the field, including specific learning routes for different aspects of autonomous driving technology [8][13].

Group 4: Industry Engagement and Networking
- The community collaborates with industry leaders and academic experts to provide insight into the latest trends and challenges in autonomous driving, fostering a network for professional growth [9][18].
- Members are encouraged to engage with industry professionals for job referrals and to stay current on academic advances and industrial applications [21][23].
World Models Made Simple | A Survey of the VQ Family of Papers (VQ-VAE/VQ-GAN/RQ-VAE, etc.)
自动驾驶之心· 2025-10-21 00:06
Editor | 自动驾驶之心
Author | 论文推土机

We invited Zhihu author @论文推土机 to organize the VQ-family papers in the world-model technology stack and share them with everyone!

Why discretize, and what happens when autoregression is applied directly at the pixel level:

- The pixel-level AR dilemma: doing autoregression directly in pixel space requires far too many steps (a 256×256 image needs roughly 200,000 steps), making it impractical.
- The mainstream "compress first, then generate" approach and its hidden cost: image tokenizers such as VQ-VAE, VQ-GAN, and FSQ generate on a 32×32 or 16×16 grid and then decode back to pixels; but this is strong compression and introduces information loss (see the SEED reconstruction visualizations: the semantics are right, but the details drift).
- An information-theoretic lower bound: estimating from the average entropy of ImageNet-64, a vocabulary of size V carries log2(V) bits per token. To "losslessly" carry an image's information in a sequence of length L = 32×32 or 16×16, the vocabulary would have to grow to absurd sizes (the exact figures are elided in the original), far beyond what existing codebooks can offer; strong compression is necessarily lossy.

Still, the biggest problem with operating directly in pixel space remains that the sequence is too long and generation is too slow. In most application scenarios, images ...
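The information-theoretic estimate above can be made concrete with back-of-envelope arithmetic. Assuming a representative entropy of about 3.5 bits per sub-pixel for ImageNet-64 (a hypothetical stand-in for the figures elided in the original), the required vocabulary size V for a "lossless" code of length L follows from L · log2(V) ≥ bits per image:

```python
# Hypothetical entropy estimate (stand-in; the article's exact figures are elided).
bits_per_dim = 3.5
dims = 64 * 64 * 3                    # ImageNet-64 sub-pixels
bits_per_image = bits_per_dim * dims  # 43,008 bits to carry losslessly

for grid in (32, 16):
    L = grid * grid                   # token-sequence length
    bits_per_token = bits_per_image / L
    # the vocabulary must satisfy log2(V) >= bits_per_token
    print(f"{grid}x{grid}: need log2(V) >= {bits_per_token:.0f}, "
          f"i.e. V >= 2**{bits_per_token:.0f}")
```

Even at L = 32×32 this demands log2(V) ≈ 42, i.e. a vocabulary of trillions of entries, and at 16×16 it is astronomically worse, which is the sense in which strong compression must be lossy.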
See you in Hangzhou! 具身智能之心 sponsors IROS for the first time, with awards presented on site
自动驾驶之心· 2025-10-20 06:30
As robotic systems continue to move into the real world, the stability, robustness, and generalization of perception systems are becoming key factors limiting deployment. Under complex conditions such as dynamic crowds, adverse weather, sensor failures, and cross-platform deployment, traditional perception algorithms often suffer severe performance degradation.

The RoboSense Challenge 2025 was created to address this. The challenge aims to systematically evaluate robots' perception and understanding in real-world scenarios, advance research on the robustness of multimodal perception models, and encourage innovation in cross-modal fusion and task generalization.

| Important Dates | |
| --- | --- |
| Registration | From June 2025 |
| Competition Server Online | June 15th, 2025 |
| Phase One Deadline | August 15th, 2025 |
| Phase Two Deadline | September 15th, 2025 |
| Award Decision @ IROS 2025 | October 19th, 2025 |

The challenge is organized by the National University of Singapore, Nanyang Technological University, the Hong Kong University of Science and Technology, HKUST (Guangzhou), the University of Michigan ...
Hands-On with Large Models: KV Cache Principles and Code Walkthrough
自动驾驶之心· 2025-10-20 06:30
Core Insights
- The article discusses the importance of the KV Cache in improving the efficiency of large language models (LLMs) during autoregressive inference, particularly within the Transformer architecture [1][20].

Group 1: Need for KV Cache
- The KV Cache stores intermediate computation results, significantly improving operational efficiency during text generation [1][20].
- In standard Transformer decoding, each new token requires attention over all previous tokens, leading to high computational complexity [2][6].

Group 2: Working Principle of KV Cache
- The core idea is to cache the historical Key (K) and Value (V) matrices, avoiding redundant computation and reducing the per-token time complexity from O(n²) to O(n) [4][7].
- At each step, only the new Query (Q) is computed and attended against the cached K and V matrices, enabling efficient token generation [4][10].

Group 3: Technical Details of KV Cache
- The KV Cache typically maintains an independent cache per attention head, with the cache growing dynamically until it reaches the model's maximum sequence length [11].
- While the KV Cache improves speed, it costs additional memory; for a model like GPT-3 the article estimates roughly 20 KB per token, which adds up to significant usage during batch processing [12].

Group 4: Optimization Strategies for KV Cache
- Strategies such as paged KV caches, dynamic cache management, quantization, and selective caching trade memory usage against efficiency [22][18].

Group 5: Code Implementation
- The article provides a PyTorch code example showing the modifications needed to add KV caching to a self-attention mechanism [14][17].

Group 6: Conclusion
- Understanding how the KV Cache works is crucial for optimizing inference performance in large models and for addressing practical deployment challenges [20].
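The caching idea in Groups 2 and 3 can be sketched in a few lines. Below is a minimal single-head NumPy version (the article's own example uses PyTorch; the class and variable names here are ours): at each decoding step, only the newest token's K/V projections are appended to the cache, so attention costs O(t) per step instead of recomputing the whole history.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class CachedSelfAttention:
    """Single-head self-attention that caches K/V across decoding steps."""

    def __init__(self, d_model, seed=0):
        rng = np.random.default_rng(seed)
        self.Wq = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
        self.Wk = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
        self.Wv = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
        self.k_cache = []  # one (d_model,) vector per past token
        self.v_cache = []

    def step(self, x):
        # x: (d_model,) embedding of the newest token only.
        q = x @ self.Wq
        self.k_cache.append(x @ self.Wk)  # append instead of recomputing history
        self.v_cache.append(x @ self.Wv)
        K = np.stack(self.k_cache)        # (t, d_model)
        V = np.stack(self.v_cache)
        attn = softmax(q @ K.T / np.sqrt(len(q)))  # O(t) work per step
        return attn @ V
```

A full multi-head implementation would keep one such cache per head (Group 3) and preallocate it up to the maximum sequence length; paged variants (Group 4) instead manage the cache in fixed-size blocks.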
Class starts today! A Tsinghua-led team walks through an autonomous driving VLA learning path: algorithms + practice
自动驾驶之心· 2025-10-19 23:32
Core Viewpoint
- The focus of academia and industry is shifting toward VLA (Vision-Language-Action), which provides human-like reasoning capabilities for more reliable and safer autonomous driving [1][4].

Summary by Sections

Overview of Autonomous Driving VLA
- Autonomous driving VLA can be categorized into modular VLA, integrated VLA, and reasoning-enhanced VLA [1].
- Traditional perception methods such as BEV (Bird's Eye View) and lane detection are maturing, and attention from academia and industry is shifting away from them [4].

Key Content of Autonomous Driving VLA
- Core components include visual perception, large language models, action modeling, large-model deployment, and dataset creation [7].
- Cutting-edge algorithms such as Chain-of-Thought (CoT), Mixture of Experts (MoE), Retrieval-Augmented Generation (RAG), and reinforcement learning are at the forefront of this field [7].

Course Structure
- The course, "Autonomous Driving VLA and Large Model Practical Course," includes detailed explanations of cutting-edge algorithms in the three subfields of autonomous driving VLA, along with practical assignments [8].

Chapter Summaries
1. **Introduction to VLA Algorithms** - A comprehensive overview of VLA algorithms, their concepts, and development history, along with open-source benchmarks and evaluation metrics [14].
2. **Algorithm Fundamentals of VLA** - Foundational knowledge of the Vision, Language, and Action modules, plus a section on deploying and using popular large models [15].
3. **VLM as an Autonomous Driving Interpreter** - The role of the VLM (Vision-Language Model) in scene understanding, covering classic and recent algorithms such as DriveGPT4 and TS-VLM [16].
4. **Modular & Integrated VLA** - The evolution of language models from passive description to active planning components, emphasizing the direct mapping from perception to control [17].
5. **Reasoning-Enhanced VLA** - The trend of integrating reasoning modules into autonomous driving models, highlighting parallel output of control signals and natural-language explanations [18].
6. **Capstone Project** - Practical tasks starting from network construction, allowing participants to customize datasets and fine-tune models, with an emphasis on hands-on experience [21].

Learning Outcomes
- The course aims to advance understanding of autonomous driving VLA in both academic and industrial contexts, equipping participants to apply VLA concepts in real-world projects [23].

Course Schedule
- The course begins on October 20 and runs for roughly two and a half months, featuring offline video lectures and online Q&A sessions [24].

Prerequisites
- Participants are expected to have foundational knowledge of autonomous driving, familiarity with transformer models and reinforcement learning, and basic mathematical concepts [25].