自动驾驶之心
From SAM1 to SAM3: What Has Meta Done?
自动驾驶之心· 2025-12-06 03:04
Core Insights
- Meta has made significant advances in AI, particularly in visual models, with the release of SAM1, SAM2, and SAM3 marking a new era in computer vision technology [1][25].

Summary by Sections

SAM1 to SAM3 Evolution
- SAM1 introduced Promptable Visual Segmentation (PVS), allowing image segmentation through simple prompts such as clicks or semantic hints [1].
- SAM2 optimized the architecture for better video segmentation and dynamic-scene support, improving stability and accuracy [3].
- SAM3 achieved unprecedented accuracy with multi-modal support, enabling segmentation through voice, text, and images, and introduced Promptable Concept Segmentation (PCS) for complex object recognition [3][4].

Technical Specifications
- SAM1 had a smaller model size suited to real-time inference; SAM2 improved efficiency; SAM3 expanded computational capability for complex tasks [4].
- SAM3 supports real-time video and image segmentation with multi-object tracking, showcasing its advanced capabilities [4].
- The model allows long-context semantic reasoning, improving video scene analysis [4].

Concept Segmentation
- SAM3 can identify and segment objects based on user-defined concepts, such as "striped cat," demonstrating its flexibility and precision [7][11].
- The model uses positive and negative examples to refine segmentation results, improving accuracy [10].

Performance Metrics
- SAM3 outperformed previous models across segmentation tasks, achieving high scores on datasets such as LVIS and COCO [21][23].
- Its zero-shot performance was notable, handling tasks effectively without extensive task-specific training data [29].

Multi-modal Capabilities
- SAM3's integration with MLLMs (multimodal large language models) allows complex text queries, strengthening its object segmentation [21][29].
- Combining text and image inputs significantly improves segmentation outcomes, showcasing the model's strength in multi-modal tasks [23].

Conclusion
- The advances from SAM1 to SAM3 reflect Meta's strategic push in visual AI, reshaping everyday applications including autonomous driving and video surveillance [25][26].
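The positive/negative-example refinement described above can be pictured as a simple embedding-space filter: keep only candidate masks whose embedding sits closer to the average "positive" exemplar than to the "negative" one. This is an illustrative toy, not SAM3's actual mechanism; the function name and the prototype-averaging scheme are assumptions of this sketch.

```python
import numpy as np

def refine_with_exemplars(candidate_embs, positive_embs, negative_embs, margin=0.0):
    """Keep candidate masks whose embedding is closer (by cosine similarity)
    to the positive prototype than to the negative prototype."""
    def unit(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    pos_proto = unit(np.mean(positive_embs, axis=0, keepdims=True))  # (1, D)
    neg_proto = unit(np.mean(negative_embs, axis=0, keepdims=True))  # (1, D)
    c = unit(candidate_embs)                                         # (N, D)

    pos_sim = (c @ pos_proto.T).ravel()
    neg_sim = (c @ neg_proto.T).ravel()
    return np.where(pos_sim - neg_sim > margin)[0]  # indices of kept candidates
```

In a real PCS-style system the embeddings would come from the model's image/text encoders; here they are arbitrary vectors used to show the filtering logic.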
NVIDIA's 2025 Technology Compendium: Frighteningly Strong
自动驾驶之心· 2025-12-06 03:04
Core Viewpoint
- NVIDIA has emerged as a leading player in AI infrastructure, reaching a market valuation of $5 trillion, an 11-fold increase over three years. The company has transitioned from a graphics chip manufacturer to a key player in AI, particularly in autonomous driving and embodied intelligence [2].

Group 1: NVIDIA's Technological Developments
- The Cosmos series, initiated in January, focuses on world foundation models, leading to Cosmos-Transfer1, Cosmos-Reason1, and Cosmos-Predict2.5, which lay the groundwork for autonomous driving and embodied intelligence [5].
- The Nemotron series aims to create a "digital brain" for the agent-based AI era, providing open, efficient, and precise models and tools for enterprises to build specialized AI systems [5].
- The embodied-intelligence initiatives include GR00T N1 and Isaac Lab, covering simulation platforms and embodied VLA (Vision-Language-Action) models [5].

Group 2: Key Papers and Contributions
- "Isaac Lab" presents a GPU-accelerated simulation framework for multi-modal robot learning, addressing data scarcity and the simulation-to-reality gap [6].
- "Nemotron Nano V2 VL" introduces a 12-billion-parameter vision-language model that achieves state-of-the-art performance in document understanding and long-video reasoning tasks [12].
- "Alpamayo-R1" proposes a vision-language-action model that integrates causal reasoning and trajectory planning to improve safety and decision-making in autonomous driving [13].

Group 3: Innovations in AI Models
- "Cosmos-Predict2.5" introduces a next-generation physical-AI video world foundation model that unifies text, image, and video generation, significantly improving video quality and consistency [17].
- "Cosmos-Reason1" aims to endow multi-modal language models with physical common sense and embodied reasoning, enhancing their interaction with the physical world [32].
- "GR00T N1" is an open foundation model for generalist humanoid robots, using a dual-system architecture for efficient visual-language understanding and real-time action generation [35].
UISEE Is Hiring Environmental Perception Algorithm Engineers (Direct Referral Available)
自动驾驶之心· 2025-12-06 03:04
Core Viewpoint
- The article emphasizes the critical importance of environmental perception algorithms for autonomous-driving safety, highlighting the need for skilled professionals in this field [5].

Group 1: Job Responsibilities
- Accurately detect and localize all objects in the surrounding environment, such as roads, pedestrians, vehicles, and bicycles, to ensure safe driving [5].
- Handle data processing and multi-sensor data fusion for autonomous driving, delivering complex perception functions such as multi-target tracking and fine-grained semantic understanding [5].

Group 2: Job Requirements
- A solid mathematical foundation, particularly in geometry and statistics [5].
- Strong knowledge of machine learning and deep learning, with practical experience in cutting-edge technologies [5].
- Experience with algorithms for scene segmentation, object detection, recognition, tracking, and BEV perception based on vision or LiDAR [5].
- Strong engineering skills: proficiency in C/C++ and Python, plus familiarity with at least one other common programming language [5].
- Understanding of 3D imaging principles and methods such as stereo, structured light, and ToF [5].
- A deep understanding of computer architecture, in order to develop high-performance, real-time software [5].
- A passion for innovation and a commitment to building technology that solves real-world problems [5].
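As a refresher on the 3D-imaging fundamentals the posting asks about, a rectified stereo pair recovers depth by triangulation: Z = f * B / d, where f is focal length in pixels, B the camera baseline, and d the disparity. A minimal sketch (the function name and units are illustrative, not from the posting):

```python
def stereo_depth(focal_px: float, baseline_m: float, disparity_px: float) -> float:
    """Depth of a point from a rectified stereo pair: Z = f * B / d.
    focal_px:     focal length in pixels
    baseline_m:   horizontal separation of the two cameras, in meters
    disparity_px: horizontal pixel shift of the same point between the views
    """
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a finite depth")
    return focal_px * baseline_m / disparity_px
```

For example, with a 700 px focal length, a 12 cm baseline, and a 42 px disparity, the point lies 2 m away; note how depth resolution degrades as disparity shrinks for distant objects.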
Seeking Autonomous Driving Enthusiasts Everywhere (Product / Deployment / World Models, etc.)
自动驾驶之心· 2025-12-06 03:04
Core Viewpoint
- The article emphasizes the need for collaboration and innovation in the autonomous driving industry, highlighting the importance of engaging more talented individuals to address the sector's challenges and pain points [2].

Group 1: Industry Direction
- The main focus areas include, but are not limited to: product management, 4D annotation/data loop, world models, VLA, large models for autonomous driving, reinforcement learning, and end-to-end systems [4].

Group 2: Job Description
- The positions primarily target training collaborations in the autonomous driving sector, serving both B-end (enterprises, universities, research institutes) and C-end (students, job seekers) audiences through course development and original content creation [5].

Group 3: Contact Information
- For inquiries about compensation and collaboration, interested parties are encouraged to add the WeChat contact provided for further communication [6].
Bosch's Latest 41-Page Survey of Trajectory Planning for Autonomous Driving
自动驾驶之心· 2025-12-05 00:03
Core Insights
- The article surveys the advances and applications of foundation models (FMs) in trajectory planning for autonomous driving, highlighting their potential to improve understanding and decision-making in complex driving scenarios [4][5][11].

Background Overview
- Foundation models are large-scale models that learn representations from vast amounts of data and transfer to many downstream tasks, including language and vision [4].
- The study emphasizes the importance of FMs in autonomous driving, particularly in trajectory planning, which it regards as the core task of driving [8][11].

Research Contributions
- A classification system for methods applying FMs to autonomous-driving trajectory planning is proposed, analyzing 37 existing methods to provide a structured view of the field [11][12].
- The survey evaluates these methods on code and data openness, offering practical references for reproducibility and reusability [12].

Methodological Insights
- Methods are divided into two main types: FMs customized for trajectory planning and FMs that guide trajectory planning [16][19].
- Customized FMs adapt pre-trained models to specific driving tasks, while guiding FMs improve existing trajectory-planning models through knowledge transfer [19][20].

Application of Foundation Models
- FMs can strengthen trajectory planning through several approaches, including fine-tuning existing models, chain-of-thought reasoning, and language-action interaction [9][19].
- The study identifies 22 methods focused on customizing FMs for trajectory planning, detailing their functionality and the importance of prompt design for model performance [20][32].

Challenges and Future Directions
- Key challenges for deploying FMs in autonomous driving include inference cost, model size, and the need for suitable fine-tuning datasets [5][12].
- Future research directions include the efficiency, robustness, and simulation-to-real-world transferability of models [12][14].

Comparative Analysis
- The study contrasts its findings with existing literature, noting that while previous reviews cover many aspects of autonomous driving, this one focuses specifically on FMs for trajectory planning [13][14].

Data and Model Design
- The article stresses the importance of data curation for training FMs, emphasizing structured datasets that pair sensor data with trajectories [24][28].
- It also covers model-design strategies, including reusing existing vision-language models and combining visual encoders with large language models [27][29].

Language and Action Interaction
- The survey examines models with language-interaction capabilities, detailing how they use visual question-answering datasets to improve driving performance [38][39].
- It emphasizes the role of training datasets and evaluation metrics in assessing language interaction for trajectory planning [39][41].
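The sensor-data/trajectory pairing discussed under data curation can be sketched as a small container type plus a serializer that turns one sample into a text prompt for a language-based planner. All field and function names here are illustrative assumptions, not structures defined by the survey:

```python
from dataclasses import dataclass

@dataclass
class DrivingSample:
    """One curated training pair: sensor context plus the expert trajectory.
    Field names are illustrative, not taken from the survey."""
    camera_paths: list       # file paths to multi-view camera images
    ego_history: list        # past (x, y) waypoints in the ego frame
    scene_caption: str       # optional language description (VQA-style data)
    future_trajectory: list  # expert (x, y) waypoints the model should predict

def to_prompt(sample: DrivingSample) -> str:
    """Flatten a sample into a text prompt, as a language-based planner might consume."""
    hist = "; ".join(f"({x:.1f}, {y:.1f})" for x, y in sample.ego_history)
    return (f"Scene: {sample.scene_caption}\n"
            f"Past waypoints: {hist}\n"
            f"Predict the next waypoints.")
```

The same record can feed both families of methods from the survey: a customized FM would be fine-tuned on such pairs, while a guiding FM would consume only the prompt side to steer an existing planner.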
Autonomous Driving Perception in the End-to-End Era
自动驾驶之心· 2025-12-05 00:03
Core Insights
- The article discusses the resurgence of end-to-end (E2E) perception in the autonomous driving industry, highlighting its impact on the field and the shift from traditional modular approaches to more integrated solutions [4][5][9].

Group 1: End-to-End Revival
- End-to-end is not a new idea; early attempts hoped to output trajectories directly from camera images with neural networks, but stability and safety were problems [9].
- The traditional localization-perception-planning-control architecture has been mainstream, but advances in BEV perception and Transformer architectures have revived end-to-end methods [9].
- Companies are now exploring various one-stage and two-stage solutions, with a focus on neural-network-based planning modules [9].

Group 2: Perception Benefits in End-to-End
- In traditional frameworks, perception aimed to gather as much accurate scene information as possible for planning, but the modular design limited how well it could serve planning's actual needs [11].
- Current mainstream end-to-end solutions continue this approach, treating the various perception tasks as auxiliary losses [13].
- The key advantage of end-to-end is the shift from exhaustive perception to "planning-oriented" perception, enabling a more efficient, demand-driven approach [14][15].

Group 3: Navigation-Guided Perception
- The article introduces a navigation-guided perception model: perception should be steered by navigation information, much as human drivers attend to the scene elements relevant to their driving intent [16][18].
- A Scene Token Learner (STL) module is proposed to efficiently extract scene features from BEV representations, integrating navigation information to enhance perception [18][19].
- The SSR framework demonstrates that only 16 self-supervised queries can represent the perception information needed for planning, greatly reducing complexity compared with traditional methods [22].

Group 4: World Models and Implicit Supervision
- The article discusses the potential of world models to replace traditional perception tasks by providing implicit supervision for scene representation [23][21].
- The SSR framework enhances scene understanding through self-supervised learning, predicting future BEV features to improve scene-query comprehension [20][21].
- The design allows efficient trajectory planning while maintaining consistency for model convergence during training [20].

Group 5: Performance Metrics
- SSR outperforms various state-of-the-art (SOTA) methods in both efficiency and performance, with significant improvements in metrics such as L2 distance and collision rate [24].
- Its design reduces the number of queries needed for effective scene representation, showcasing its scalability and efficiency [22][24].
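The idea that a handful of learnable queries can summarize a large BEV feature map (as with SSR's 16 scene queries) can be sketched as single-head cross-attention: each query soft-selects the BEV cells it needs. This is a toy numpy stand-in for the mechanism, not the SSR implementation:

```python
import numpy as np

def compress_bev(bev_feats, queries):
    """Single-head cross-attention: a few learnable queries attend over a
    flattened BEV feature map and return one summary token per query."""
    d = queries.shape[-1]
    scores = queries @ bev_feats.T / np.sqrt(d)    # (Q, H*W) attention logits
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)       # softmax over BEV cells
    return attn @ bev_feats                        # (Q, D) scene tokens

rng = np.random.default_rng(0)
bev = rng.standard_normal((200 * 200, 32))         # flattened 200x200 BEV grid
scene_queries = rng.standard_normal((16, 32))      # 16 queries, as in SSR
scene_tokens = compress_bev(bev, scene_queries)
```

The point of the sketch: 40,000 BEV cells collapse into 16 tokens, which is why a planner downstream sees a far smaller, demand-driven representation.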
A Year of Working on Autonomous Driving VLA
自动驾驶之心· 2025-12-05 00:03
I recently chatted with a guest expert from our community about his takeaways from a year of working on VLA, and I'm sharing them here.

An autonomous vehicle equipped with a VLA model can not only drive itself, but also explain why it drives the way it does, and even understand human instructions. "Explaining why it drives this way" is the text-language part: the model generates human-readable text describing the current scene. "Understanding human instructions" involves verbal language and action in language. For example, when a driver says "It's a bit narrow ahead, please pass carefully," the first half is verbal language: the human provides a scene description that the model must understand. The second half is an action instruction from the human, helping the model generate a correct and effective action. Finally, the model decodes a reasonable trajectory, i.e., an action, which is the final output action.

Why did VLA emerge? Looking back at the history of autonomous driving, its development can be divided into ...
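One common way a VLA model "decodes" an action is to discretize continuous waypoints into tokens that a language model can emit, then map them back to coordinates. Below is a generic sketch of such action tokenization, assuming a ±10 m coordinate range and 256 bins; this is a widely used pattern, not the specific model the guest described:

```python
def quantize_waypoint(x, lo=-10.0, hi=10.0, bins=256):
    """Map a continuous coordinate to a discrete action-token id in [0, bins-1]."""
    x = min(max(x, lo), hi)                       # clamp to the supported range
    return round((x - lo) / (hi - lo) * (bins - 1))

def dequantize_token(t, lo=-10.0, hi=10.0, bins=256):
    """Invert the quantization, recovering an approximate coordinate."""
    return lo + t / (bins - 1) * (hi - lo)

# Encode a 2-waypoint trajectory as a flat token sequence, then decode it back.
trajectory = [(1.2, 3.4), (2.5, 5.0)]
tokens = [quantize_waypoint(c) for wp in trajectory for c in wp]
decoded = [(dequantize_token(tokens[i]), dequantize_token(tokens[i + 1]))
           for i in range(0, len(tokens), 2)]
```

The reconstruction error is bounded by half a bin width (about 4 cm under these assumed settings), which is why bin count and coordinate range are real design choices in such systems.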
Getting Hands-On with Autonomous Driving: The Full-Stack "Black Warrior 001" Delivers Outstanding Value!
自动驾驶之心· 2025-12-05 00:03
Core Viewpoint
- The article introduces the "Black Warrior 001," a cost-effective, easy-to-use educational platform for autonomous driving research and teaching, priced at 36,999 yuan and bundled with courses and practical applications for students and researchers [3][5].

Group 1: Product Overview
- The Black Warrior 001 is a lightweight solution supporting perception, localization, fusion, navigation, and planning, making it suitable for undergraduate learning, graduate research, and training institutions [5].
- The product is designed to be user-friendly, letting beginners quickly get hands-on with autonomous-driving systems [3][5].

Group 2: Performance Demonstration
- The platform has been tested in indoor, outdoor, and parking scenarios, showcasing its capabilities in perception, localization, fusion, navigation, and planning [7].
- Specific tests include outdoor park driving, point-cloud 3D target detection, indoor 2D and 3D laser mapping, and outdoor night driving [9][11][13][15][21].

Group 3: Hardware Specifications
- Key sensors include a Mid-360 3D LiDAR, a 2D LiDAR, a depth camera from Orbbec, and an NVIDIA Orin NX main control chip with 16GB RAM [23][24].
- The vehicle weighs 30 kg, has a battery power of 50W, operates at 24V, and has a maximum speed of 2 m/s [26][27].

Group 4: Software and Functionality
- The software framework includes ROS, C++, and Python, supports one-click startup, and provides a development environment [29].
- The platform supports 2D and 3D SLAM, target detection, navigation, and obstacle avoidance [30].

Group 5: After-Sales and Maintenance
- The company offers one year of after-sales support for non-human damage, with free repairs for damage caused by user error during the warranty period [53].
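The obstacle-avoidance capability mentioned above can be illustrated with a toy heuristic over a 2D laser scan: if the nearest return falls inside a danger radius, steer away from it; otherwise go straight. This is a teaching sketch under assumed scan conventions (forward-facing 180° scan, 1° resolution), not the vendor's actual navigation stack:

```python
import math

def steer_from_scan(ranges, angle_min=-math.pi / 2, angle_inc=math.pi / 180,
                    danger_m=0.8):
    """Pick a steering angle (radians) from a 2D laser scan.
    ranges: distances per beam, starting at angle_min with step angle_inc.
    Returns 0.0 when the path is clear, otherwise an angle turning away
    from the nearest obstacle. A toy heuristic for illustration."""
    nearest_i = min(range(len(ranges)), key=lambda i: ranges[i])
    if ranges[nearest_i] >= danger_m:
        return 0.0                                  # path clear: no correction
    obstacle_angle = angle_min + nearest_i * angle_inc
    return -obstacle_angle                          # turn away from the obstacle
```

Real stacks (e.g. ROS navigation) use costmaps and local planners rather than a single nearest-beam rule, but the sketch shows the basic scan-to-command mapping a beginner implements first on a platform like this.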
Five Years On, Transformers v5 Has Finally Arrived
自动驾驶之心· 2025-12-04 03:03
Core Insights
- The article covers the release of Transformers v5.0.0rc0, a major evolution of the AI infrastructure library after a five-year development cycle from v4 to v5 [3].
- The update highlights the library's growth: daily downloads have risen from 20,000 to over 3 million, and total installations have surpassed 1.2 billion since the v4 release in November 2020 [3].
- The new version focuses on four dimensions: simplicity, the transition from fine-tuning to pre-training, interoperability with high-performance inference engines, and making quantization a core feature [3].

Simplification
- The team's primary focus is simplicity, aiming for clean, clear model integrations that improve standardization, versatility, and community support [5][6].
- The library has adopted a modular design, easing maintenance and speeding integration while encouraging collaboration within the community [10].

Model Updates
- Transformers serves as a toolbox of model architectures, with the goal of including all the latest models and becoming the trusted source for model definitions [7].
- Over the past five years, an average of 1-3 new models has been added weekly [8].

Model Conversion Tools
- Hugging Face is developing tools that identify similarities between new models and existing architectures, aiming to automate conversion into the Transformers format [13][14].

Training Enhancements
- v5 emphasizes support for pre-training, with redesigned model initialization and broader compatibility with optimization operators [20].
- Hugging Face continues to collaborate with fine-tuning tools in the Python ecosystem and is ensuring compatibility with tools in the JAX ecosystem [21].

Inference Improvements
- Inference is a key optimization area in v5, with dedicated kernels, cleaner default settings, new APIs, and improved support for inference engines [22][25].
- The goal is not to replace specialized inference engines but to achieve compatibility with them [25].

Local Deployment
- The team collaborates with popular inference engines so that models added to Transformers are immediately available and can leverage those engines' strengths [27].
- Hugging Face is also working on local inference, allowing models to run directly on devices, with expanding support for multimodal models [28].

Quantization
- Quantization is becoming a standard in modern model development, with many state-of-the-art models released in low-precision formats such as 8-bit and 4-bit [29].
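The low-precision formats mentioned above boil down to mapping floating-point weights onto a small integer grid plus a scale factor. A minimal sketch of symmetric per-tensor int8 quantization (this illustrates the general technique, not the Transformers v5 API):

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor int8 quantization: w ~= scale * q.
    Assumes w contains at least one nonzero value."""
    scale = np.abs(w).max() / 127.0                # map the largest |weight| to 127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from int8 codes."""
    return q.astype(np.float32) * scale
```

Production schemes (per-channel scales, 4-bit grouping, activation quantization) refine this idea, but the store-integers-plus-scale structure is the same, which is why 8-bit roughly quarters the memory of fp32 weights.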
Latest from Hongyang Li's Team! SimScale: An End-to-End Simulation Framework That Markedly Improves Hard Scenarios, New SOTA on NavSim
自动驾驶之心· 2025-12-04 03:03
Core Viewpoint
- The article discusses the limitations of current data-scaling methods in autonomous driving and introduces SimScale, a framework that generates critical driving scenarios through scalable 3D simulation, improving end-to-end driving models without requiring more real-world data [2][5][44].

Background Review
- Data scaling has been a fundamental principle of modern deep learning across fields including language and vision. In autonomous driving, end-to-end planning leverages large-scale driving data toward fully autonomous systems [5][44].

SimScale Framework
- SimScale is a simulation-generation framework that uses high-fidelity neural rendering to create diverse reactive traffic scenarios and pseudo-expert demonstrations. It combines simulation and real-world data to improve the robustness and generalization of various end-to-end models [6][12][44].

Simulation Data Generation
- The framework employs a 3D Gaussian Splatting (3DGS) simulation data engine to control the states of the ego vehicle and other agents over time, rendering multi-view videos from the ego perspective. Vehicle trajectories are perturbed to maximize state-space coverage, and corresponding expert trajectories are generated for comparison [13][15][19].

Experimental Results
- On the navhard and navtest benchmarks, all models show significant performance gains; GTRS-Dense reaches a score of 47.2 on navhard, a new state of the art. Integrating simulation data improves model robustness in challenging and unseen scenarios [30][31][32][44].

Data Scaling Analysis
- With real-world data held fixed, planner performance improves predictably as simulation data grows. Exploring pseudo-expert behaviors and interactive environments significantly boosts the effectiveness of simulation data [33][38][39][44].

Conclusion
- SimScale shows how large-scale simulation can amplify the value of real-world datasets in end-to-end autonomous driving. Its pseudo-expert data generation and collaborative training approach yield notable performance gains, underscoring the importance of simulation in the development of autonomous driving technologies [44].
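The trajectory-perturbation step can be illustrated with a toy: shift the ego path laterally to create an off-distribution start, then blend it linearly back onto the original path, so the blended curve plays the role of a pseudo-expert recovery demonstration. The function name and the linear blend are assumptions of this sketch; SimScale's actual engine additionally renders 3DGS sensor data around such perturbed states:

```python
import numpy as np

def perturb_and_recover(traj, lateral_offset=1.5, recover_steps=10):
    """Apply a lateral perturbation that fades to zero over recover_steps,
    producing a toy pseudo-expert recovery trajectory.
    traj: (T, 2) array-like of (longitudinal, lateral) waypoints."""
    traj = np.asarray(traj, dtype=float)
    n = min(recover_steps, len(traj))
    w = np.linspace(1.0, 0.0, n)        # perturbation weight: full at start, zero at end
    offset = np.zeros_like(traj)
    offset[:n, 1] = w * lateral_offset  # perturb only the lateral (y) coordinate
    return traj + offset
```

Training on many such off-path starts paired with smooth returns is one way a planner learns to recover from states the real expert never visited, which is the state-space-coverage argument the article makes.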