自动驾驶之心

TACTILE-VLA: Activating the Physical Knowledge of VLA Models for Tactile Generalization (latest from Tsinghua University)
自动驾驶之心· 2025-07-16 04:05
Core Insights
- The article discusses Tactile-VLA, a model that integrates vision, language, action, and tactile perception to improve robot performance in contact-rich tasks [2][6][20].
Group 1: Background and Core Issues
- Vision-language-action (VLA) models are central to general-purpose robotic agents, but they struggle in contact-rich scenarios because they lack tactile perception [2].
- Tactile perception supplies the feedback that physical interaction requires, and existing models often lack it [2].
Group 2: Key Innovations and Research Goals
- The core finding is that VLA models already contain prior knowledge of physical interaction, which can be activated through tactile sensors to achieve zero-shot generalization in contact tasks [6].
- The Tactile-VLA framework introduces tactile perception as a primary modality, allowing a direct mapping from abstract semantics to physical force control [6].
- The hybrid position-force controller converts force targets into position adjustment commands, which addresses the difficulty of coordinating position and force control [6][10].
- The Tactile-VLA-CoT variant adds a chain-of-thought (CoT) reasoning mechanism, enabling the robot to analyze the causes of a failure and adjust its strategy autonomously [6][14].
Group 3: Overall Architecture
- Tactile-VLA's architecture comprises four key modules and fuses modalities at the token level through a non-causal attention mechanism, so that semantic representations are grounded in physical reality [9].
Group 4: Hybrid Position-Force Control Mechanism
- The hybrid control strategy prioritizes position control and applies force-feedback adjustments only when necessary, preserving precision in both motion and force [10][12].
- The design separates the external net force from the internal grasping force, allowing the fine force adjustments that contact-rich tasks require [13].
Group 5: Chain-of-Thought Reasoning Mechanism
- Tactile-VLA-CoT turns the adjustment process into an interpretable reasoning process, which improves adaptability and robustness in complex tasks [14][15].
Group 6: Data Collection Methods
- A specialized data collection system was developed to obtain high-quality tactile-language aligned data, addressing the lack of force feedback in conventional teleoperation [16][19].
Group 7: Experimental Validation and Results Analysis
- Three groups of experiments were designed to validate Tactile-VLA's capabilities in instruction following, common-sense application, and adaptive reasoning [20].
- In the instruction-following experiment, Tactile-VLA learned the semantic meaning of force-related language, achieving a success rate of 35% on USB tasks and 90% on charger tasks [23].
- The model used common-sense knowledge to adjust interaction forces according to object properties, achieving significant improvements over baseline models [24][30].
- In the adaptive reasoning experiment, Tactile-VLA-CoT achieved an 80% success rate on a blackboard task, showing that it can diagnose and correct failures autonomously [28][32].
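The position-force control idea summarized above (follow the position target, and convert a force target into a position adjustment only when needed) can be illustrated with a minimal one-dimensional sketch. This is not the paper's controller: the class name `HybridController`, the linear stiffness model, the deadband, and all numeric values are assumptions chosen for illustration, and the sketch assumes that moving further along the positive axis increases the contact force.

```python
from dataclasses import dataclass


@dataclass
class HybridController:
    """Hypothetical 1-D position-first controller with a force correction."""
    stiffness: float = 500.0     # N/m, assumed contact stiffness
    force_deadband: float = 0.5  # N, assumed tolerance before force feedback acts
    max_offset: float = 0.01     # m, assumed safety clamp on the correction

    def command(self, target_pos: float, target_force: float,
                measured_force: float) -> float:
        """Return the position command sent to the low-level position loop."""
        error = target_force - measured_force
        # Position control has priority: small force errors are ignored.
        if abs(error) <= self.force_deadband:
            return target_pos
        # Convert the force error into a position offset (dx = dF / k).
        offset = error / self.stiffness
        # Clamp the correction so a large force error cannot cause a jump.
        offset = max(-self.max_offset, min(self.max_offset, offset))
        return target_pos + offset


if __name__ == "__main__":
    ctrl = HybridController()
    print(ctrl.command(0.10, 5.0, 4.8))   # within deadband: pure position control
    print(ctrl.command(0.10, 5.0, 2.0))   # too little force: push slightly further
    print(ctrl.command(0.10, 20.0, 0.0))  # large error: correction is clamped
```

A real system would apply this per axis and, as the summary notes, treat the internal grasping force separately from the external net force; the sketch omits both.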
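Point-cloud mapping at 200,000 points per second with a 70-meter measurement range: a 3D scanning and reconstruction device to love!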
自动驾驶之心· 2025-07-16 04:05
Core Viewpoint
- GeoScan S1 is presented as a highly cost-effective handheld 3D laser scanner for a range of operating scenarios, featuring a lightweight design, one-button operation, and centimeter-level precision in real-time 3D scene reconstruction [1][4].
Group 1: Product Features
- GeoScan S1 generates point clouds at 200,000 points per second, with a maximum measurement distance of 70 meters and 360° coverage, and supports large scenes of over 200,000 square meters [1][23][24].
- It integrates multiple sensors, including RTK, 3D LiDAR, and dual wide-angle cameras, enabling high-precision mapping and real-time data output [7][21][28].
- The handheld unit runs Ubuntu and has a built-in power supply for its sensors, which improves usability [2][3].
Group 2: Performance and Efficiency
- The scanner is designed for ease of use: one button starts a scan, and the exported results can be used immediately without complex deployment [3][4].
- Mapping is efficient and accurate, with relative accuracy better than 3 cm and absolute accuracy better than 5 cm [16][21].
- Multi-sensor fusion and microsecond-level data synchronization support real-time modeling and detailed reconstruction [21][28].
Group 3: Market Position and Pricing
- GeoScan S1 is marketed as the most cost-effective option in the industry, with the basic version starting at 19,800 yuan and higher-priced configurations available [4][51].
- The product has been validated in numerous projects and collaborations with academic institutions, which supports its reliability [3][4].
Group 4: Application Scenarios
- The scanner suits a wide range of environments, including office buildings, parking lots, industrial parks, tunnels, forests, and mines, and completes 3D scene mapping effectively in each [32][40].
- It can be integrated with platforms such as drones, unmanned vehicles, and robots for unmanned operation [38][40].
Group 5: Technical Specifications
- The device measures 14.2 cm x 9.5 cm x 45 cm and weighs 1.3 kg without the battery and 1.9 kg with it; battery life is approximately 3 to 4 hours [16][17].
- It exports data in formats including PCD, LAS, and PLY, and has 256 GB of storage [16][17].
自动驾驶之心 job coaching is here! One-on-one customized job-search coaching
自动驾驶之心· 2025-07-15 12:30
Core Viewpoint
- The article introduces a personalized job coaching service for people seeking to move into the intelligent driving sector, focused on strengthening their skills and improving their application materials [2][8].
Coaching Scope
- Basic services include a personalized assessment of the learner's knowledge structure and skills, a detailed learning plan, learning materials, and regular Q&A sessions [8].
- The service also offers resume optimization suggestions and potential job referrals based on the learner's profile [9].
Pricing Structure
- The coaching service costs 8000 per person and includes at least 10 one-on-one online meetings, each lasting at least one hour [4].
Advanced Services
- Advanced services include hands-on projects that can be added to a resume and mock interviews covering both the HR and the business rounds [11].
Targeted Positions
- The program targets roles across the intelligent driving field, such as intelligent driving product manager, system engineer, algorithm developer, software engineer, testing engineer, and industry analyst [11].
Instructor Background
- Instructors are industry experts with more than 8 years of experience at leading autonomous driving companies and major automotive manufacturers [12].
A reinforcement learning training framework for multimodal large models: EasyR1 code walkthrough (GRPO)
自动驾驶之心· 2025-07-15 12:30
Core Insights
- The article walks through the EasyR1 framework for multimodal reinforcement learning, focusing on its implementation and configuration for training models such as Qwen2.5-VL [1][4][6].
Group 1: Framework Overview
- EasyR1 is derived from the verl framework, which was designed for reinforcement learning on language models [1][6].
- The code version examined dates from around June 10, and the framework is still being updated [1].
Group 2: Configuration Details
- The configuration file is organized into four main sections: data, algorithm, worker, and trainer, each with its own parameters [6][11].
- Data settings include the paths to the training and validation files, the maximum prompt and response lengths, and the batch sizes used in training [9][10].
- Algorithm settings cover the advantage estimator, the discount factors, and the KL divergence options [11][13].
Group 3: Training Workflow
- Training starts from a main script that sets up the data loaders and enters the training loop [42][43].
- The workflow prepares data, generates sequences, and computes rewards, with particular attention to balancing batch sizes across distributed processes [46][50][64].
- The article stresses the handling of multimodal data, so that the training process accommodates different input types [65][66].
Group 4: Data Handling
- The dataset must contain keys such as problem, answer, and images, formatted as JSON so that the loading functions can read it [40][41].
- Data loading supports multiple file formats and feeds directly into the training pipeline [41][32].
Group 5: Model Update Mechanism
- The article describes how the actor model is updated, including how the policy loss is computed and how gradients are handled during training [82][86].
- It highlights the role of KL divergence in training, particularly with respect to the reference model [71][80].
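The "advantage estimator" setting mentioned above refers, for GRPO, to a group-relative advantage: several responses are sampled for the same prompt, and each response's reward is normalized against the mean and standard deviation of its own group. The following is a minimal, framework-independent sketch of that computation. It is not EasyR1's code; the function name `grpo_advantages`, the `group_ids` argument, the `eps` value, and the use of the population standard deviation are assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean, pstdev


def grpo_advantages(rewards, group_ids, eps=1e-6):
    """Normalize each reward against the other samples of the same prompt.

    rewards:   one scalar reward per sampled response
    group_ids: the prompt (group) index each response belongs to
    eps:       assumed small constant that avoids division by zero
    """
    # Collect the rewards of every group.
    groups = defaultdict(list)
    for reward, gid in zip(rewards, group_ids):
        groups[gid].append(reward)

    # Per-group mean and (population) standard deviation.
    stats = {gid: (mean(rs), pstdev(rs)) for gid, rs in groups.items()}

    # Advantage = (reward - group mean) / (group std + eps).
    return [
        (reward - stats[gid][0]) / (stats[gid][1] + eps)
        for reward, gid in zip(rewards, group_ids)
    ]


if __name__ == "__main__":
    # Two prompts, four sampled responses each (hypothetical rewards).
    rewards = [1.0, 0.0, 1.0, 0.0, 2.0, 2.0, 2.0, 2.0]
    group_ids = [0, 0, 0, 0, 1, 1, 1, 1]
    print(grpo_advantages(rewards, group_ids))
```

When every response in a group receives the same reward, all advantages in that group are zero, so that prompt contributes no policy gradient signal.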
NIO reveals a new trump card...
自动驾驶之心· 2025-07-15 12:30
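Source: 智能车参考; author: 有车有据.
193,900 yuan: Li Bin has revealed NIO's key trump card.
Pre-sales of the Onvo (乐道) L90 have opened, and the full-vehicle purchase price of 279,900 yuan was already a surprise.
When the BaaS (Battery as a Service) plan at 193,900 yuan was announced, the venue erupted.
Why can the Chinese build a car this good-looking, compliant, and affordable?
"This is really thought-provoking. How did they build a car that looks this good, meets every requirement, and is this affordable? Apart from low labor costs, what do the Chinese have that Americans don't? This is a serious question."
It also ignited the capital market: after the launch event, NIO's US-listed shares rose by more than 6%.
[Stock quote screenshot: US$3.690, +0.210, +6.03%; after-hours 3.72 ...]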
All in one article: a roundup of outstanding autonomous driving VLA work from the past year
自动驾驶之心· 2025-07-15 12:30
Core Insights
- The article surveys recent Vision-Language-Action (VLA) models for autonomous driving, highlighting the use of navigation information and reinforcement learning to extend reasoning beyond the visual range [2][3][6].
Group 1: NavigScene
- NavigScene is an auxiliary dataset that pairs local multi-view sensor inputs with global natural-language navigation guidance, closing the gap between local perception and global navigation context in autonomous driving [6].
- Three complementary paradigms are built on NavigScene: navigation-guided reasoning, navigation-guided preference optimization, and navigation-guided VLA models, which improve the reasoning and generalization of autonomous driving systems [6].
- Comprehensive experiments show significant improvements in perception, prediction, and planning when global navigation knowledge is integrated into the driving system [6].
Group 2: AutoVLA
- AutoVLA is an end-to-end autonomous driving framework that integrates physical action tokens with a pre-trained VLM backbone, enabling direct policy learning and semantic reasoning from raw visual observations and language instructions [12].
- A reinforcement learning post-training method based on Group Relative Policy Optimization (GRPO) provides adaptive reasoning and further improves performance on end-to-end driving tasks [12].
- AutoVLA achieves competitive results across multiple autonomous driving benchmarks, in both open-loop and closed-loop tests [12].
Group 3: ReCogDrive
- ReCogDrive is an end-to-end autonomous driving system that combines a VLM with a diffusion planner and uses a three-stage training paradigm to address performance drops in rare and long-tail scenarios [13][16].
- The first stage fine-tunes the VLM on a large-scale driving Q&A dataset to narrow the domain gap between general content and real-world driving scenarios [16].
- The method achieves a state-of-the-art PDMS of 89.6 on the NAVSIM benchmark, demonstrating its effectiveness and feasibility [16].
Group 4: Impromptu VLA
- Impromptu VLA introduces a large-scale, richly annotated dataset that addresses the limitations of existing benchmarks for autonomous driving VLA models [22].
- The dataset targets unstructured, extreme scenarios, and models trained on it improve significantly on established benchmarks [22].
- Experiments show notable gains in closed-loop NeuroNCAP scores and collision rates after training with the Impromptu VLA dataset [22].
Group 5: DriveMoE
- DriveMoE is an end-to-end autonomous driving framework that uses a mixture-of-experts (MoE) architecture to handle multi-view sensor data and complex driving scenarios [28].
- The framework combines a scene-specific visual MoE with a skill-specific action MoE, addressing multi-view redundancy and skill specialization [28].
- DriveMoE achieves state-of-the-art closed-loop performance on the Bench2Drive benchmark, showing the benefit of combining visual and action MoE in autonomous driving tasks [28].
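For readers unfamiliar with the mixture-of-experts idea mentioned for DriveMoE, the sketch below shows generic top-k expert routing: a router scores the experts, only the k highest-scoring experts run, and their outputs are combined with renormalized weights. It illustrates the general technique only and makes no claim about DriveMoE's actual router; the function names, the scalar inputs, and the toy experts are all hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(router_logits, k=2):
    """Pick the k most probable experts and renormalize their weights."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return {i: probs[i] / norm for i in top}


def moe_forward(x, experts, router_logits, k=2):
    """Weighted sum of the selected experts' outputs; other experts are skipped."""
    weights = top_k_route(router_logits, k)
    return sum(weight * experts[i](x) for i, weight in weights.items())


if __name__ == "__main__":
    # Three toy "experts" acting on a scalar input (hypothetical stand-ins
    # for, e.g., skill-specific action heads).
    experts = [lambda x: x, lambda x: 2 * x, lambda x: 10 * x]
    # The router strongly disfavors the third expert, so it is never run.
    print(moe_forward(1.0, experts, router_logits=[0.0, 0.0, -100.0], k=2))
```

In a real model the router logits are produced by a learned layer from the input features, and the experts are neural sub-networks rather than scalar functions.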
A graduate student from a "双非" (non-985/211) university, feeling somewhat lost in this year's job hunt...
自动驾驶之心· 2025-07-14 14:04
Core Viewpoint
- The article stresses the importance of keeping up with cutting-edge technology in autonomous driving and embodied intelligence, and the need for strong technical skills in areas such as large models, reinforcement learning, and 3D graphics [4][5].
Group 1: Industry Trends
- Demand for talent in robotics and embodied intelligence is growing, with many startups receiving significant funding and showing rapid growth potential [4][5].
- Major companies are moving from traditional methods to end-to-end solutions and large models, a sign of the industry's technological evolution [4][5].
- The community aims to build an ecosystem that connects academia, products, and recruitment, supporting knowledge sharing and job opportunities [6].
Group 2: Technical Directions
- The article identifies four key technical directions: visual large language models, world models, diffusion models, and end-to-end autonomous driving [9].
- It provides resources and summaries of research papers and datasets for these technologies, reflecting a strong emphasis on research and development [10][17][18][19][20][21][22][23][24][25][26][27][28][29][30][31][32][35][36][38].
Group 3: Community and Learning Resources
- The community offers learning materials, including video courses, hardware, and coding resources, to equip members with the skills the changing job market requires [6].
- It also provides a supportive venue for discussing the latest industry trends, technical challenges, and job opportunities, which matters for professionals looking to advance their careers [6].
In the era of VLA, why does this company insist on mass-producing a non-end-to-end solution?
自动驾驶之心· 2025-07-14 11:54
Core Viewpoint
- Company A has pursued a low-cost, incremental strategy and avoided large-scale investment in end-to-end solutions, which has allowed it to meet requirements with its existing modular solutions [1].
Group 1: Company Strategy
- Company A's main strategy is a two-stage approach built on multiple perception and prediction outputs (map, OD, OCC, etc.), which keeps it competitive as a mid-to-low-cost supplier [1].
- The company accumulated a large amount of relevant training data early on, so modular solutions appear to be its most cost-effective option [1].
- Many Tier 1 suppliers are likely to keep their existing production solutions unless end-to-end solutions can demonstrably outperform modular ones across multiple areas [1].
Group 2: Industry Context
- Companies across the industry, Company A included, struggle with the high cost of advanced research and development, which limits their ability to scale production [1].
- The reliance on modular solutions also reflects the importance of supply chain stability, since manufacturers prefer solutions that are compatible with their existing ecosystems [1].
How is the autonomous driving job market now? What solutions are out there?
自动驾驶之心· 2025-07-14 11:30
Core Insights
- The article discusses the current state of the autonomous driving industry, including job opportunities and technological advances [1][10].
Autonomous Driving Levels and Applications
- The main functions of autonomous driving are driving, parking, cabin interaction, and V2X (Vehicle-to-Everything) communication [1].
Key System Components
- The core systems consist of chips, software, and sensors [3].
Technology Trends Overview
- The traditional autonomous driving pipeline is being complemented by end-to-end autonomous driving and algorithmic approaches such as VLM (Vision-Language Model) and VLA (Vision-Language-Action) [5][6][7].
Major Players in the Industry
- Newer entrants include Xpeng, Li Auto, NIO, Huawei, and others, while established manufacturers include BYD, Geely, and international brands such as Mercedes and Volkswagen [7].
- Suppliers include listed companies such as Horizon Robotics and Momenta, as well as major tech firms such as Baidu and Didi [7].
Job Positions and Directions
- Traditional roles focus on localization and mapping, perception, and post-fusion techniques, while newer roles emphasize end-to-end algorithms, reinforcement learning, and data closed-loop experience [8].
Community and Resources
- The AutoRobo Knowledge Planet serves as a hub for job seekers in autonomous driving and related fields, offering interview questions, industry reports, and resume optimization services [10][11][22].
Autonomous driving roundtable | What happened in autonomous driving in the first half of the year?
自动驾驶之心· 2025-07-14 11:30
Core Viewpoint
- The article discusses the current state and future directions of autonomous driving technology: which technologies have matured, which challenges remain, and which trends are emerging.
Group 1: Current Technology Maturity
- BEV (Bird's Eye View) and OCC (Occupancy) perception methods have matured, and no major player claims that BEV is unusable [2][13].
- The main challenge remains corner cases: 99% of scenarios are manageable, but complex situations such as rural roads and large intersections are still difficult [13].
- E2E (End-to-End) models have not yet shown clear advantages over two-stage models in practical applications, despite their theoretical appeal [4][5].
Group 2: Emerging Technologies
- VLA (Vision-Language-Action) is gaining attention because it simplifies tasks and may handle corner cases better than traditional methods [5][6].
- Model efficiency is a critical issue, and participants discussed using smaller models to approach the performance of larger ones [6][30].
- Reinforcement learning has not yet proven significantly impactful in autonomous driving, and better simulation environments are needed to validate its effectiveness [7][51].
Group 3: Future Directions
- Participants agree that VLA and VLM (Vision-Language Model) will be key areas of future development, with a focus on reasoning capability and safety [45][48].
- The industry is becoming more data-driven, and efficiency in data collection, cleaning, and training will determine competitive advantage [28][40].
- Integrating world models with closed-loop simulation is seen as essential for advancing autonomous driving technology [47][50].
Group 4: Industry Perspectives
- The shift toward VLA/VLM is viewed as a necessary evolution that could improve user experience and safety in autonomous vehicles [28][45].
- The debate between deepening expertise in autonomous driving and moving into embodied intelligence reflects both the industry's changing landscape and personal career choices [22][27].
- The current focus on safety and robustness in L4 (Level 4) autonomous driving points to diverging technical approaches between L2+ and L4 players [25][36].