One-Stop Overview! A Roundup of the Past Year's Outstanding Autonomous Driving VLA Work
自动驾驶之心· 2025-07-15 12:30
Core Insights
- The article discusses advancements in Vision-Language-Action (VLA) models for autonomous driving, highlighting the integration of navigation and reinforcement learning to extend reasoning capabilities beyond visual range [2][3][6].

Group 1: NavigScene
- NavigScene is introduced as a novel auxiliary dataset that pairs local multi-view sensor inputs with global natural-language navigation guidance, addressing the critical gap between local perception and global navigation context in autonomous driving [6].
- Three complementary paradigms are implemented around NavigScene: navigation-guided reasoning, navigation-guided preference optimization, and navigation-guided VLA models, enhancing the reasoning and generalization capabilities of autonomous driving systems [6].
- Comprehensive experiments demonstrate significant performance improvements in perception, prediction, and planning tasks when global navigation knowledge is integrated into autonomous driving systems [6].

Group 2: AutoVLA
- AutoVLA is proposed as an end-to-end autonomous driving framework that integrates physical action tokens with a pre-trained VLM backbone, enabling direct policy learning and semantic reasoning from raw visual observations and language instructions [12].
- A reinforcement learning-based post-training method using Group Relative Policy Optimization (GRPO) is introduced to achieve adaptive reasoning and further enhance model performance in end-to-end driving tasks; a sketch of the group-relative advantage computation follows this list [12].
- AutoVLA achieves competitive performance across multiple autonomous driving benchmarks, including open-loop and closed-loop tests [12].

Group 3: ReCogDrive
- ReCogDrive is presented as an end-to-end autonomous driving system that integrates a VLM with a diffusion planner, employing a three-stage training paradigm to address performance drops in rare and long-tail scenarios [13][16].
- The first stage fine-tunes the VLM on a large-scale driving Q&A dataset to mitigate the domain gap between general content and real-world driving scenarios [16].
- The method achieves a state-of-the-art PDMS score of 89.6 on the NAVSIM benchmark, highlighting its effectiveness and feasibility [16].

Group 4: Impromptu VLA
- Impromptu VLA introduces a large-scale, richly annotated dataset aimed at addressing the limitations of existing benchmarks for autonomous-driving VLA models [22].
- The dataset is designed to improve VLA model performance in unstructured extreme scenarios, demonstrating significant gains on established benchmarks [22].
- Experiments show that training with the Impromptu VLA dataset yields notable improvements in closed-loop NeuroNCAP scores and collision rates [22].

Group 5: DriveMoE
- DriveMoE is a novel end-to-end autonomous driving framework that incorporates a mixture-of-experts (MoE) architecture to handle multi-view sensor data and complex driving scenarios effectively [28].
- The framework features a scene-specific visual MoE and a skill-specific action MoE, addressing the challenges of multi-view redundancy and skill specialization [28].
- DriveMoE achieves state-of-the-art performance in closed-loop evaluation on the Bench2Drive benchmark, demonstrating the effectiveness of combining visual and action MoE in autonomous driving tasks [28].
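The summary mentions GRPO-based post-training for AutoVLA but gives no formulas. Below is a minimal sketch of the group-relative advantage computation at the heart of GRPO, which standardizes each sampled completion's reward against the other completions for the same prompt so no learned critic is needed; the reward values and group size are hypothetical placeholders, not details from the AutoVLA paper.

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """GRPO scores each sampled completion against the other completions
    for the *same* prompt: the advantage is the reward standardized
    within the group, so no value network (critic) is required."""
    group_mean = rewards.mean()
    group_std = rewards.std()
    return (rewards - group_mean) / (group_std + eps)

# Hypothetical example: 4 candidate driving trajectories sampled for one
# scene, scored by some task reward (e.g., progress minus collision penalty).
rewards = np.array([0.9, 0.4, 0.7, 0.1])
print(grpo_advantages(rewards))  # positive for above-average trajectories
```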
TACTILE-VLA: Activating the Physical Knowledge in VLA Models for Tactile Generalization (Latest from Tsinghua University)
具身智能之心· 2025-07-15 07:55
Core Insights
- The article discusses the development of Tactile-VLA, a model that integrates visual, language, action, and tactile perception to enhance robotic capabilities in contact-intensive tasks [2][6].

Group 1: Background and Core Issues
- Vision-language-action (VLA) models have strong semantic understanding and cross-modal generalization capabilities, but they struggle in contact-intensive scenarios due to a lack of tactile perception [2][6].
- Tactile perception provides critical feedback in physical interactions, such as friction and material properties, which is essential for tasks requiring fine motor control [2][6].

Group 2: Key Innovations and Research Goals
- The core finding is that VLA models contain prior knowledge of physical interactions; connecting this knowledge to tactile sensors activates it, enabling zero-shot generalization in contact-intensive tasks [6][7].
- The Tactile-VLA framework introduces tactile perception as a primary modality, allowing direct mapping from abstract semantics to physical force control [7].
- The hybrid position-force controller innovatively converts force targets into position-adjustment commands, addressing the challenge of coordinating position and force control (see the sketch after this list) [7].

Group 3: Architecture and Mechanisms
- Tactile-VLA's architecture includes four key modules: instruction adherence to tactile cues, application of tactile-related common sense, adaptive reasoning through tactile feedback, and a multi-modal encoder for unified token representation [12][13].
- The hybrid position-force control mechanism ensures positional precision while allowing fine-tuned force adjustments during contact tasks [13].
- The Tactile-VLA-CoT variant incorporates a chain-of-thought (CoT) reasoning mechanism, enabling robots to analyze failure causes from tactile feedback and autonomously adjust their strategies [13][14].

Group 4: Experimental Validation and Results
- Three experimental setups were designed to validate Tactile-VLA's capabilities in instruction adherence, common-sense application, and adaptive reasoning [17].
- In the instruction-adherence experiment, Tactile-VLA achieved a success rate of 35% on USB tasks and 90% on charger tasks, significantly outperforming baseline models [21][22].
- The common-sense experiment demonstrated Tactile-VLA's ability to adjust interaction forces based on object properties, achieving success rates of 90%-100% for known objects and 80%-100% for unknown objects [27].
- The adaptive-reasoning experiment showed that Tactile-VLA-CoT completed a blackboard task with an 80% success rate, demonstrating problem-solving through reasoning [33].
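The article describes a controller that turns force targets into position adjustments but does not give its form. The sketch below shows one standard way such a hybrid scheme can work, an admittance-style correction along the contact direction; the gain, target force, and sensor readings are illustrative assumptions, not the paper's actual controller.

```python
def hybrid_position_force_step(x_desired: float,
                               f_target: float,
                               f_measured: float,
                               k_f: float = 0.002) -> float:
    """One control tick of a simple hybrid position/force law.

    Along the contact direction, the force error is converted into a
    small position offset so a position-controlled robot can regulate
    force: pressing too hard (f_measured > f_target) backs the commanded
    position off; pressing too lightly pushes it in.
    """
    force_error = f_target - f_measured      # N
    position_correction = k_f * force_error  # m per tick (assumed gain)
    return x_desired + position_correction

# Hypothetical wiping task: hold ~5 N of normal force against a board.
x_cmd = 0.30                                # nominal contact depth (m)
for f_sensed in [0.0, 2.0, 4.5, 5.5, 5.1]:  # fake tactile readings (N)
    x_cmd = hybrid_position_force_step(x_cmd, f_target=5.0, f_measured=f_sensed)
    print(f"measured {f_sensed:.1f} N -> commanded x = {x_cmd:.4f} m")
```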
Tesla Optimus V3 Is Here!!
Robot猎场备忘录· 2025-07-15 04:18
Core Viewpoint
- Recent developments around Tesla's Optimus robot, including order cuts and leadership changes, have caused significant swings in the robotics sector, particularly in T-chain concept stocks. These changes are nonetheless seen as necessary adjustments for the Optimus project's future success, paving the way for the upcoming Optimus V3 model [1][2].

Group 1: Market Reactions
- On June 19, news of order cuts at a Tesla Optimus supplier triggered a decline in the robotics sector, with T-chain concept stocks falling sharply [1].
- After the order cuts and the postponement of mass-production plans were confirmed, T-chain stocks such as Zhejiang Rongtai (603119.SH) hit their daily limit down, while others like Beite Technology and Sanhua Intelligent Control fell more than 4% [1].
- Despite a general recovery in the robotics sector on June 23, T-chain stocks continued to decline, indicating lingering market concerns [1].

Group 2: Project Developments
- The Optimus project is undergoing a hardware and software redesign, with a short-term reduction in delivery volumes anticipated; the project's importance is unchanged as it aims for a more robust and reliable next-generation product [2].
- Milan Kovac, the original Optimus project lead, has left the team, likely reflecting the need for a new direction in technology development [2].
- On June 25, Elon Musk announced that Optimus V3 will integrate the Grok voice assistant, using AI language models for interaction, a significant technological step [3].

Group 3: Future Prospects
- On July 10, at the xAI launch event, Musk reiterated that Grok 4 will be integrated into Tesla's Optimus, targeting a real-world reinforcement learning loop by the end of the year [5].
- Recent orders for more than 100 Optimus units were reported, suggesting the hardware redesign is progressing well [6].
- Musk expressed confidence in the latest Optimus developments, saying the upcoming demonstrations will be the most impressive to date [6].

Group 4: Industry Dynamics
- T-chain concept stocks have shown positive momentum recently, with several companies releasing favorable news [8].
- The market is closely watching Tesla's quarterly meeting on July 24 and the shareholder meeting on November 6 for further insight into the Optimus project and its supply chain [9].
- Zhiyuan Robotics' acquisition of a Sci-Tech Innovation Board listed company has driven a significant increase in market capitalization, highlighting growing interest in humanoid robotics [9].
"The U.S. Has Basically Exited; It's All China's Now"
Guan Cha Zhe Wang· 2025-07-15 04:08
Core Viewpoint
- Meta is considering a significant shift in its AI strategy, potentially moving from open-source AI models to closed-source models, which would mark a departure from its long-standing commitment to open-source development [1][5][6].

Group 1: Strategic Shift
- Meta's newly established "Super Intelligence Lab" (MSL) is contemplating abandoning its powerful open-source AI model, Behemoth, in favor of developing a closed-source model [1][5].
- This potential shift is seen as a major strategic change for Meta, which has historically held that open-source technology fosters faster AI development and broader access for developers [5][6].
- The decision is reportedly influenced by Behemoth's underperformance in internal testing, which has delayed its release [5][6].

Group 2: Leadership and Talent Acquisition
- Meta has appointed Alexandr Wang, who previously led Scale AI, as its new AI head overseeing the Super Intelligence Lab, a specialized team of about 12 members [6][7].
- The company has adopted a high-paying talent-acquisition strategy, offering packages exceeding $100 million to attract top researchers from competitors such as OpenAI, Google, and Apple [5][6].

Group 3: Market Implications
- A move toward closed-source models could signal a retreat from the competitive landscape of open-source large language models (LLMs), with concerns raised that the U.S. is losing its edge in this area [1][3].
- Meta's AI strategy is being watched closely, especially as the company faces challenges in the AI technology sector [5][6].
More Effective than Adam: POET Builds on Spectral Invariance to Make LLM Training Both Stable and Fast
机器之心· 2025-07-15 00:59
Core Viewpoint
- The article discusses POET (Reparameterized Training via Orthogonal Equivalence Transformation), a novel training paradigm for large language models (LLMs) that aims to enhance training efficiency and stability from first principles [2][3].

Group 1: POET Methodology
- POET structurally reparameterizes each neuron with two learnable orthogonal matrices around a fixed random weight matrix, preserving the singular-value distribution of the weights throughout training; a numerical sketch of this idea follows this list [3][11].
- The method combines singular-value invariance with minimal hyperspherical energy, providing a new paradigm that offers both physical interpretability and generalization capability for large-model training [3][11].
- POET's training process is designed to stabilize optimization and significantly improve model generalization [3][11].

Group 2: Advantages of POET
- POET maintains the spectral properties of the weight matrix throughout training, keeping the singular values consistent with those of the randomly initialized matrix [17].
- The method allows efficient parameter control and avoids the excessively large singular values that can arise in standard LLM training [17].
- Two new initialization strategies, normalized Gaussian initialization and uniform-spectrum initialization, are proposed to ensure bounded singular values in the generated weight matrices [17].

Group 3: Training Dynamics and Performance
- Experimental results demonstrate POET's superior performance in training large language models, including improvements in perplexity and training efficiency over traditional methods such as AdamW [20][24].
- POET's training process divides into three phases: conical-shell searching, stable learning on the conical shell, and final adjusting, reflecting how the orthogonal matrices evolve during training [40][41].
- POET's fully stochastic sampling approach substantially reduces memory cost compared with traditional methods, improving scalability [26][27].
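To make the spectral-invariance claim concrete, here is a minimal numerical sketch of the POET idea as summarized above: a fixed random weight matrix W0 is wrapped by two orthogonal matrices R and P, and because orthogonal transforms preserve singular values, the effective weight R @ W0 @ P keeps W0's spectrum no matter how R and P change. The matrix sizes and the way the orthogonal factors are produced (matrix exponential of a skew-symmetric matrix) are illustrative choices, not the paper's exact parameterization.

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(0)

def random_orthogonal(n: int) -> np.ndarray:
    """Build an orthogonal matrix as exp(A) with A skew-symmetric,
    one common way to parameterize a learnable orthogonal factor."""
    a = rng.normal(size=(n, n))
    skew = a - a.T
    return expm(skew)

d_out, d_in = 6, 4
W0 = rng.normal(size=(d_out, d_in))  # fixed random weights (frozen)
R = random_orthogonal(d_out)         # learnable in POET; random here
P = random_orthogonal(d_in)          # learnable in POET; random here

W_eff = R @ W0 @ P                   # effective weight used by the layer

s0 = np.linalg.svd(W0, compute_uv=False)
s_eff = np.linalg.svd(W_eff, compute_uv=False)
print(np.allclose(np.sort(s0), np.sort(s_eff)))  # True: spectrum preserved
```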
Brain-Computer Interfaces: From "Decoding Language" to Broader Possibilities (International Science and Technology Frontiers)
Ren Min Ri Bao· 2025-07-14 21:57
Brain-computer interface (BCI) technology detects and modulates brain activity to build a direct information pathway between the brain and external devices, creating an unprecedented form of human-machine interaction and bringing "conversation by thought" out of science fiction and into reality. In recent years, as the technology has iterated, researchers in many countries have carried out experimental explorations in this field, with a string of breakthroughs in language BCIs in particular: real-time "brain-wave-to-speech" communication for paralyzed stroke patients, brain-controlled robotic arms writing Chinese characters, and improved quality of life for ALS patients. This emerging technology is building a bridge to the world for people with speech and language impairments, and it will also offer new ideas and approaches for treating neurological and related disorders.

Converting brain activity into speech in real time

The brain is a powerful yet solitary organ. Tightly guarded by the skull, it processes sensation, emotion, memory, decision-making, and movement. Information enters the brain from the outside world, or passes from the brain to the outside, through the body's biological information interfaces, namely the senses and the nervous system. Modern technology now allows humans to detect brain-activity signals, decode the information they carry, and use that information to bypass the muscular system and directly control external devices, in effect establishing an artificial information interface between the brain and the outside world. This is brain-computer interface technology.

These research advances lay a solid foundation for bringing language BCIs into practical use. The bigger challenge ahead may be decoding intent and semantics. Current research mainly tackles decoding speech-motor commands from the cortical areas that control vocalization, but a considerable portion of aphasia patients ...
Latest from XPeng! NavigScene: Global Navigation Enables Beyond-Visual-Range Autonomous Driving VLA (ACM MM '25)
自动驾驶之心· 2025-07-14 11:30
Today, 自动驾驶之心 shares NavigScene, the latest work from the University of Central Florida and XPeng Motors, accepted at ACM MM '25: connecting local perception with global navigation to enable beyond-visual-range autonomous driving!

Paper authors | Qucheng Peng et al.
Editor | 自动驾驶之心

Preface & the author's perspective

Autonomous driving systems have made remarkable progress in perception, prediction, and planning based on local visual information, but they struggle to integrate the broader navigation context that human drivers routinely use. To address this, the XPeng team proposes NavigScene, aiming to close the critical gap between local sensor data and global navigation information. NavigScene is an auxiliary navigation-guided natural-language dataset that simulates a human-like driving environment within autonomous driving systems. The team also develops three complementary methods to leverage NavigScene: (1) navigation-guided reasoning, which enhances vision-language models by incorporating navigation context into the prompting approach; (2) navigation-guided preference optimization, which is a ...
ACL 2025 | Self-Doubt or Self-Correction? A Tsinghua Team Reveals the Dark Side of LLM Reflection Techniques
机器之心· 2025-07-14 04:08
Core Viewpoint
- The research highlights the limitations of intrinsic self-correction in large language models (LLMs): when prompted to "think again," the models often fail to improve and can turn correct answers into incorrect ones, even on simple factual questions [2][24].

Group 1: Reflection Technology Failures
- The study systematically evaluates reflection failures across various LLMs and tasks, finding that failures occur more often than successes, even in advanced models [7][8].
- For instance, on the Decision Making task, the o1-mini model's reflection failure rate is higher than that of the o4 and 3.5-turbo models [8].
- Recent evaluations of ChatGPT models (4.5, 4.1, o4-mini, o3) also show significant reflection failure rates, with the o4-mini model's accuracy dropping by 22.1% [9].

Group 2: Reasons for Reflection Failures
- Three primary causes are identified: internal answer fluctuation, prompt bias, and cognitive bias [20][24].
- Internal answer fluctuation indicates that LLMs exhibit self-doubt, changing answers frequently across multi-turn dialogue [12][15].
- Prompt bias shows that LLMs tend to focus excessively on the reflection prompt rather than the actual question, accounting for 76.1% of failures [18].
- Cognitive bias reveals that LLMs can overthink, generating excessive "think" instructions that lead to decision paralysis [20].

Group 3: Mitigation Strategies
- Two effective mitigation strategies are proposed: problem repetition and few-shot fine-tuning [22][24].
- Problem repetition appends the initial question to the reflection prompt so the model stays focused on the original query; a sketch of this construction follows this list [25].
- Few-shot fine-tuning, which introduces no new knowledge but corrects abnormal behaviors, works better at alleviating reflection failures [25].
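The "problem repetition" mitigation is simple enough to show directly: the original question is restated inside the reflection turn so the model re-grounds on the task instead of over-weighting the bare "think again" cue. The message format and wording below are a hedged illustration of the idea, not the paper's exact prompts.

```python
def build_reflection_turn(original_question: str,
                          previous_answer: str,
                          repeat_problem: bool = True) -> list[dict]:
    """Construct the second-turn messages of a reflection prompt.

    With repeat_problem=True, the original question is appended to the
    reflection instruction (the "problem repetition" fix described
    above), keeping the model's attention on the actual query rather
    than on the instruction to reconsider.
    """
    reflection = "Please reconsider your answer and think again."
    if repeat_problem:
        reflection += f" The original question was: {original_question}"
    return [
        {"role": "user", "content": original_question},
        {"role": "assistant", "content": previous_answer},
        {"role": "user", "content": reflection},
    ]

# Hypothetical usage with a simple factual question:
msgs = build_reflection_turn("What is the capital of Australia?", "Sydney.")
for m in msgs:
    print(m["role"], ":", m["content"])
```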
The Rise of Multimodal Large Models: Huatai Securities Predicts the Application Singularity Is Near
Sou Hu Cai Jing· 2025-07-13 23:44
Core Insights
- The report by Huatai Securities highlights the rapid development of multimodal large models (MLLM) and their applications, indicating that the field is approaching a critical turning point [1][4][15].

Development Dynamics
- MLLM is seen as an inevitable step in the evolution of large language models (LLM), integrating capabilities across modalities to expand application scenarios [1][6].
- MLLMs can be categorized into modular and native architectures, with the latter showing significant advantages in performance and efficiency, albeit with higher computational and technical requirements [1][6].

Commercialization Trends
- Globally, multimodal applications are progressing faster overseas than domestically, first-tier companies are advancing faster than second-tier ones, and multimodal products are commercializing faster than text-based products [1][7].
- Overseas chatbot products, such as those from OpenAI and Anthropic, have achieved annual recurring revenue (ARR) exceeding $1 billion, while domestic chatbot commercialization remains in its early stages [1][7].

Video Generation Sector
- Domestic companies excel in video generation, with products like ByteDance's Seedance 1.0 and Kuaishou's Kling achieving significant market presence [2][8].
- Kuaishou's Kling reached an ARR of over $100 million within roughly 10 months of launch, a significant milestone for the domestic video-generation sector [2][8].

Future Outlook
- The report anticipates that the singularity for multimodal large models and applications is approaching, driven by technological advances and accelerating commercialization [5][15].
- Integrated multimodal data processing will greatly expand AI's application scenarios, enabling large-scale deployment across various fields [4][15].

Investment Opportunities
- The report points to potential investment opportunities in both compute and applications, highlighting the demand for computational resources from native multimodal models and the growing AI needs of advertising, retail, and creative industries [9].
AGI Isn't Coming That Soon: Without Continuous Learning, AI Can't Fully Replace White-Collar Workers
36Ke· 2025-07-13 23:23
Group 1
- The article discusses the limitations of current AI models, particularly their lack of continuous-learning capability, which is seen as a significant barrier to achieving Artificial General Intelligence (AGI) [1][6][10].
- The author predicts that while short-term gains in AI capability may be limited, the probability of a major intelligence breakthrough within the next ten years is increasing [1][10][20].
- The article emphasizes that human-like continuous learning is essential for AI to reach its full potential; without it, AI will struggle to replace human workers on many tasks [6][10][18].

Group 2
- The author is skeptical about the timeline for reliable computer-operating AI, suggesting current models cannot yet perform complex tasks autonomously [12][13][14].
- Predictions include AI handling small-business tax work by 2028 and achieving human-like learning ability by 2032 [17][18][19].
- The article concludes with a warning that the next decade will be crucial for AI development, with the potential for significant advances or stagnation depending on breakthroughs in algorithms and learning capabilities [22].