VLA
Now recruiting partners for the VLA+RL direction
具身智能之心· 2025-11-24 10:02
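Compensation: 具身智能之心 is China's first full-stack technical community for embodied intelligence and brings together many people working on VLA and RL. We offer pay above the industry average and extensive industry resources. For details, add WeChat: oooops-life. Requirements: your research direction must be VLA+RL. From academia, we are looking for PhD level and above (current PhD students included) with top-conference papers in the area. From industry, we expect hands-on project experience and experience debugging on real robots. Many community members have recently asked us about VLA and RL and hope 具身智能之心 can cover these topics in more depth. We are therefore recruiting course and project mentors for the VLA+RL direction from followers across all our platforms, to produce substantive content together with us. ...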
Cognition-driven intelligent driving at Xiaomi: from end-to-end and world models to VLA......
自动驾驶之心· 2025-11-24 00:03
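Over the weekend I watched the video of flypig interviewing four senior figures at Xiaomi Auto. Here are the key points.
The first half, on Xiaomi's intelligent driving entering the cognition-driven stage, mainly features Dr. Ye Hangjun (叶航军) and Dr. Chen Guang (陈光). Dr. Ye heads Xiaomi's intelligent driving business, and Dr. Chen is responsible for end-to-end mass production.
Points covered in the first half:
- Across the three dimensions of safety, comfort, and efficiency, safety always comes first if a trade-off has to be made. As driving capability improves, all three dimensions improve as well.
- Dr. Ye recounted the recent development history of Xiaomi's intelligent driving: March 2024, highway NOA based on HD maps → around May 2024, urban NOA → October 2024, a move toward light-map and mapless versions → February 2025, an end-to-end version trained on 3 million clips (300W clips) → July 2025, a version trained on 10 million clips (1000W clips), and the recently released world-model version.
The two covered most of the core points of Xiaomi's intelligent driving: the groundwork of earlier years, the mass-production journey of the past two years, their views on world models and VLA, and some outlook on future development. The author (柱哥) has posted further notes and opinions on the interview in the 自动驾驶之心 Knowledge Planet community, where readers are welcome to join the discussion.
The second half mainly interviews Dr. Chen Long (陈龙) and Dr. Wang Naiyan (王乃岩), who are respectively responsible for VL ...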
Calling everyone working on VLA+RL
具身智能之心· 2025-11-21 00:04
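Many community members have recently asked us about VLA and RL and hope 具身智能之心 can cover these topics in more depth. We are therefore recruiting course and project mentors for the VLA+RL direction from followers across all our platforms, to produce substantive content together with us. 具身智能之心 is China's first full-stack technical community for embodied intelligence and brings together many people working on VLA and RL. We offer pay above the industry average and extensive industry resources. For details, add WeChat: oooops-life. Requirements: your research direction must be VLA+RL. From academia, we are looking for PhD level and above (current PhD students included) with top-conference papers in the area. From industry, we expect hands-on project experience and experience debugging on real robots. Compensation ...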
Three major technical routes for autonomous driving: end-to-end, VLA, and world models
自动驾驶之心· 2025-11-21 00:04
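Author | 深蓝学院 Source | 自动驾驶最新技术路线总结(分阶段、BEV、端到端、VLA) (Summary of the latest autonomous driving technical routes: staged, BEV, end-to-end, VLA)
Overview
- Problem the industry is solving: being safe and economical; corner cases
Competing technical routes
- Single-vehicle intelligence vs connected intelligence
- Sensors: vision vs LiDAR
- Algorithm architecture: modular vs end-to-end
- AI decision-making: VLM vs VLA vs WA (LLM removed)
- Waymo and other mainstream companies adopt VLM: the AI handles environment understanding and reasoning, while final decision authority stays with traditional modules so that the process remains controllable
- Tesla, Geely, Xpeng, and others are exploring VLA, which tries to have the AI learn all driving skills directly and reach "end-to-end" decisions through training on massive data
- Huawei: the WEWA architecture represented by ADS 4 (World Engine + World Action Model)
Image source: https://arxiv.org/pdf/2506.24044
Rule-based systems → data-driven → cognitive modeling
- Before 2022: perception, prediction, decision-making (planning and control)
- 2022: BEV perception becomes mainstream
- 2023: OCC perception ...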
Comparing Xpeng's and Li Auto's VLA based on accurate source material
理想TOP2· 2025-11-20 10:42
Core Viewpoint
- The article discusses advances in autonomous driving technology, focusing on the VLA (Vision-Language-Action) architecture developed by Li Auto and on remarks made in a podcast by Liu Xianming, head of autonomous driving at Xpeng. Liu emphasizes removing the intermediate language component (L) to improve scalability and data efficiency [1][4][5].
Summary by Sections
VLA Architecture and Training Process
- The VLA architecture starts with a pre-training phase on a 32-billion-parameter (32B) vision-language model that incorporates 3D vision and high-definition 2D vision, with clarity 3-5 times that of open-source models. The training data also includes driving-related language data and key VL joint data [10][11].
- The model is distilled into a 3.2-billion-parameter (3.2B) MoE model so that inference is fast on vehicle hardware. A post-training phase then integrates action to form the VLA, raising the parameter count to nearly 4 billion [13][12].
- The reinforcement learning phase has two parts: reinforcement learning from human feedback (RLHF), and pure reinforcement learning on data generated by a world model. Both focus on comfort, collision avoidance, and adherence to traffic regulations [15][16].
Data Utilization and Efficiency
- Liu argues that using language as a supervisory signal can introduce human biases, which reduces data efficiency and scalability. The hardest data to collect are corner cases, and they are crucial for training [4][6].
- The architecture aims for a high level of generalization, with plans to launch L4 robotaxi services in Guangzhou based on the current framework [4][5].
Future Directions and Challenges
- Liu acknowledges uncertainty about scaling the technology safely, and asks how to maintain safety standards and align the model with human behavior [5][18].
- The conversation highlights that VLA, VLM, and world models are all fundamentally end-to-end architectures, and that various companies are working on similar concepts under the banner of Physical AI [5][18].
Human-Agent Interaction
- The driver agent processes short commands directly, while complex instructions are sent to the cloud for processing before execution. This lets the system understand and interact with the physical world the way a human driver does [17][18].
- The article concludes that traffic is a suitable domain for VLA because its rules are well defined and human driving behavior can be modeled effectively [19][20].
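As a rough illustration, the staged training recipe described in this summary can be written down as plain data. This is only a sketch: the `Stage` class, the stage names, and the `compression_ratio` helper are hypothetical names introduced here, not anything from Li Auto or Xpeng. The parameter counts are the ones quoted in the summary, with "nearly 4 billion" rounded to 4.0.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Stage:
    """One step of the training recipe (hypothetical structure for illustration)."""
    name: str
    params_billion: Optional[float]  # None where the summary gives no figure
    note: str


# Stage order and sizes follow the summary; names are assumptions.
PIPELINE: List[Stage] = [
    Stage("pretrain", 32.0, "vision-language base model (3D + high-definition 2D vision)"),
    Stage("distill", 3.2, "MoE model small enough for fast in-vehicle inference"),
    Stage("post_train", 4.0, "action integrated to form the VLA ('nearly 4 billion')"),
    Stage("rl", None, "RLHF, then pure RL on world-model-generated data"),
]


def compression_ratio(stages: List[Stage]) -> float:
    """Ratio of pre-trained parameters to distilled parameters."""
    by_name = {s.name: s for s in stages}
    big = by_name["pretrain"].params_billion
    small = by_name["distill"].params_billion
    if big is None or small is None:
        raise ValueError("parameter counts missing for pretrain/distill")
    return big / small


if __name__ == "__main__":
    for stage in PIPELINE:
        size = "n/a" if stage.params_billion is None else f"{stage.params_billion}B"
        print(f"{stage.name:<10} {size:<6} {stage.note}")
    print(f"distillation shrinks the model about {compression_ratio(PIPELINE):.0f}x")
```

Laid out this way, the sizes show that distillation cuts the model roughly tenfold before action is added back in post-training.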
From complete beginner to embodied-AI algorithm engineer: the level-up journey
具身智能之心· 2025-11-20 04:02
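Today a long-time student landed an offer from a leading company and joked that the road from complete beginner to algorithm engineer was far from easy, but that there really is a way through. He started by buying an SO-100 and tinkering on his own, then followed a systematic learning route, which saved a lot of time and kept him out of many pitfalls.
Here we recommend several research routes in embodied AI, covering VLA, VLN, diffusion policy, reinforcement learning, and more.
VLA direction
A VLA-based robot system mainly consists of a visual perception module, language-instruction understanding, and a policy network that generates actions the robot can execute. Depending on requirements, current VLA work falls into three paradigms: explicit end-to-end VLA, implicit end-to-end VLA, and hierarchical end-to-end VLA.
Explicit end-to-end VLA is the most common and classic paradigm. It typically compresses vision and language information into a joint representation, then maps that representation into the action space to generate the corresponding action. This paradigm builds on extensive prior research. Different architectures (diffusion / transformer / DiT), model sizes, application settings (2D / 3D), and task requirements (training from scratch / downstream fine-tuning) have produced a variety of solutions with good performance.
Implicit end-to-end VLA differs from the former in placing more emphasis on interpretability, and aims to use current video d ...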
From technical routes to personnel changes: why is intelligent driving coining new terms again?
36Kr· 2025-11-19 12:19
Core Insights
- The automotive and intelligent driving industry is experiencing rapid technological iterations, leading to new terminologies and concepts that challenge user understanding and acceptance [1]
- The transition from rule-based systems to end-to-end and world model architectures is reshaping the landscape of autonomous driving, with significant implications for company strategies and personnel [2][4][10]
Industry Trends
- The shift towards end-to-end systems, exemplified by Tesla's FSD V12, has prompted other companies like Huawei, Xpeng, and NIO to explore similar approaches, indicating a trend towards more integrated solutions [2][4]
- The industry recognizes the upcoming critical period for the implementation of advanced driver assistance technologies, particularly from Q4 2023 to mid-2024, as companies race to adopt and refine these technologies [1]
Technical Developments
- Current autonomous driving systems, whether rule-based or end-to-end, primarily rely on mimicking human driving through extensive data collection and learning, which presents challenges in efficiency and adaptability [4][5]
- The introduction of VLA (vision-language-action) models aims to enhance understanding of the physical world, moving beyond mere imitation to a more human-like comprehension of driving scenarios [7][11]
Company Strategies
- Companies like Xpeng and Li Auto are pivoting towards VLA models, with Xpeng's second-generation VLA eliminating the language translation step to improve efficiency and data utilization [8][11]
- The restructuring of R&D departments within companies such as Li Auto and NIO reflects a strategic shift towards prioritizing VLA and world model approaches, indicating a broader industry trend towards adapting organizational structures to new technological demands [15][17]
Competitive Landscape
- The competition between self-developed autonomous driving technologies and third-party solutions is intensifying, with companies increasingly opting for partnerships with specialized suppliers to enhance their capabilities [18][21]
- The financial burden of self-development is prompting companies to reconsider their strategies, as seen in Xpeng's significant investment in computing resources and the need for profitability in Q4 2023 [19][22]
From technical routes to personnel changes: why is intelligent driving coining new terms again? | 电厂
Sina Finance· 2025-11-19 10:20
Core Insights
- The automotive and smart driving industry is experiencing rapid technological iterations, leading to new terminologies and concepts that challenge user understanding and acceptance [1]
- The transition from rule-based systems to end-to-end and world model architectures is reshaping the industry, with significant implications for company strategies and personnel [2][6]
Group 1: Technological Evolution
- The shift from rule-based to end-to-end systems has highlighted the limitations of modular approaches, particularly in terms of latency and information loss [2]
- Tesla's introduction of the end-to-end FSD V12 has sparked interest among other companies like Huawei, Xpeng, and NIO, who are also developing similar solutions [2][5]
- The industry is moving towards VLA (vision-language-action) models, which aim to better understand the physical world and improve driving actions [8][12]
Group 2: Challenges in Implementation
- Current systems, whether rule-based or end-to-end, rely heavily on passive learning from vast amounts of driving data, which limits their ability to adapt to new scenarios [5][6]
- The VLA model faces challenges such as multi-modal feature alignment and the inherent limitations of language models in processing complex real-world situations [11][15]
- Companies like Li Auto and Xpeng are exploring innovative VLA approaches to enhance their systems' capabilities and efficiency [8][12]
Group 3: Organizational Adjustments
- The transition to new technological routes has led to significant organizational restructuring within companies like Xpeng, Li Auto, and NIO, reflecting a shift in focus towards foundational models [13][14]
- Xpeng's leadership changes indicate a strategic pivot from traditional VLA to innovative VLA, emphasizing the need for a robust foundational model [14]
- NIO and Li Auto have also undergone multiple organizational adjustments to align their resources with the evolving technological landscape [15][17]
Group 4: Competitive Landscape
- Autonomous driving development is shifting from in-house research towards partnerships with specialized suppliers, as seen with companies like Chery and Great Wall [18][19]
- Suppliers are gaining an edge in flexibility and rapid iteration capabilities compared to traditional automakers, which face constraints in their development processes [21]
- The competition is intensifying, with suppliers expected to play a more dominant role in the market as they advance their solutions [18][22]
Judging from submissions, embodied-AI papers are already piling up.......
具身智能之心· 2025-11-18 10:00
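Several conferences have recently closed submissions. Results are not out yet, but the submission volume is very large. Many people are now rushing to resubmit to other venues. Which conference suits me better? Which directions do reviewers favor? These are common concerns, raised by people from large models, traditional robotics, and mechanical engineering, as well as many newcomers.
Start with the main embodied-AI directions: VLN, VLA, reinforcement learning, and real2sim2real. Many beginners do not know where to begin. Reinforcement learning or VLA? Traditional SLAM or VLN? Which directions need substantial compute and which do not? Beyond that, what kind of robot platform suits my research, what if the budget is tight, and is simulation enough?
Humanoid robots are fairly active in reinforcement learning and sim2real/real2sim2real research. If your lab has such a platform, these directions are a good place to start.
Why choose us?
The rest comes down to methodology, where a good idea is crucial. For many new researchers, arriving at a good idea takes many missteps. If you are new and unsure how to get started, take a look at our paper mentoring service.
Paper mentoring is now live
[具身智能之心 paper mentoring is now available: 1-on-1 customized mentoring for top-conference directions including multimodal large models, VLA, reinforcement learning, VLN, teleoperation, data collection, robot simulation, real2sim2real, end-to-end, and diffusion]
Mentoring range: CCF-A to ...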
From toddling to the catwalk: what large models have done for humanoid robots
新财富· 2025-11-18 08:06
Group 1
- The core viewpoint of the article highlights the advancements in humanoid robots, particularly the release of various models like Figure 03, 1X NEO, and others, despite the delay of Tesla's Optimus Gen3 until 2026 [2]
- The article emphasizes the significant improvement in the movement capabilities of humanoid robots, evolving from awkward movements to more natural and graceful actions, largely due to the development of humanoid robot large models [2]
- The article discusses the transition from Large Language Models (LLM) to Vision-Language Models (VLM) and finally to Vision-Language-Action Models (VLA), which integrate perception, understanding, and action in a unified framework [6][8]
Group 2
- Google DeepMind introduced VLA with RT-2, which enhances robotic control by integrating visual and language information with action tokens, achieving a success rate improvement from 32% to 62% compared to its predecessor RT-1 [10]
- Tesla's Optimus leverages its Full Self-Driving (FSD) model, transitioning to an end-to-end approach that simplifies input complexity while managing a vast amount of data for training [13][15]
- NVIDIA's GR00T N1 model represents a comprehensive approach to humanoid robotics, combining hardware, software, and ecosystem development, emphasizing the importance of virtual environments for data collection and training [19][22]
Group 3
- The article mentions that various startups are utilizing NVIDIA's large models and Cosmos for their robotic solutions, highlighting the competitive landscape in the humanoid robotics sector [24]
- Wang Xingxing expresses skepticism about the VLA architecture, pointing out the inadequacy of existing data quality and quantity for effective real-world interaction, suggesting a need for better model architecture [26][27]