VLA
The 具身智能之心 technical exchange group has been established!
具身智能之心· 2025-11-26 10:00
Group 1 - The establishment of a technical exchange group focused on embodied intelligence, covering areas such as VLA, VLN, teleoperation, Diffusion Policy, reinforcement learning, VLA+RL, sim2real, multimodal large models, simulation, motion control, target navigation, mapping and localization, and navigation [1] - Interested individuals can add the assistant's WeChat AIDriver005 to join the community [2] - To expedite the joining process, applicants are advised to include a note with their institution/school, name, and research direction [3]
Recruiting partners for the VLA+RL direction~
具身智能之心· 2025-11-24 10:02
Group 1 - The article announces the recruitment of instructors for courses and projects related to VLA (Vision-Language-Action) and RL (Reinforcement Learning) within the community [1] - The community seeks candidates with a research focus on VLA and RL, preferably holding a PhD or currently enrolled in a doctoral program, and with publications at top academic conferences [2] - For industry candidates, practical experience and hands-on debugging experience with real robots are desired [2] Group 2 - The community, 具身智能之心 ("Embodied Intelligence Heart"), is identified as the first comprehensive technical exchange community in China focusing on VLA and RL, and has gathered a large number of students in these fields [3] - It offers compensation above the industry average along with abundant industry resources for recruited instructors [4] - For further details, interested individuals are encouraged to add the specified WeChat contact for inquiries [5]
Xiaomi's cognition-driven intelligent driving: from end-to-end and world models to VLA......
自动驾驶之心· 2025-11-24 00:03
Core Viewpoint - Xiaomi is making significant investments in intelligent driving technology, focusing on safety, comfort, and efficiency, with safety as the top priority in its development strategy [4][7]. Development Progress - Xiaomi's intelligent driving has progressed through several versions: from high-precision-map highway NOA (version 24.3) to urban NOA (version 24.5), and on toward light-map and map-free versions (version 24.10) [7]. - The company frames intelligent driving in three stages: 1.0 (rule-driven), 2.0 (data-driven), and 3.0 (cognition-driven), with a focus on VLA (Vision-Language-Action) for the next production phase [7][10]. World Model Features - The world model introduced by Xiaomi has three essential characteristics: diversity in generated scenarios, multimodal input and output, and interactive capabilities that influence vehicle behavior [8][9]. - The world model is designed to improve model performance through cloud-based data generation, closed-loop simulation, and reinforcement learning, rather than producing action outputs directly on the vehicle [10]. VLA and Learning Models - VLA is described as an enhancement over end-to-end learning, integrating high-level human knowledge (traffic rules, values) into the driving model [13]. - Xiaomi's development roadmap spans several model-training stages, from LLM pre-training to embodied pre-training, with recent advancements in the MiMo and MiMo-VL models [13]. Community and Knowledge Sharing - The "Autonomous Driving Heart Knowledge Planet" community aims to provide a comprehensive platform for learning and sharing knowledge in the field of autonomous driving, with over 4,000 members and plans to expand [15][26]. - The community offers resources such as technical roadmaps, video tutorials, and Q&A sessions for both beginners and advanced learners in the autonomous driving sector [27][30].
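The closed-loop use of a world model described above — generate diverse scenarios in the cloud, simulate interactively, and improve the driving policy by reinforcement learning rather than letting the world model act on the vehicle — can be sketched as follows. All class and function names here are hypothetical illustrations, not Xiaomi's actual implementation.

```python
import random

class ToyWorldModel:
    """Hypothetical world model: samples diverse scenarios and predicts
    the next state given the current state and a candidate action."""

    def generate_scenario(self):
        # Diversity: randomize the starting condition (obstacle location).
        return {"position": 0.0, "obstacle_at": random.uniform(5.0, 15.0)}

    def step(self, state, action):
        # Interactivity: the policy's action influences the predicted rollout.
        next_state = dict(state)
        next_state["position"] += action
        crashed = abs(next_state["position"] - state["obstacle_at"]) < 0.5
        reward = -10.0 if crashed else 1.0  # penalize collisions
        return next_state, reward, crashed

def train_policy_in_simulation(world_model, policy, episodes=100):
    """Closed-loop evaluation: the driving policy is scored against the
    world model in the cloud; the model itself emits no vehicle actions."""
    total = 0.0
    for _ in range(episodes):
        state = world_model.generate_scenario()
        for _ in range(20):
            action = policy(state)
            state, reward, done = world_model.step(state, action)
            total += reward
            if done:
                break
    return total / episodes
```

Usage: `train_policy_in_simulation(ToyWorldModel(), lambda s: 0.4)` returns an average episode return that an RL loop would then maximize.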
Students in the VLA+RL direction, take a look~
具身智能之心· 2025-11-21 00:04
Group 1 - The article discusses the recruitment of instructors for courses and projects related to VLA (Vision-Language-Action) and RL (Reinforcement Learning) within the community [1] - The community seeks candidates with a research focus on VLA and RL, preferably holding a PhD or currently enrolled in a doctoral program, and with publications at top academic conferences [2] - For industry candidates, practical experience and hands-on debugging experience with real robots are desired [2] Group 2 - The community, 具身智能之心 ("Embodied Intelligence Heart"), is the first comprehensive technical exchange community in China, gathering a large number of individuals focused on VLA and RL [3] - It offers compensation above the industry average along with abundant industry resources for recruited instructors [4] - For more detailed information, interested individuals are encouraged to add the specified WeChat contact for consultation [5]
The three major technical routes for autonomous driving: end-to-end, VLA, and world models
自动驾驶之心· 2025-11-21 00:04
Overview - The article discusses the ongoing technological competition in the autonomous driving industry, focusing on different approaches to solving corner cases and enhancing safety and efficiency in driving systems [1][3]. Technological Approaches - There is a debate between two main technological routes: VLM (Vision-Language Model)-assisted systems and VLA (Vision-Language-Action) end-to-end systems [1]. - Major companies like Waymo use VLM, letting AI handle environmental understanding and reasoning while traditional modules retain decision-making control for safety [1]. - Companies such as Tesla, Geely, and XPeng are exploring VLA, aiming for AI to learn all driving skills through extensive data training for end-to-end decision-making [1]. Sensor and Algorithm Developments - The article highlights the evolution of perception technologies, with BEV (Bird's Eye View) perception becoming mainstream by 2022 and OCC (Occupancy) perception gaining traction in 2023 [3][5]. - BEV integrates data from multiple sensors into a unified spatial representation, facilitating better path planning and dynamic information fusion [8][14]. - OCC perception provides fine-grained occupancy data, clarifying the probability of space being occupied over time, which enhances dynamic interaction modeling [6][14]. Modular and End-to-End Systems - Before the advent of multimodal large models and end-to-end autonomous driving, perception and prediction tasks were typically handled by separate modules [5]. - The article outlines a phased approach to modularization, where perception, prediction, decision-making, and control are distinct yet interconnected [4][31]. - End-to-end systems aim to streamline the process by mapping raw sensor inputs directly to actionable outputs, improving efficiency and reducing information bottlenecks [20][25].
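The occupancy idea behind OCC perception — a discretized grid whose cells carry a probability of being occupied, updated as fused sensor readings arrive — can be sketched with a standard log-odds update. This is only an illustration of the concept; it is not any production stack's algorithm, and the function names are hypothetical.

```python
import numpy as np

def update_occupancy(grid, observations, p_hit=0.7, p_miss=0.4):
    """Toy log-odds occupancy update over a BEV-style grid.

    `grid` holds per-cell log-odds of occupancy; each observation is a
    (row, col, occupied) tuple from a fused sensor reading."""
    l_hit = np.log(p_hit / (1 - p_hit))    # evidence for occupied
    l_miss = np.log(p_miss / (1 - p_miss))  # evidence for free
    for r, c, occupied in observations:
        grid[r, c] += l_hit if occupied else l_miss
    return grid

def occupancy_prob(grid):
    """Convert log-odds back to occupancy probabilities in [0, 1]."""
    return 1.0 / (1.0 + np.exp(-grid))
```

Repeated "hit" observations push a cell's probability toward 1, while unobserved cells stay at the 0.5 prior — which is how occupancy over time clarifies which space is drivable.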
VLA and VLM Frameworks - VLA (Visual-Language-Action) and VLM (Visual-Language Model) frameworks are discussed, with VLA focusing on understanding complex scenes and making autonomous decisions based on visual and language inputs [32][39]. - The article emphasizes the importance of language models in enhancing the interpretability and safety of autonomous driving systems, allowing for better cross-scenario knowledge transfer and decision-making [57]. Future Directions - The competition between VLA and WA (World Action) architectures is highlighted, with WA emphasizing direct visual-to-action mapping without language mediation [55][56]. - The article suggests that the future of autonomous driving will involve integrating world models that understand physical laws and temporal dynamics, addressing the limitations of current language models [34][54].
Comparing XPeng's and Li Auto's VLA based on accurate primary sources
理想TOP2· 2025-11-20 10:42
Core Viewpoint - The article discusses the advancements in autonomous driving technology, particularly focusing on the VLA (Vision-Language-Action) architecture developed by Li Auto and the insights shared by Xiaopeng's autonomous driving head, Liu Xianming, during a podcast. Liu emphasizes the removal of the intermediate language component (L) to enhance scalability and efficiency in data usage [1][4][5]. Summary by Sections VLA Architecture and Training Process - The VLA architecture involves a pre-training phase using a 32 billion parameter (32B) vision-language model that incorporates 3D vision and high-definition 2D vision, improving clarity by 3-5 times compared to open-source models. It also includes driving-related language data and key VL joint data [10][11]. - The model is distilled into a 3.2 billion parameter (3.2B) MoE model to ensure fast inference on vehicle hardware, followed by a post-training phase that integrates action to form the VLA, increasing the parameter count to nearly 4 billion [13][12]. - The reinforcement learning phase consists of two parts: human feedback reinforcement learning (RLHF) and pure reinforcement learning using world model-generated data, focusing on comfort, collision avoidance, and adherence to traffic regulations [15][16]. Data Utilization and Efficiency - Liu argues that using language as a supervisory signal can introduce human biases, reducing data efficiency and scalability. The most challenging data to collect are corner cases, which are crucial for training [4][6]. - The architecture aims to achieve a high level of generalization, with plans to implement L4 robotaxi services in Guangzhou based on the current framework [4][5]. Future Directions and Challenges - Liu acknowledges the uncertainties in scaling the technology and ensuring safety, questioning how to maintain safety standards and align the model with human behavior [5][18]. 
- The conversation highlights that the VLA, VLM, and world model are fundamentally end-to-end architectures, with various companies working on similar concepts in the realm of Physical AI [5][18]. Human-Agent Interaction - The driver agent is designed to process short commands directly, while complex instructions are sent to the cloud for processing before execution. This approach allows the system to understand and interact with the physical world like a human driver [17][18]. - The article concludes that the traffic domain is a suitable environment for VLA implementation due to its defined rules and the ability to model human driving behavior effectively [19][20].
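The pre-train-then-distill pipeline summarized above (a large cloud vision-language teacher compressed into a ~3.2B student for fast on-vehicle inference) is, in generic form, a knowledge-distillation objective. The sketch below is the standard KD loss, not Li Auto's actual recipe, and all names are illustrative.

```python
import numpy as np

def softmax(x, T=1.0):
    z = x / T
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Generic KD objective: cross-entropy against the teacher's softened
    distribution at temperature T, blended with hard-label cross-entropy.
    Illustrates 'distill a 32B teacher into a ~3.2B student' in miniature."""
    p_teacher = softmax(teacher_logits, T)
    log_p_student = np.log(softmax(student_logits, T) + 1e-12)
    soft = -(p_teacher * log_p_student).sum(axis=-1).mean() * T * T
    n = len(labels)
    log_p_hard = np.log(softmax(student_logits) + 1e-12)
    hard = -log_p_hard[np.arange(n), labels].mean()
    return alpha * soft + (1 - alpha) * hard
```

Minimizing this pulls the small student toward the teacher's behavior while still fitting the ground-truth labels; the T² factor keeps soft-target gradients comparable across temperatures.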
The level-up journey from complete novice to embodied-AI algorithm engineer
具身智能之心· 2025-11-20 04:02
Core Insights - The article discusses the evolution and research directions in Vision-Language-Action (VLA), Vision-and-Language Navigation (VLN), and reinforcement learning in robotics, highlighting the importance of these technologies in enhancing robot capabilities and performance [1][2][5][9]. VLA Direction - VLA systems consist of visual perception processing, language instruction understanding, and action policy networks, categorized into three paradigms: explicit end-to-end VLA, implicit end-to-end VLA, and hierarchical end-to-end VLA [1][2]. - Explicit end-to-end VLA compresses visual and language information into a joint representation that is then mapped to the action space, leveraging various architectures and models to achieve good performance [1]. - Implicit end-to-end VLA improves interpretability by predicting future states with video diffusion models, enhancing the potential for scaling VLA models [2]. - Hierarchical end-to-end VLA aims to exploit the generalization of large models while keeping downstream execution efficient [2]. VLN Direction - VLN systems are composed of visual-language encoders, environmental history representation, and action policies, requiring effective compression of visual and language inputs [5][6]. - The choice of encoder, and whether to project visual and language representations into a common space, are critical design questions, with current trends favoring models pre-trained on large datasets and the use of large language models (LLMs) for instruction decomposition [6]. - VLN is a sequential decision-making task in which robots accumulate historical information to inform future actions, with implicit methods representing past information as latent variables [6]. - Object Navigation within VLN emphasizes identifying target objects from category information alone, reducing the need for detailed instructions and strengthening exploration capabilities [7].
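The explicit end-to-end VLA paradigm described above — compress vision and language into a joint representation, then map it to the action space — reduces to a very small forward pass. The sketch below uses random projections as stand-ins for trained encoders; every module name and dimension is a hypothetical illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp_layer(in_dim, out_dim, relu=True):
    """Random projection standing in for a trained network layer."""
    W = rng.normal(scale=in_dim ** -0.5, size=(in_dim, out_dim))
    if relu:
        return lambda x: np.maximum(x @ W, 0.0)
    return lambda x: x @ W

# Hypothetical stand-ins for the three VLA components:
vision_encoder = mlp_layer(512, 256)             # visual perception
language_encoder = mlp_layer(300, 256)           # instruction understanding
action_head = mlp_layer(512, 7, relu=False)      # e.g. a 7-DoF arm command

def explicit_vla(image_feat, text_feat):
    """Explicit end-to-end VLA: fuse vision and language into one joint
    representation, then map it directly to the action space."""
    joint = np.concatenate([vision_encoder(image_feat),
                            language_encoder(text_feat)])
    return action_head(joint)
```

The implicit and hierarchical paradigms differ mainly in what sits between `joint` and the action output: a future-state predictor (e.g. a video diffusion model) or a high-level planner delegating to a fast low-level controller.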
Reinforcement Learning & Legged Robots - Reinforcement learning is crucial for legged robots, covering kinematics, dynamics, multi-modal sensor fusion, and advanced algorithms for task adaptation [9][10]. - Key areas include gait planning, balance control for bipedal robots, and the application of deep reinforcement learning and imitation learning for multi-task training [10]. - Techniques like domain randomization and safety mechanisms are essential for successful real-world deployment of robotic systems [10]. Diffusion Policy - The introduction of diffusion models in robotics has led to significant advances, with Diffusion Policy achieving an average performance improvement of 46.9% across various simulation environments [21][22]. - The Robotics Diffusion Transformer (RDT), with 1.2 billion parameters, shows strong zero-shot generalization and the ability to learn new skills from minimal examples [22]. - Diffusion-policy applications are expanding beyond robotic manipulation to areas such as autonomous navigation and dexterous grasping, improving task success rates through real-time environmental adaptation [22][23]. - Recent developments include advances in 3D applications and the integration of safety constraints and online reinforcement learning, opening new research avenues [23].
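The inference-time idea behind Diffusion Policy — start from Gaussian noise in action space and iteratively denoise it, conditioned on the observation — can be sketched in a few lines. The "denoiser" below is a toy that simply points back toward an observation-dependent target; in a real Diffusion Policy this role is played by a noise-prediction network trained on demonstrations, so everything here is a hypothetical stand-in.

```python
import numpy as np

def toy_denoiser(noisy_action, obs, t):
    """Stand-in for a trained noise-prediction network: returns the
    direction from the noisy action toward an observation-derived target."""
    target = np.tanh(obs[:noisy_action.shape[0]])
    return noisy_action - target  # predicted noise direction

def sample_action(obs, action_dim=2, steps=200, seed=0):
    """DDPM-style iterative denoising in action space: begin with pure
    noise and take small denoising steps conditioned on the observation."""
    rng = np.random.default_rng(seed)
    a = rng.normal(size=action_dim)        # start from Gaussian noise
    for t in range(steps, 0, -1):
        eps = toy_denoiser(a, obs, t)
        a = a - (1.0 / steps) * eps        # small step toward the data manifold
    return a
```

Because denoising is conditioned on the current observation at every step, the sampled action adapts to the environment in real time, which is the property the success-rate gains above rely on.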
From technical routes to personnel changes: why is intelligent driving "coining new terms" again?
36Kr· 2025-11-19 12:19
Core Insights - The automotive and intelligent driving industry is experiencing rapid technological iterations, leading to new terminologies and concepts that challenge user understanding and acceptance [1] - The transition from rule-based systems to end-to-end and world model architectures is reshaping the landscape of autonomous driving, with significant implications for company strategies and personnel [2][4][10] Industry Trends - The shift towards end-to-end systems, exemplified by Tesla's FSD V12, has prompted other companies like Huawei, Xpeng, and NIO to explore similar approaches, indicating a trend towards more integrated solutions [2][4] - The industry recognizes the upcoming critical period for the implementation of advanced driver assistance technologies, particularly from Q4 2023 to mid-2024, as companies race to adopt and refine these technologies [1] Technical Developments - Current autonomous driving systems, whether rule-based or end-to-end, primarily rely on mimicking human driving through extensive data collection and learning, which presents challenges in efficiency and adaptability [4][5] - The introduction of VLA (vision-language-action) models aims to enhance understanding of the physical world, moving beyond mere imitation to a more human-like comprehension of driving scenarios [7][11] Company Strategies - Companies like Xpeng and Li Auto are pivoting towards VLA models, with Xpeng's second-generation VLA eliminating the language translation step to improve efficiency and data utilization [8][11] - The restructuring of R&D departments within companies such as Li Auto and NIO reflects a strategic shift towards prioritizing VLA and world model approaches, indicating a broader industry trend towards adapting organizational structures to new technological demands [15][17] Competitive Landscape - The competition between self-developed autonomous driving technologies and third-party solutions is intensifying, with companies increasingly opting for partnerships with specialized suppliers to enhance their capabilities [18][21] - The financial burden of self-development is prompting companies to reconsider their strategies, as seen in Xpeng's significant investment in computing resources and the need for profitability in Q4 2023 [19][22]
From technical routes to personnel changes: why is intelligent driving "coining new terms" again? | 电厂
Sina Finance (新浪财经)· 2025-11-19 10:20
Core Insights - The automotive and smart driving industry is experiencing rapid technological iterations, leading to new terminologies and concepts that challenge user understanding and acceptance [1] - The transition from rule-based systems to end-to-end and world model architectures is reshaping the industry, with significant implications for company strategies and personnel [2][6] Group 1: Technological Evolution - The shift from rule-based to end-to-end systems has highlighted the limitations of modular approaches, particularly in terms of latency and information loss [2] - Tesla's introduction of the end-to-end FSD V12 has sparked interest among other companies like Huawei, Xpeng, and NIO, who are also developing similar solutions [2][5] - The industry is moving towards VLA (vision-language-action) models, which aim to better understand the physical world and improve driving actions [8][12] Group 2: Challenges in Implementation - Current systems, whether rule-based or end-to-end, rely heavily on passive learning from vast amounts of driving data, which limits their ability to adapt to new scenarios [5][6] - The VLA model faces challenges such as multi-modal feature alignment and the inherent limitations of language models in processing complex real-world situations [11][15] - Companies like Li Auto and Xpeng are exploring innovative VLA approaches to enhance their systems' capabilities and efficiency [8][12] Group 3: Organizational Adjustments - The transition to new technological routes has led to significant organizational restructuring within companies like Xpeng, Li Auto, and NIO, reflecting a shift in focus towards foundational models [13][14] - Xpeng's leadership changes indicate a strategic pivot from traditional VLA to innovative VLA, emphasizing the need for a robust foundational model [14] - NIO and Li Auto have also undergone multiple organizational adjustments to align their resources with the evolving technological landscape [15][17]
Group 4: Competitive Landscape - The trend of self-research in autonomous driving technology is shifting towards partnerships with specialized suppliers, as seen with companies like Chery and Great Wall [18][19] - Suppliers are gaining an edge in flexibility and rapid iteration capabilities compared to traditional automakers, which face constraints in their development processes [21] - The competition is intensifying, with suppliers expected to play a more dominant role in the market as they advance their solutions [18][22]
Judging from the submissions, embodied-AI papers are already piling up.......
具身智能之心· 2025-11-18 10:00
Core Insights - The article discusses the increasing number of submissions to various conferences and researchers' concerns about the suitability of different conferences and the preferences of reviewers [1] - It highlights the active research directions in embodied intelligence, including VLN, VLA, reinforcement learning, and real2sim2real, and provides guidance for newcomers on how to choose their research focus [1][3] - The article promotes a customized paper mentoring service aimed at helping researchers navigate the complexities of paper writing and submission [3][4][5] Group 1 - The article notes that many researchers are anxious about selecting the right conference and understanding which research directions are favored by reviewers [1] - It emphasizes that humanoid robots are particularly active in reinforcement learning and sim2real/real2sim2real research, suggesting that labs with the relevant embodiments should explore these areas [1] - It mentions that robotic-arm embodiments are suitable for VLA, VLA+RL, and diffusion policy research, with VLA carrying high compute requirements [1] Group 2 - The article states that quadrupedal robots are also suitable for reinforcement learning research, although extensive prior work in this area leaves fewer points of novelty [2] - It suggests that combining VLN and VLA with mobile manipulation could be a promising research direction [3] - The article introduces a paper mentoring service that offers one-on-one customized guidance across various top-tier conference topics, emphasizing the importance of having a good idea and avoiding common pitfalls for new researchers [3][4] Group 3 - The mentoring service covers the full process from topic innovation to experimental design, code debugging, paper writing, and submission strategy, aimed at producing high-quality results quickly [4] - It highlights a dual perspective of industrial and academic value, focusing not only on publishing papers but also on practical applications [5] - The article offers a free matching service for the first ten inquiries, allowing researchers to have in-depth meetings with mentors matched to their research direction and academic background [6]