With just the SO-100, you can actually complete this many VLA hands-on projects...
具身智能之心· 2025-12-13 01:02
Core Viewpoint
- The article discusses the challenges and complexities faced by beginners in implementing VLA (Vision-Language-Action) models, emphasizing the need for practical experience and effective training methods to achieve successful deployment in real-world applications [2][4].

Group 1: Challenges in VLA Implementation
- Many students report difficulties in achieving effective results with open-source models like GR00T and PI0, despite low training loss in simulations [2][4].
- The transition from simulation to real-world application (sim2real) poses significant challenges, particularly in data collection and model training [6][7].
- Beginners often struggle with the intricacies of VLA models, leading to prolonged trial and error without satisfactory outcomes [4][6].

Group 2: VLA Model Components
- Data collection methods for VLA primarily include imitation learning and reinforcement learning, with a focus on high-quality data acquisition [6].
- Training VLA models typically requires extensive simulation debugging, especially when real-world data is insufficient, using frameworks like MuJoCo and Isaac Gym [7].
- After training, models often require optimization techniques such as quantization and distillation to reduce parameter size while maintaining performance [9].

Group 3: Educational Initiatives
- The article introduces a practical course aimed at addressing the learning curve of VLA technologies, developed in collaboration with industry experts [10][12].
- The course covers a comprehensive range of topics, including hardware, data collection, VLA algorithms, and real-world experiments, designed to build practical skills [12][25].
- The course targets individuals seeking to enter or advance in the field of embodied intelligence, with prerequisites including foundational knowledge of Python and PyTorch [22].
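The post-training quantization mentioned above (shrinking a model's memory footprint while preserving performance) can be sketched with PyTorch's dynamic quantization API. The tiny policy network here is a hypothetical stand-in for illustration only; the models the article discusses (GR00T, PI0) are far larger transformer policies:

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for a VLA action head -- not a model from the article.
policy = nn.Sequential(
    nn.Linear(512, 256),
    nn.ReLU(),
    nn.Linear(256, 7),  # e.g. a 7-DoF arm action
)

# Dynamic quantization: weights stored as int8, activations quantized
# on the fly -- a common first step toward edge deployment.
quantized = torch.quantization.quantize_dynamic(
    policy, {nn.Linear}, dtype=torch.qint8
)

obs = torch.randn(1, 512)
action = quantized(obs)
print(tuple(action.shape))  # (1, 7)
```

Distillation, the other technique named, is complementary: it trains a smaller student network to match the larger model's outputs rather than compressing the weights in place.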
The global RL + VLA paradigm: this company's technical groundwork lies behind PI*0.6
具身智能之心· 2025-12-13 01:02
Core Insights
- The article discusses advancements in embodied intelligence, particularly the VLA (Vision-Language-Action) model and its integration with reinforcement learning (RL) to enhance robotic capabilities [2][4][50].

Group 1: Importance of VLA and RL
- VLA models are crucial for applying powerful visual-language models to robotic control, moving beyond mere imitation learning to achieve robust performance in novel situations [6][8].
- Traditional imitation learning is limited: robots struggle in unfamiliar scenarios, necessitating RL for continuous improvement through trial and error [8][12].

Group 2: Challenges in Applying RL to VLA
- Applying RL to VLA faces three main challenges: environmental differences, model instability, and computational demands [12][13].
- Directly applying RL to large VLA models can cause catastrophic forgetting and training collapse, making it difficult to maintain performance [12][13].

Group 3: iRe-VLA Model Design
- The iRe-VLA model features a two-stage iterative learning process, combining exploration through online RL with consolidation via supervised learning [16][21].
- The architecture includes a VLM backbone for understanding and an Action Head for producing control signals, optimized with LoRA to reduce computational load [17][18].

Group 4: Experimental Results
- Experiments in simulated environments (MetaWorld, Franka Kitchen) and real-world scenarios showed that iRe-VLA significantly outperformed traditional methods, with success rates improving from 43% to 83% on certain tasks [38][39].
- In real-world applications, the model's success rate for grasping previously unseen objects rose from 35% to 80% after training, demonstrating enhanced generalization [40][43].

Group 5: Conclusion and Future Directions
- The iRe-VLA approach presents a viable solution for deploying large models in robotic control, highlighting potential for ongoing research into efficient exploration and stable RL algorithms [48][50].
- The design allows effective resource allocation: local robots handle lightweight tasks while cloud servers manage heavier computation, aligning with practical deployment scenarios [54].
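The LoRA optimization cited in the model design above reduces the trainable parameter count of a weight matrix from d_out x d_in to r x (d_in + d_out) by learning only a low-rank correction while the pretrained weight stays frozen. A minimal NumPy sketch, with all dimensions chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

d_in, d_out, r = 768, 768, 8            # illustrative sizes; rank r << d
W = rng.standard_normal((d_out, d_in))  # frozen pretrained weight

# LoRA adapter: only A and B are trained; B starts at zero so the
# adapted layer initially matches the frozen backbone exactly.
A = rng.standard_normal((r, d_in)) * 0.01
B = np.zeros((d_out, r))

def adapted_forward(x):
    return W @ x + B @ (A @ x)          # W itself is never modified

x = rng.standard_normal(d_in)
assert np.allclose(adapted_forward(x), W @ x)  # B == 0 => identical output

full = d_out * d_in                     # 589824 params for full fine-tuning
lora = r * (d_in + d_out)               # 12288 trainable params with LoRA
print(f"trainable: {lora} vs full fine-tune: {full}")
```

This roughly 48x reduction in trainable parameters is why the frozen backbone also resists catastrophic forgetting: the pretrained weights are never overwritten.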
The global RL + VLA paradigm: this Chinese company's technical groundwork lies behind PI*0.6
机器之心· 2025-12-12 03:41
Core Insights
- The article discusses the significance of integrating Vision-Language-Action (VLA) models with Reinforcement Learning (RL) in embodied AI, emphasizing the limitations of imitation learning and the need for more robust learning methods [1][2][4].

Group 1: Importance of VLA+RL
- VLA models are being developed to apply powerful Vision-Language Models (VLMs) to robot control, primarily through supervised fine-tuning (SFT) [2].
- Imitation learning alone is insufficient for robots to handle novel situations, necessitating RL to enhance robustness and persistence in task execution [4].

Group 2: Challenges in Applying RL to VLA
- The integration of RL with VLA faces three main challenges: environmental differences, model instability, and computational demands [6].
- Direct application of RL algorithms to large VLA models can lead to catastrophic forgetting and training collapse, making it difficult to maintain performance [6].

Group 3: Solutions to VLA's RL Challenges
- The industry has proposed three types of solutions to these challenges, with a focus on internalizing high-value behaviors through SFT [7][13].
- The iRe-VLA model introduces a two-phase iterative learning process that alternates between online RL for exploration and supervised learning for consolidation [10][15].

Group 4: iRe-VLA Model Architecture
- The iRe-VLA model consists of a VLM backbone for understanding images and instructions, and an Action Head for translating features into control signals [11].
- Low-Rank Adaptation (LoRA) allows efficient training without full-model fine-tuning [12].

Group 5: Experimental Results and Analysis
- Extensive experiments in both simulated environments and real-world scenarios demonstrate the effectiveness of the iRe-VLA method, with significant improvements in task success rates [26][30].
- The iRe-VLA model outperformed traditional methods, with success rates rising from 43% to 83% on benchmark tasks [30].

Group 6: Conclusion and Future Implications
- The article concludes that the iRe-VLA approach provides a viable solution to the challenges of deploying large models in robotic control, ensuring stability and continuous learning [37][42].
- Future research directions include efficient exploration and learning of new skills under sparse rewards, as well as scalable RL algorithms for large VLA models [40].
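The two-phase iterative process described above (online RL to discover successful behavior, then supervised learning to consolidate it) can be sketched schematically. Everything below is a toy stand-in with hypothetical stubs, not the paper's code; the "policy" is a single success-probability number purely so the loop runs end to end:

```python
import random

random.seed(0)

class ToyPolicy:
    """Hypothetical stand-in for a VLA policy with a supervised update hook."""
    def __init__(self):
        self.skill = 0.2                     # toy success probability
    def update_supervised(self, traj):
        self.skill = min(1.0, self.skill + 0.01)

def rollout(policy):
    """One episode; returns (trajectory, success flag)."""
    return "trajectory", random.random() < policy.skill

def explore_with_rl(policy, episodes=100):
    """Phase 1: online RL exploration; keep only successful trajectories."""
    kept = []
    for _ in range(episodes):
        traj, ok = rollout(policy)
        if ok:
            kept.append(traj)
    return kept

def consolidate(policy, buffer):
    """Phase 2: supervised fine-tuning on expert demos plus new successes,
    which is what guards the backbone against catastrophic forgetting."""
    for traj in buffer:
        policy.update_supervised(traj)
    return policy

def two_phase_loop(policy, expert_demos, iterations=5):
    buffer = list(expert_demos)
    for _ in range(iterations):
        buffer += explore_with_rl(policy)    # explore
        policy = consolidate(policy, buffer) # consolidate
    return policy

policy = two_phase_loop(ToyPolicy(), ["expert_demo"] * 10)
print(f"final toy skill: {policy.skill:.2f}")
```

The key structural point the sketch captures is that RL never updates the policy directly; its successes are replayed through the supervised channel, which stabilizes training.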
A 25% efficiency gain: the data-collection bottleneck in dexterous manipulation, solved by an "arm-hand shared autonomy framework"
机器之心· 2025-12-11 10:00
Core Insights
- The article discusses significant advances toward dexterous manipulation in robotics through the Vision-Language-Action (VLA) model, addressing the critical challenge of acquiring high-quality training data [2][6].

Group 1: Key Contributions
- The research introduces a Shared Autonomy framework that divides control responsibilities between human operators and the autonomous AI system, significantly reducing cognitive load and data collection costs [2][12][15].
- The DexGrasp-VLA strategy is the foundational element of the Shared Autonomy framework, integrating multimodal inputs including tactile feedback, which enhances the robot's ability to grasp objects adaptively [9][20].
- The study establishes a complete technical system composed of four core modules, closing the loop from data collection to policy optimization [5][8].

Group 2: Data Collection and Efficiency
- The Shared Autonomy framework improves the efficiency of high-quality data collection by 25%, allowing more data to be collected per hour and compressing the development-deployment cycle to under one day [33].
- The framework demonstrates near-industrial performance, with an approximately 90% success rate in grasping over 50 different objects, helping dexterous manipulation move from concept validation to practical deployment [33].

Group 3: Mechanisms and Enhancements
- The Arm-Hand Feature Enhancement module models and integrates the kinematic differences between arm and hand, yielding more natural and robust coordination of macro and micro actions [16][19].
- The Corrective Human-in-the-Loop mechanism lets the robot learn from failures by incorporating human demonstrations of correct actions, continuously improving the policy and generalizing to edge cases [20][34].

Group 4: Future Directions
- Future research includes extending the framework to more complex tasks such as object reorientation and precise placement, and exploring intelligent fusion mechanisms to address challenges in tactile feedback [36].
- Autonomous error recognition and recovery through reinforcement learning is also discussed, aiming for a smooth transition from human-robot collaboration to full autonomy [36].
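One generic way to realize the division of control described above is to arbitrate between the operator's command and the policy's command with a blending weight. The sketch below is a textbook shared-autonomy blend on a toy 2-D velocity command, not the paper's actual controller; the weight schedule and dimensions are illustrative assumptions:

```python
import numpy as np

def shared_autonomy_blend(human_cmd, auto_cmd, alpha):
    """Arbitrate between operator and AI commands.
    alpha = 0 -> fully human (e.g. the arm's gross reaching motion),
    alpha = 1 -> fully autonomous (e.g. the hand's fine grasp)."""
    human_cmd = np.asarray(human_cmd, dtype=float)
    auto_cmd = np.asarray(auto_cmd, dtype=float)
    return (1.0 - alpha) * human_cmd + alpha * auto_cmd

# Toy example: human drives the reach, the policy refines the approach.
human = [0.10, 0.00]   # operator velocity command (x, y)
auto = [0.08, 0.02]    # policy velocity command (x, y)
blended = shared_autonomy_blend(human, auto, alpha=0.5)
print(blended)  # [0.09 0.01]
```

In an arm-hand split like the article's, alpha would differ per subsystem: low for the human-driven arm, high for the autonomously controlled hand, which is what offloads the operator's cognitive burden.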
Assisted driving's 2025: regulatory brakes, runaway technology, and a four-way rivalry among Horizon, DJI, Huawei, and Momenta
36Kr· 2025-12-11 09:55
Core Insights
- The automotive industry in 2025 has shifted markedly toward safety and responsibility, moving away from exaggerated claims about autonomous driving technology [1][3].
- China's Ministry of Industry and Information Technology has banned the term "autonomous driving," leading car manufacturers to portray the technology more realistically [3][5].

Industry Developments
- The narrative has changed: companies now speak of "assisted driving" and "intelligent driving assistance" instead of "autonomous driving" [3][5].
- The industry is characterized by two main trends: technological advancement and the democratization of intelligent driving [5][11].

Key Players and Innovations
- Xiaopeng Motors has introduced a second-generation VLA model that eliminates the "middleman" in the translation process, allowing the machine to understand physical environments directly [6][7].
- BYD launched the "Tian Shen Zhi Yan" high-level intelligent driving system, targeting the 100,000-yuan market with multiple versions, including features like highway NOA and automatic parking [11][13].
- Geely has also entered the market with its own intelligent driving system, offered in multiple versions with varying capabilities [11][13].

Competitive Landscape
- Tesla's role has evolved; Chinese companies no longer view it as the sole leader in intelligent driving technology [13][14].
- Horizon Robotics has gained traction with its end-to-end architecture, aims to make urban NOA widely available, and has achieved significant market share in the autonomous driving sector [19][21].
- DJI's subsidiary, Zhuoyue Technology, has focused on practical applications and made strides in the European market, showcasing its urban NOA capabilities [22][24].

Strategic Collaborations
- Huawei has formed numerous partnerships across the automotive industry, providing comprehensive intelligent driving solutions to various manufacturers [25][28].
- Momenta has expanded its collaboration network significantly, working with multiple brands to implement its driving assistance solutions [29][31].

Challenges and Future Outlook
- Despite these advances, the industry faces challenges around user trust and potential misuse of assisted driving systems [33][34].
- Intelligent driving technology is expected to keep evolving, with a focus on broader market accessibility while addressing safety and ethical concerns [35][36].
Can the SO-100 alone reproduce the results of π0 and π0.5?
具身智能之心· 2025-12-11 09:33
Core Viewpoint
- The article discusses the challenges and complexities faced by beginners in implementing VLA (Vision-Language-Action) models, emphasizing the need for practical experience and effective training methods to achieve successful deployment in real-world applications [2][4].

Group 1: Challenges in VLA Implementation
- Many students report difficulties in achieving effective results with open-source models like GR00T and PI0, despite low training loss in simulations [2][4].
- The transition from simulation to real-world application (sim2real) poses significant challenges, particularly in data collection and model training [6][7].
- Beginners often struggle with the intricacies of data collection, model training, and deployment, leading to frustration and lack of progress [4][10].

Group 2: VLA Model Components
- Data collection methods for VLA primarily include imitation learning and reinforcement learning, with a focus on high-quality data acquisition [6].
- Training VLA models typically requires simulation debugging and fine-tuning, especially when real-world data is limited [7].
- Deployment of VLA models necessitates optimization techniques such as model compression to ensure efficient performance on edge devices [9].

Group 3: Educational Initiatives
- The article introduces a practical course aimed at helping students learn VLA effectively, covering hardware, data collection, algorithms, and real-world experiments [10][12].
- The course is designed for individuals seeking to enter the field of embodied intelligence, providing hands-on experience and project support [22][25].
- The course will commence on December 30, 2025, and includes a comprehensive curriculum to enhance participants' skills in VLA [23][26].
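The imitation-learning route named above reduces, in its simplest form, to behavior cloning: regressing recorded expert actions from observations. A minimal PyTorch sketch on synthetic data; the network, dimensions, and random "teleoperation" dataset are illustrative assumptions, not the SO-100 pipeline:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Synthetic stand-in for a teleoperation dataset: observation -> expert action.
obs = torch.randn(256, 64)            # e.g. proprioception + vision features
expert_actions = torch.randn(256, 6)  # e.g. 6-DoF end-effector deltas

policy = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 6))
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

initial = loss_fn(policy(obs), expert_actions).item()
for _ in range(200):                  # behavior-cloning regression loop
    opt.zero_grad()
    loss = loss_fn(policy(obs), expert_actions)
    loss.backward()
    opt.step()
print(f"loss: {initial:.3f} -> {loss.item():.3f}")
```

This is also where the article's sim2real warning bites: a low training loss like the one above says nothing about performance under real-world observation shift, which is exactly what the students quoted in the article ran into.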
Agents will replace apps and SaaS: academician Zhang Yaqin shares these AI insights
Di Yi Cai Jing· 2025-12-10 05:56
Core Insights
- Within the next decade there will be more robots than humans, with a significant shift toward intelligent agents replacing traditional SaaS and applications [1][4].
- The new wave of artificial intelligence is characterized by the deep integration of information, physical, and biological intelligence, driving digital transformation across domains [1][3].

Group 1: Trends in AI Development
- Generative AI is rapidly evolving into agentic AI, with task complexity doubling in the past seven months and accuracy exceeding 50%, indicating alignment with human capabilities [3].
- The scaling law's returns are slowing in the pre-training phase, shifting focus to reasoning and agent-level intelligence post-training; reasoning costs have dropped to one-tenth while agent compute demands have grown tenfold [3].
- AI is moving from the information realm into the physical and biological worlds, exemplified by the expectation that 10% of new cars will feature autonomous driving capabilities by 2030 [3].

Group 2: Robotics and Intelligent Agents
- Robotics is viewed as the largest future market, with predictions that robots will outnumber humans within ten years, despite the current immaturity of humanoid robots [4].
- Intelligent agents are expected to replace traditional SaaS services and applications; one example is a medical agent network simulating a hospital environment that achieves high diagnostic accuracy [4].
- The goal of these agents is to assist rather than replace professionals; doctors, for instance, may have dedicated intelligent assistants in the future [4].

Group 3: Future Industry Landscape
- Foundational large models will serve as the operating systems of the AI era, reshaping industry structures much as Windows and Android did in their eras, with an anticipated industry scale two to three orders of magnitude larger than previous technological shifts [5].
- It is predicted that there will be no more than ten foundational large models globally, split between the US and China and supplemented by a few other countries, yielding a dual-track ecosystem of open-source and closed-source models [5].

Group 4: Path to AGI
- Achieving Artificial General Intelligence (AGI) will require new algorithmic frameworks, memory systems, and world models, with a potential paradigm shift in the next five years [6].
- A comprehensive breakthrough spanning information, physical, and biological intelligence is expected to take 15 to 20 years [6].
The generalization of VLA models exceeds your imagination: inference with a new camera and a new viewpoint handled with ease!
具身智能之心· 2025-12-04 03:10
Author: Weiqi Li et al.

VLA models perform well on in-distribution tasks, but their performance drops sharply under new camera viewpoints and visual perturbations. The research shows this brittleness stems mainly from alignment bias in spatial modeling rather than from physical modeling. To address it, researchers from Sun Yat-sen University and other institutions propose a one-shot adaptation framework that recalibrates visual representations through lightweight, learnable parameter updates. The first method, Feature Token Modulation (FTM), applies a global affine transformation to the visual tokens and, with only 4K parameters, raises viewpoint accuracy on the Libero dataset from 48.5% to 87.1%. Building on this, Feature Linear Adaptation (FLA) further introduces low-rank updates into the ViT encoder, reaching a 90.8% success rate with 4.7M parameters, matching LoRA-scale fine-tuning at far lower cost. These results indicate that pretrained VLA models hold substantial untapped robustness, and that targeted, minimal visual adaptation suffices to restore viewpoint generalization. Generalization of VLA models ...
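The Feature Token Modulation (FTM) described above is a single global affine transform shared across all visual tokens, which is why its parameter count is so small: two learnable vectors of the token dimension. A NumPy sketch with an illustrative token dimension chosen so the count lands at ~4K, matching the figure cited for the Libero experiments:

```python
import numpy as np

rng = np.random.default_rng(0)

d = 2048                                 # illustrative ViT token dimension
tokens = rng.standard_normal((196, d))   # e.g. 14x14 patch tokens

# FTM-style recalibration: one learnable scale (gamma) and shift (beta)
# shared across every token -- 2 * d = 4096 parameters in total.
gamma = np.ones(d)                       # initialized to the identity map
beta = np.zeros(d)

def modulate(x, gamma, beta):
    return gamma * x + beta              # broadcasts over the token axis

# At initialization the modulation is a no-op, so adaptation starts
# exactly from the pretrained model's behavior.
assert np.allclose(modulate(tokens, gamma, beta), tokens)
print(gamma.size + beta.size)  # 4096
```

FLA, by contrast, pushes low-rank updates into the encoder weights themselves, which is why its parameter count (4.7M) sits three orders of magnitude higher while still staying well below full fine-tuning.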
2025 Commercial Embodied Intelligence White Paper: intelligence opens the business future, embodiment unlocks unlimited possibility
Ai Rui Zi Xun· 2025-12-04 02:46
Investment Rating
- The report does not explicitly state an investment rating for the industry.

Core Insights
- Embodied intelligence is recognized as a significant direction in artificial intelligence and essential for achieving artificial general intelligence, characterized by strong interaction with the environment and continuous learning [5][12].
- The global market for embodied intelligence is projected to reach 19.2 billion RMB by 2025, with a compound annual growth rate (CAGR) of 73% over the next five years, pointing to a potential trillion-RMB market in about ten years [84][86].
- The development of embodied intelligence is seen as a critical battleground in the technological competition between China and the United States, with implications for economic benefits and national competitiveness [12][15].

Summary by Sections
1. Definition and Strategic Significance
- Embodied intelligence integrates machine learning, computer vision, and robotics, marking a significant step toward practical AI applications [5].
- It is defined as an intelligent system that interacts with the environment through a physical body, enabling perception, understanding, decision-making, and action [6].
2. Current Development Stages and Key Challenges
- The evolution of embodied intelligence falls into three phases: conceptual emergence (1950-2000), technological accumulation (2000-2020), and application expansion driven by large models (2020-present) [24][26].
- Key challenges include data collection, technology maturity, the cost of core components, and societal acceptance [28][29].
3. Global Market Trends
- The market is transitioning from L2 to L3 levels of autonomy, with significant advances expected in the next 2-3 years [52].
- Commercial breakthrough will depend on improvements in reliability, economic efficiency, accuracy, endurance, and latency [55].
4. Industry Value Chain and Market Forecast
- The industry value chain is complex, involving hardware, "brain," and integration components, with significant potential for Chinese companies in downstream applications [75].
- The report highlights a surge in financing for embodied intelligence companies, indicating strong investor interest and market potential [79].
5. Competitive Landscape and Key Success Factors
- Competition is intensifying between Chinese and American firms, with both sides leveraging distinct strengths in technology and policy support [24][25].
- The report emphasizes industry-wide collaboration to overcome existing bottlenecks and achieve large-scale commercialization [28][29].
6. Case Studies of Leading Companies
- The report does not provide specific case studies of leading companies in the industry.
Li Auto's self-developed AI inference chip M100 to enter vehicles next year
Sou Hu Cai Jing· 2025-11-27 01:31
Core Insights
- Li Auto reported total revenue of 27.4 billion yuan for Q3 2025, a year-on-year decline of 36.2%, and a net loss of 624.4 million yuan, versus a net profit of 2.8 billion yuan in the same period last year [1].

Financial Performance
- Total revenue for Q3 2025 was 27.4 billion yuan, down 36.2% year-on-year [1].
- The company posted a net loss of 624.4 million yuan, against a net profit of 2.8 billion yuan a year earlier [1].

Technological Developments
- The self-developed AI inference chip M100 is in large-scale system testing, with commercial deployment expected to begin next year [3].
- Integrated into the next-generation VLA autonomous driving system, the M100 is expected to deliver a cost-performance ratio more than three times that of current high-end chips [3].
- With the M100, the company aims to move vehicles from "passive tools" to "active service providers" by 2026 [3].

Product Innovations
- The VLA model will continue to iterate: OTA 8.0 focuses on safety-experience optimization, and OTA 8.1 will enhance perception capabilities [4].
- Planned innovations include the industry's first defensive automatic emergency braking (AEB) feature and a full-scenario parking function [4].
- The VLA model's capabilities have been validated with over 312 million kilometers of real driving data [4].

Chip Development Strategy
- Li Auto is concurrently developing two chip types: an AI inference chip for autonomous driving and a SiC power chip for motor control [4].
- The AI inference chip architecture is similar to Tesla's Hardware 5.0, features approximately 40 billion transistors, and is expected to enter mass production in 2026 [4].