Vision-Language-Action (VLA) Large Models
Embodied Intelligence Year in Review: A Fierce Collision Between Bubble and Reality
Tencent Research Institute · 2026-02-26 09:03
Core Viewpoint
- The year 2025 is characterized as a pivotal year for embodied intelligence, marked by significant technological advances and increased capital investment, despite ongoing challenges in supply chain restructuring and business model validation [4][5].

Investment Landscape
- In 2025, China's embodied intelligence and robotics sector saw 325 investment events totaling 39.832 billion RMB, a substantial increase over 2024 [7].
- The investor mix has shifted, with corporate venture capital (CVC) gaining prominence over traditional financial venture capital (VC) as major internet companies actively invest in the sector [6][8].
- Major players such as Alibaba, Meituan, and Tencent have invested in numerous companies across the embodied intelligence supply chain, focusing on strategic alignment with their own business needs [8].

Market Dynamics
- The top 10 companies in the sector captured nearly 41% of total financing, highlighting a growing disparity in resource allocation [8].
- Startups lacking core technological advantages face increasing difficulty in securing funding, as investor focus shifts from team backgrounds to delivery capabilities [9].

Technological Advancements
- The maturing of Vision-Language-Action (VLA) models has significantly enhanced robots' ability to understand natural language commands and perform tasks, marking a major technological breakthrough [13][14].
- Despite these advances, challenges remain in execution, particularly physical manipulation and adaptability in unstructured environments [16][17].

Industry Challenges
- The industry faces a mismatch between supply and demand: many orders flow to educational projects rather than industrial applications, leaving potential industrial clients cautious [10][11].
- Current embodied intelligence products still show gaps in engineering reliability and industrial standards, necessitating further development [11].

Future Outlook
- By 2026, the industry is expected to transition from a technology-competition phase to a commercialization phase, with a focus on cost-effectiveness and return on investment [20][21].
- Resources are likely to concentrate geographically in regions such as the Pearl River Delta and Yangtze River Delta, which offer advantages in hardware supply chains and talent density [23].
- The market is anticipated to undergo significant reshaping, with weaker companies facing elimination as the capital market returns to rationality [24].
NeurIPS 2025 | Tsinghua Team Analyzes How RL Improves VLA Generalization
具身智能之心 · 2025-10-15 04:00
Core Insights
- The article discusses the potential of Vision-Language-Action (VLA) models in embodied intelligence and highlights the limitations of current supervised fine-tuning (SFT) methods in achieving human-like generalization. It emphasizes the advantages of Reinforcement Learning (RL) in enhancing the generalization capabilities of VLA models [1][3].

Group 1: Research Findings
- A new evaluation benchmark was created to address the limited generalization of VLA models, comparing how RL and SFT improve model robustness across visual, semantic, and execution challenges [3][19].
- Experiments showed that RL algorithms such as Proximal Policy Optimization (PPO) significantly improved robustness in semantic understanding and task execution, while matching SFT's performance in visually varied scenarios [3][12].

Group 2: Methodology
- The research used the open-source OpenVLA model, fine-tuned from Llama2-7b, in experiments mapping RGB images to action tokens for robotic control [6].
- Three RL methods were tested: PPO, Direct Preference Optimization (DPO), and Group Relative Policy Optimization (GRPO), with PPO showing notable advantages in multi-step decision tasks [8][15].

Group 3: PPO Training Innovations
- The research team proposed three key innovations for efficient PPO training:
  1. A shared Actor-Critic architecture that reduced memory usage by 45% and improved training speed by 35% [12][14].
  2. A warm-up strategy using 140 high-quality trajectories that improved convergence speed by 50% [14].
  3. Limiting PPO training to a single epoch, which was sufficient for performance without increasing training time [14].

Group 4: Comparison of SFT and RL
- While SFT performance plateaued at 16,000 demonstration trajectories, RL achieved a 42.6% performance improvement on out-of-distribution tasks, indicating superior generalization [17][18].
- A comprehensive evaluation benchmark was developed to dissect the differences in generalization between SFT and RL across visual, semantic, and execution dimensions [19][21].

Group 5: Practical Implications
- The research underscores the core value of RL in building truly generalizable embodied agents, which grows increasingly important as robotic applications become more complex and varied [25].
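The shared Actor-Critic architecture mentioned above can be illustrated with a generic sketch: instead of maintaining two separate backbones, the policy (actor) head and value (critic) head branch off one shared feature trunk, so backbone activations and weights are computed and stored once. This is a minimal NumPy illustration of the general idea, not the team's OpenVLA implementation; all dimensions and weight names here are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only; a real VLA backbone is a large transformer.
OBS_DIM, HID_DIM, N_ACTIONS = 32, 64, 8

# One shared trunk serves both heads, instead of duplicating the
# (expensive) backbone separately for the actor and the critic.
W_trunk = rng.normal(scale=0.1, size=(OBS_DIM, HID_DIM))
W_actor = rng.normal(scale=0.1, size=(HID_DIM, N_ACTIONS))  # policy head
w_critic = rng.normal(scale=0.1, size=(HID_DIM, 1))         # value head

def forward(obs):
    h = np.tanh(obs @ W_trunk)        # shared features, computed once
    logits = h @ W_actor              # actor: action-distribution logits
    value = float(h @ w_critic)       # critic: state-value estimate
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()              # softmax over discrete action tokens
    return probs, value

obs = rng.normal(size=OBS_DIM)
probs, value = forward(obs)
```

One forward pass thus yields both the action distribution PPO samples from and the value estimate used for its advantage computation, which is where the reported memory and speed savings would come from in a model of this shape.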