Vision-Language-Action (VLA) Large Models
Embodied Intelligence Year in Review: The Fierce Collision of Bubble and Reality
腾讯研究院 (Tencent Research Institute) · 2026-02-26 09:03
This article is reposted from 数字社会发展与研究 (Digital Society Development Research Center), a center focused on public-interest observation and research into new phenomena and problems arising from digital technology and its economic, social, and humanistic dimensions. Author: 程普 (Cheng Pu), independent technology observer.

In 2025, the embodied intelligence field was still striving to complete the perilous leap from the laboratory to industrial deployment. With breakthroughs by large models in perception and decision-making, and the concentrated injection of national strategic resources, this leap made some substantive progress, but challenges remained severe in supply-chain restructuring, validation in complex scenarios, and closing the loop on business models.

Over the year, from Unitree's robots making a striking appearance on the Spring Festival Gala stage, to embodied intelligence being formally written into the 2025 Government Work Report and included in the national "15th Five-Year Plan" recommendations, the industry gained all-round support from policy to capital; from the establishment of MIIT's standardization technical committee for humanoid robots and embodied intelligence, to the robot marathon kicking off in Beijing Yizhuang, steel bodies running on the track became a microcosm of an industry at full sprint.

2025 was a genuine first year of mass production, a year of capital frenzy, and above all a year in which bubble and reality collided fiercely.

Capital and Industry: The Double Variation of Resource Concentration and Market Differentiation
The 2025 primary market for embodied intelligence presented a complex picture in which capital concentration and structural differentiation coexisted ...
NeurIPS 2025 | Tsinghua Team Analyzes How RL Improves VLA Generalization
具身智能之心· 2025-10-15 04:00
Core Insights
- The article examines the potential of Vision-Language-Action (VLA) models in embodied intelligence and highlights the limitations of current supervised fine-tuning (SFT) methods in achieving human-like generalization, emphasizing the advantages of Reinforcement Learning (RL) in enhancing the generalization capabilities of VLA models [1][3].

Group 1: Research Findings
- A new evaluation benchmark was created to address the limited generalization of VLA models, comparing how RL and SFT improve model robustness across visual, semantic, and execution challenges [3][19].
- Experiments showed that RL algorithms such as Proximal Policy Optimization (PPO) significantly improved robustness in semantic understanding and task execution, while matching SFT performance in visually varied scenarios [3][12].

Group 2: Methodology
- The research used the open-source OpenVLA model, fine-tuned from Llama2-7b, in experiments that map RGB images to action tokens for robotic control [6].
- Three RL methods were tested: PPO, Direct Preference Optimization (DPO), and Group Relative Policy Optimization (GRPO), with PPO showing clear advantages in multi-step decision tasks [8][15].

Group 3: PPO Training Innovations
- The research team proposed three key innovations for efficient PPO training (see the sketch after this summary):
  1. A shared Actor-Critic architecture that reduced memory usage by 45% and improved training speed by 35% [12][14].
  2. A warm-up strategy using 140 high-quality demonstration trajectories that improved convergence speed by 50% [14].
  3. Reducing PPO training to a single epoch per batch, which was sufficient for performance without increasing training time [14].

Group 4: Comparison of SFT and RL
- The study found that SFT performance plateaued at 16,000 demonstration trajectories, whereas RL achieved a 42.6% performance improvement on out-of-distribution tasks, indicating superior generalization [17][18].
- A comprehensive evaluation benchmark was developed to dissect the differences in generalization between SFT and RL across visual, semantic, and execution dimensions [19][21].

Group 5: Practical Implications
- The research underscores the core value of RL for building truly generalizable embodied agents, which matters increasingly as robotic applications become more complex and varied [25].
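To make the shared Actor-Critic idea concrete, below is a minimal sketch of how an actor head (action-token logits) and a critic head (scalar state value) can share one VLA backbone, with a single PPO update pass per batch as the summary describes. This is an illustrative sketch under stated assumptions: the names (`SharedActorCriticVLA`, `ppo_update`), the dimensions, and the toy stand-in backbone are hypothetical and not taken from the paper's code; a real setup would use the OpenVLA backbone and its action tokenizer.

```python
# Hypothetical sketch of a shared Actor-Critic head on one VLA backbone,
# updated with a clipped PPO objective in a single pass per batch.
import torch
import torch.nn as nn


class SharedActorCriticVLA(nn.Module):
    """Actor and critic heads on top of a single shared backbone (one large model in memory)."""

    def __init__(self, backbone: nn.Module, hidden_dim: int, n_action_tokens: int):
        super().__init__()
        self.backbone = backbone                                   # stand-in for an OpenVLA-style transformer
        self.actor_head = nn.Linear(hidden_dim, n_action_tokens)   # logits over discrete action tokens
        self.critic_head = nn.Linear(hidden_dim, 1)                # scalar state-value estimate

    def forward(self, obs_features: torch.Tensor):
        h = self.backbone(obs_features)                            # [batch, hidden_dim] pooled features
        return self.actor_head(h), self.critic_head(h).squeeze(-1)


def ppo_update(model, optimizer, batch, clip_eps=0.2, value_coef=0.5):
    """One PPO update over the batch (a single training epoch, as in the summary)."""
    logits, values = model(batch["obs"])
    dist = torch.distributions.Categorical(logits=logits)
    logp = dist.log_prob(batch["actions"])

    # Clipped surrogate objective on the policy (actor) side.
    ratio = torch.exp(logp - batch["old_logp"])
    adv = batch["advantages"]
    policy_loss = -torch.min(ratio * adv,
                             torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv).mean()

    # Value regression on the critic side; both losses backpropagate through the shared backbone.
    value_loss = (values - batch["returns"]).pow(2).mean()

    loss = policy_loss + value_coef * value_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


if __name__ == "__main__":
    # Toy backbone and random batch so the sketch runs end to end.
    backbone = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 64))
    model = SharedActorCriticVLA(backbone, hidden_dim=64, n_action_tokens=256)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
    batch = {
        "obs": torch.randn(8, 32),
        "actions": torch.randint(0, 256, (8,)),
        "old_logp": torch.randn(8),
        "advantages": torch.randn(8),
        "returns": torch.randn(8),
    }
    print("ppo loss:", ppo_update(model, optimizer, batch))
```

In such a setup, the warm-up stage mentioned in the summary would correspond to behavior-cloning the actor head on the demonstration trajectories before the PPO loop starts, so the clipped policy update begins from a reasonable initialization rather than a random policy.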