Large Vision-Language Models (VLMs)
Microsoft & HKUST Compare Multiple Transfer Techniques! How Exactly Can VLA Models Effectively Inherit the Rich Visual-Semantic Priors in VLMs?
具身智能之心· 2025-11-15 16:03
Core Insights
- The article discusses the introduction of the GrinningFace benchmark, which aims to address the challenges of knowledge transfer from Vision-Language Models (VLMs) to Vision-Language-Action (VLA) models by using emoji-based tasks as a testing ground [1][2][4].

Group 1: Challenges in VLA Training
- VLA training relies heavily on VLM initialization but faces three main challenges: unclear transfer effects, the risk of catastrophic forgetting, and the lack of a standardized comparison of different transfer techniques [2][4].
- Existing datasets have low overlap with VLM pre-training data, making it difficult to isolate the contributions of "robotic action skills" and "VLM prior knowledge" [2].

Group 2: GrinningFace Benchmark Design
- The GrinningFace benchmark uses emojis as a bridge to separate action execution from semantic recognition, allowing the effect of knowledge transfer to be measured precisely [4][5].
- The benchmark includes a standardized task in which a robotic arm must place a cube on the emoji card specified by a language instruction [4].

Group 3: Evaluation Metrics
- The evaluation framework consists of two core metrics: execution success rate (SR) and recognition SR, which quantify the robot's ability to perform actions and to recognize semantic cues, respectively [5][8]. A minimal sketch of how such metrics could be computed follows this summary.
- The study found that different fine-tuning strategies have varying impacts on knowledge transfer, with a focus on retaining VLM prior knowledge while adapting to specific tasks [5][11].

Group 4: Key Findings on Transfer Techniques
- The research highlights that co-training, latent action prediction, and diverse pre-training data are critical for effective knowledge transfer [7][19].
- The balance between retaining VLM prior knowledge and adapting robotic actions is identified as a core principle of VLA design [19].

Group 5: Future Directions
- Future work should focus on optimizing parameter-efficient fine-tuning techniques, improving knowledge-transfer efficiency, and designing complex tasks that reflect real-world applications [19].
- Exploring multimodal prior fusion, including tactile and auditory information, could improve VLA adaptability to diverse environments [19].
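As a rough illustration of the two metrics named under Group 3, here is a short, self-contained sketch. This is not code from the paper: the trial-record fields and the convention that recognition SR is computed over completed placements are assumptions made for illustration only.

```python
# Hypothetical sketch of execution SR and recognition SR over a list of trials.
# Field names and the conditioning of recognition SR on completed placements
# are assumptions; the GrinningFace paper defines its own evaluation protocol.
from dataclasses import dataclass
from typing import List


@dataclass
class Trial:
    placed_cube: bool               # did the arm complete the pick-and-place action?
    card_matches_instruction: bool  # was the chosen emoji card the one named in the instruction?


def execution_sr(trials: List[Trial]) -> float:
    """Fraction of trials in which the manipulation itself succeeded."""
    return sum(t.placed_cube for t in trials) / len(trials)


def recognition_sr(trials: List[Trial]) -> float:
    """Fraction of completed placements that landed on the semantically correct card."""
    completed = [t for t in trials if t.placed_cube]
    if not completed:
        return 0.0
    return sum(t.card_matches_instruction for t in completed) / len(completed)


if __name__ == "__main__":
    trials = [Trial(True, True), Trial(True, False), Trial(False, False), Trial(True, True)]
    print(f"execution SR:   {execution_sr(trials):.2f}")   # 0.75
    print(f"recognition SR: {recognition_sr(trials):.2f}")  # 0.67
```

Separating the two numbers this way mirrors the benchmark's intent: a model can be a competent manipulator (high execution SR) while still failing to use the VLM's semantic priors (low recognition SR).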
A New Paradigm for Robotic Manipulation: A Systematic Survey of VLA Models | Jinqiu Select
锦秋集· 2025-09-02 13:41
Core Insights
- The article discusses the emergence of Vision-Language-Action (VLA) models based on large Vision-Language Models (VLMs) as a transformative paradigm in robotic manipulation, addressing the limitations of traditional methods in unstructured environments [1][4][5].
- It highlights the need for a structured classification framework to mitigate research fragmentation in the rapidly evolving VLA field [2].

Group 1: New Paradigm in Robotic Manipulation
- Robotic manipulation is a core challenge at the intersection of robotics and embodied AI, requiring a deep understanding of visual and semantic cues in complex environments [4].
- Traditional methods rely on predefined control strategies, which struggle in unstructured real-world scenarios, revealing limitations in scalability and generalization [4][5].
- The advent of large VLMs has provided a revolutionary approach, enabling robots to interpret high-level human instructions and generalize to unseen objects and scenes [5][10].

Group 2: VLA Model Definition and Classification
- VLA models are defined as systems that use a large VLM to understand visual observations and natural language instructions, followed by a reasoning process that generates robotic actions [6][7].
- VLA models are categorized into two main types, monolithic models and hierarchical models, each with distinct architectures and functionalities [7][8].

Group 3: Monolithic Models
- Monolithic VLA models can be implemented as single-system or dual-system architectures, integrating perception and action generation into a unified framework [14][15].
- Single-system models process all modalities together, while dual-system models separate reflective reasoning from reactive behavior, enhancing efficiency [15][16].

Group 4: Hierarchical Models
- Hierarchical models consist of a planner and a policy that can operate independently, a modular design that enhances flexibility in task execution [43]. A minimal sketch of this planner-policy split follows this summary.
- These models can be further divided into Planner-Only and Planner+Policy categories, with the former focusing solely on planning and the latter also integrating action execution [43][44].

Group 5: Advancements in VLA Models
- Recent advancements in VLA models include enhanced perception modalities, such as 3D and 4D perception, as well as the integration of tactile and auditory information [22][23][24].
- Efforts to improve reasoning capabilities and generalization are crucial for enabling VLA models to perform complex tasks in diverse environments [25][26].

Group 6: Performance Optimization
- Performance optimization in VLA models focuses on enhancing inference efficiency through architectural adjustments, parameter optimization, and inference-acceleration techniques [28][29][30].
- Dual-system models have emerged to balance deep reasoning with real-time action generation, facilitating smoother deployment in real-world scenarios [35].

Group 7: Future Directions
- Future research directions include the integration of memory mechanisms, 4D perception, efficient adaptation, and multi-agent collaboration to further enhance VLA model capabilities [1][6].
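To make the planner-policy decomposition described under Group 4 concrete, the sketch below separates a high-level planner (a VLM-like module that turns an instruction and image into subgoals) from a low-level policy (which maps an observation and one subgoal to a motor command). All names, the subgoal format, and the 7-DoF command are hypothetical illustrations, not the interface of any specific model in the survey.

```python
# Hypothetical planner + policy split for a hierarchical VLA system.
from typing import List, Protocol

import numpy as np


class Planner(Protocol):
    """High-level module: turns an instruction and image into a list of subgoals."""
    def plan(self, image: np.ndarray, instruction: str) -> List[str]: ...


class Policy(Protocol):
    """Low-level module: maps the current observation and one subgoal to a motor command."""
    def act(self, image: np.ndarray, subgoal: str) -> np.ndarray: ...


class HierarchicalVLA:
    """Planner and policy run as separate modules, so either can be swapped independently."""

    def __init__(self, planner: Planner, policy: Policy):
        self.planner = planner
        self.policy = policy

    def step(self, image: np.ndarray, instruction: str) -> np.ndarray:
        subgoals = self.planner.plan(image, instruction)  # e.g. ["grasp cube", "move to card"]
        return self.policy.act(image, subgoals[0])        # execute the first subgoal


if __name__ == "__main__":
    class DummyPlanner:
        def plan(self, image, instruction):
            return [f"grasp target for: {instruction}", "place on goal"]

    class DummyPolicy:
        def act(self, image, subgoal):
            return np.zeros(7)  # placeholder 7-DoF command

    vla = HierarchicalVLA(DummyPlanner(), DummyPolicy())
    print(vla.step(np.zeros((224, 224, 3)), "put the cube on the smiling face"))
```

The modularity is the point of the hierarchy: a stronger planner or a faster policy can be dropped in without retraining the other half, which is the flexibility the survey attributes to Planner+Policy designs.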
How Have VLA Models Built on Large VLMs Advanced Robotic Manipulation Step by Step?
具身智能之心· 2025-08-26 00:03
Core Viewpoint
- The article discusses the transformative impact of large Vision-Language Models (VLMs) on robotic manipulation, enabling robots to understand and execute complex tasks through natural language instructions and visual cues [3][4][5].

Group 1: VLA Model Development
- The emergence of Vision-Language-Action (VLA) models, driven by large VLMs, allows robots to interpret visual details and human instructions and convert this understanding into executable actions [4][5]. A minimal sketch of this image-plus-instruction-to-action pattern follows this summary.
- The article highlights the evolution of VLA models, categorizing them into monolithic and hierarchical architectures, and identifies key challenges and future directions in the field [9][10][11].

Group 2: Research Contributions
- The research from Harbin Institute of Technology (Shenzhen) provides a comprehensive survey of VLA models, detailing their definitions, core architectures, and integration with reinforcement learning and learning from human videos [5][9][10].
- The survey aims to unify terminology and modeling assumptions in the VLA field, addressing fragmentation across disciplines such as robotics, computer vision, and natural language processing [17][18].

Group 3: Technical Advancements
- VLA models leverage the capabilities of large VLMs, including open-world generalization, hierarchical task planning, knowledge-enhanced reasoning, and rich multimodal integration [13][64].
- The article outlines the limitations of traditional robotic methods and how VLA models overcome them, enabling robots to handle unstructured environments and vague instructions effectively [16][24].

Group 4: Future Directions
- The article emphasizes the need for advances in 4D perception and memory mechanisms to strengthen VLA models in long-horizon task execution [5][16].
- It also discusses the importance of developing unified frameworks for VLA models to improve their adaptability across tasks and environments [17][66].
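To illustrate the monolithic "visual details + human instruction → executable action" pattern referenced in Group 1, here is a minimal PyTorch sketch: pre-extracted vision and text features are fused and decoded into discretized action tokens, then mapped back to a continuous 7-DoF command. The module layout, dimensions, and binning scheme are assumptions for illustration, not the architecture of any model discussed in the article.

```python
# Hypothetical monolithic VLA head: fuse vision + language features, emit action tokens.
import torch
import torch.nn as nn


class MonolithicVLA(nn.Module):
    def __init__(self, vision_dim=512, text_dim=512, hidden=512, action_bins=256, action_dims=7):
        super().__init__()
        self.fuse = nn.Linear(vision_dim + text_dim, hidden)               # joint vision-language fusion
        self.action_head = nn.Linear(hidden, action_dims * action_bins)    # one bin distribution per DoF
        self.action_bins = action_bins
        self.action_dims = action_dims

    def forward(self, vision_feat: torch.Tensor, text_feat: torch.Tensor) -> torch.Tensor:
        h = torch.relu(self.fuse(torch.cat([vision_feat, text_feat], dim=-1)))
        logits = self.action_head(h).view(-1, self.action_dims, self.action_bins)
        bins = logits.argmax(dim=-1)                                        # discrete action tokens
        return bins.float() / (self.action_bins - 1) * 2.0 - 1.0           # de-tokenize to [-1, 1] per DoF


if __name__ == "__main__":
    model = MonolithicVLA()
    vision_feat = torch.randn(1, 512)  # e.g. pooled image-encoder features
    text_feat = torch.randn(1, 512)    # e.g. pooled instruction embedding
    action = model(vision_feat, text_feat)
    print(action.shape)                # torch.Size([1, 7]) -> one continuous value per DoF
```

The single fused pathway is what distinguishes this pattern from the hierarchical sketch above: perception, language grounding, and action generation share one forward pass rather than being split across a planner and a policy.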