CogAgent

A New Paradigm for Robotic Manipulation: A Systematic Survey of VLA Models | Jinqiu Select
锦秋集 · 2025-09-02 13:41
Core Insights
- The article discusses the emergence of Vision-Language-Action (VLA) models based on large Vision-Language Models (VLMs) as a transformative paradigm in robotic manipulation, addressing the limitations of traditional methods in unstructured environments [1][4][5]
- It highlights the need for a structured classification framework to mitigate research fragmentation in the rapidly evolving VLA field [2]

Group 1: New Paradigm in Robotic Manipulation
- Robotic manipulation is a core challenge at the intersection of robotics and embodied AI, requiring deep understanding of visual and semantic cues in complex environments [4]
- Traditional methods rely on predefined control strategies, which struggle in unstructured real-world scenarios, revealing limitations in scalability and generalization [4][5]
- The advent of large VLMs has provided a revolutionary approach, enabling robots to interpret high-level human instructions and generalize to unseen objects and scenes [5][10]

Group 2: VLA Model Definition and Classification
- VLA models are defined as systems that utilize a large VLM to understand visual observations and natural language instructions, followed by a reasoning process that generates robotic actions [6][7]
- VLA models are categorized into two main types, Monolithic Models and Hierarchical Models, each with distinct architectures and functionalities [7][8] (a minimal code sketch of both families follows this list)

Group 3: Monolithic Models
- Monolithic VLA models can be implemented in single-system or dual-system architectures, integrating perception and action generation into a unified framework [14][15]
- Single-system models process all modalities together, while dual-system models separate reflective reasoning from reactive behavior, enhancing efficiency [15][16]

Group 4: Hierarchical Models
- Hierarchical models consist of a planner and a policy, allowing for independent operation and modular design, which enhances flexibility in task execution [43]
- These models can be further divided into Planner-Only and Planner+Policy categories, with the former focusing solely on planning and the latter integrating action execution [43][44]

Group 5: Advancements in VLA Models
- Recent advancements in VLA models include enhancements in perception modalities, such as 3D and 4D perception, as well as the integration of tactile and auditory information [22][23][24]
- Efforts to improve reasoning capabilities and generalization abilities are crucial for enabling VLA models to perform complex tasks in diverse environments [25][26]

Group 6: Performance Optimization
- Performance optimization in VLA models focuses on enhancing inference efficiency through architectural adjustments, parameter optimization, and inference acceleration techniques [28][29][30]
- Dual-system models have emerged to balance deep reasoning with real-time action generation, facilitating smoother deployment in real-world scenarios [35]

Group 7: Future Directions
- Future research directions include the integration of memory mechanisms, 4D perception, efficient adaptation, and multi-agent collaboration to further enhance VLA model capabilities [1][6]
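The monolithic vs. hierarchical split summarized in Groups 3 and 4 is easier to see as code. The Python sketch below is illustrative only: all class and function names (MonolithicVLA, PlannerVLM, LowLevelPolicy, run_hierarchical) are hypothetical placeholders, not the survey's or any specific model's API; it only shows where the single forward pass of a monolithic model differs from a planner that emits sub-goals for a separate low-level policy.

```python
"""Minimal sketch of the two VLA families: monolithic (one model maps
observation + instruction to an action) vs. hierarchical (a slow VLM planner
emits language sub-goals, a fast policy turns each sub-goal into motor commands).
All names are placeholders for illustration."""
from dataclasses import dataclass
from typing import List


@dataclass
class Observation:
    rgb: bytes             # camera frame (placeholder; a real system would use a tensor)
    proprio: List[float]   # joint positions / gripper state


@dataclass
class Action:
    delta_pose: List[float]  # 6-DoF end-effector displacement
    gripper: float           # 0.0 = open, 1.0 = closed


class MonolithicVLA:
    """Single-system model: one network consumes all modalities and emits actions."""

    def act(self, obs: Observation, instruction: str) -> Action:
        # Placeholder for a forward pass of a large VLM fine-tuned to emit action tokens.
        return Action(delta_pose=[0.0] * 6, gripper=0.0)


class PlannerVLM:
    """High-level planner: decomposes an instruction into language sub-goals."""

    def plan(self, obs: Observation, instruction: str) -> List[str]:
        # Placeholder for VLM reasoning; a real planner would ground sub-goals in the scene.
        return [f"locate target for: {instruction}", "grasp target", "place target"]


class LowLevelPolicy:
    """Reactive policy: runs at control rate, conditioned on one sub-goal at a time."""

    def act(self, obs: Observation, subgoal: str) -> Action:
        return Action(delta_pose=[0.0] * 6, gripper=1.0 if "grasp" in subgoal else 0.0)


def run_hierarchical(obs: Observation, instruction: str) -> List[Action]:
    """Planner+Policy pipeline: plan once (or at low frequency), then execute sub-goals."""
    planner, policy = PlannerVLM(), LowLevelPolicy()
    return [policy.act(obs, g) for g in planner.plan(obs, instruction)]


if __name__ == "__main__":
    obs = Observation(rgb=b"", proprio=[0.0] * 7)
    print(MonolithicVLA().act(obs, "put the apple in the bowl"))
    print(run_hierarchical(obs, "put the apple in the bowl"))
```

A dual-system monolithic model (Group 6) sits between these extremes: structurally one model, but with a slow reasoning pass and a fast action head running at different rates.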
Zhipu CEO Zhang Peng: Accelerating Agent Model and Product R&D, Aiming to Enable One-Sentence Control of Computers and Phones as Soon as Possible
IPO早知道 · 2024-11-30 02:36
This article is an original piece by IPO早知道. Author | Stone Jin. WeChat official account | ipozaozhidao

According to IPO早知道, Zhipu, one of the earliest large-model companies to explore Agents, announced several new developments on November 29: AutoGLM can autonomously execute long operations of more than 50 steps and can carry out tasks across apps; AutoGLM introduces a new "fully automatic" browsing experience, supporting hands-free ("driverless") operation on dozens of websites; and GLM-PC, which operates a computer the way a human does, has started internal testing as a technical exploration of general-purpose Agents built on a visual multimodal model.

Specifically, the upgraded AutoGLM can take on complex tasks: 1. Ultra-long tasks: it understands very long instructions and executes very long tasks. 2. Cross-app: AutoGLM supports executing tasks across apps. 3. Short commands: AutoGLM supports custom short phrases for long tasks. 4. "Casual" mode: AutoGLM can proactively make decisions for you.

AutoGLM has also entered large-scale internal testing and will launch as a consumer-facing product as soon as possible. AutoGLM additionally announced a plan to offer free "Auto" upgrades to ten apps with hundreds of millions of users, inviting app partners to jointly explore new Auto scenarios of their own.

In addition, Zhipu introduced a PC-based autonomous Agent: GLM-PC is the GLM team's technical exploration of a "driverless" PC, built on Zhipu's ...
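The GLM-PC and AutoGLM descriptions above amount to a perception-action loop over screenshots driven by a visual multimodal model. The following Python sketch illustrates that loop only; every name in it (capture_screen, query_vlm, execute, UIAction) is a hypothetical placeholder, not Zhipu's actual API or protocol.

```python
"""Hypothetical skeleton of a GUI-agent loop in the spirit of AutoGLM / GLM-PC:
capture the screen, ask a vision-language model for the next UI action, execute
it, and repeat until the model signals completion. All names are placeholders."""
from dataclasses import dataclass
from typing import Optional


@dataclass
class UIAction:
    kind: str                    # "click", "type", or "done"
    x: Optional[int] = None      # screen coordinates for clicks
    y: Optional[int] = None
    text: Optional[str] = None   # text to enter for "type" actions


def capture_screen() -> bytes:
    """Placeholder: return a screenshot of the current desktop or app."""
    return b""


def query_vlm(screenshot: bytes, instruction: str, step: int) -> UIAction:
    """Placeholder: a visual multimodal model maps (screenshot, instruction) to one UI action."""
    # A real agent would parse structured output from the model here.
    return UIAction(kind="done") if step >= 3 else UIAction(kind="click", x=100, y=200)


def execute(action: UIAction) -> None:
    """Placeholder: dispatch the action to the OS via mouse/keyboard automation."""
    print(f"executing {action}")


def run_agent(instruction: str, max_steps: int = 50) -> None:
    """Long-horizon loop; the 50+ step tasks mentioned above correspond to max_steps."""
    for step in range(max_steps):
        action = query_vlm(capture_screen(), instruction, step)
        if action.kind == "done":
            break
        execute(action)


if __name__ == "__main__":
    run_agent("open the browser and search for today's weather")
```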