Vision-Language-Action Models (VLA)
Microsoft & HKUST Compare Multiple Transfer Techniques: How Does a VLA Effectively Inherit the Rich Visual-Semantic Priors of a VLM?
具身智能之心· 2025-11-15 16:03
Author: Chuheng Zhang et al. Editor: 具身智能之心

In embodied intelligence, training vision-language-action models (VLA) initialized from large vision-language models (VLM) has become the mainstream paradigm, yet a core question remains open: how does a VLA effectively inherit the rich visual-semantic priors of a VLM? The GrinningFace benchmark, proposed jointly by teams from Microsoft Research and the Hong Kong University of Science and Technology, takes an emoji tabletop-manipulation task as its entry point. Through experiments in both simulation and on a real robot, it systematically compares multiple transfer techniques, revealing the key role of VLM priors in VLA generalization and providing clear guidance for efficient knowledge transfer.

Why is a dedicated VLA knowledge-transfer benchmark needed?

Although current VLA training generally relies on VLM initialization, three core pain points persist that traditional benchmarks cannot precisely diagnose:

| Core Pain Point | Manifestation |
| --- | --- |
| Unclear prior-transfer effects | The VLM's visual-semantic knowledge is entangled with the VLA's robot action skills, making it impossible to ...
Alibaba's New Research: Unifying VLA and World Models
自动驾驶之心· 2025-11-06 08:43
Core Insights
- The article discusses the WorldVLA framework, which integrates vision-language-action models (VLA) with world models to enhance AI's understanding of the environment [1][4][36]
- WorldVLA demonstrates superior performance compared to independent action and world models, showcasing a synergistic effect between the two [2][18]

Group 1: Framework Overview
- WorldVLA is designed as a unified autoregressive action world model that combines action and image understanding for improved predictive capabilities [4]
- The framework utilizes three independent tokenizers for encoding images, text, and actions, optimizing the representation of visual and action data [8]

Group 2: Model Performance
- Benchmark results indicate that WorldVLA outperforms discrete action models like OpenVLA, even without pre-training, validating its architectural design [19][21]
- The model's performance improves with higher image resolutions, with 512×512 pixels showing significant enhancements over 256×256 pixels [22][23]

Group 3: Mutual Enhancement
- The world model enhances action generation by understanding physical laws and predicting future states based on current actions [14][25]
- Conversely, the action model improves the visual understanding of the world model, leading to more contextually relevant actions [17][30]

Group 4: Practical Applications
- WorldVLA's ability to predict the outcomes of candidate actions aids in optimizing decision-making processes, thereby increasing task success rates [26]
- The framework demonstrates practical advantages in complex scenarios, such as successfully executing tasks that pure world models struggle with [32]
Alibaba's New Research: Unifying VLA and World Models
具身智能之心· 2025-10-31 00:04
Core Insights
- The article discusses the development of WorldVLA, a unified framework that integrates vision-language-action models (VLA) with world models, aimed at enhancing AI's understanding of the world [2][5]

Group 1: Framework and Model Integration
- WorldVLA demonstrates significant performance improvements over independent action and world models, showcasing a mutual enhancement effect [3][20]
- The framework combines the capabilities of action models and world models to predict future images and generate actions, addressing the limitations of each model when used separately [5][6]

Group 2: Model Architecture and Training
- WorldVLA utilizes three independent tokenizers for encoding images, text, and actions, with a compression ratio of 16 and a codebook size of 8192 [9]
- The model employs a novel attention mask for action generation, allowing for parallel generation of multiple actions while maintaining the integrity of the generated sequence [12][13]

Group 3: Performance Metrics and Results
- Benchmark tests indicate that WorldVLA outperforms discrete action models even without pre-training, with notable improvements across performance metrics [20][22]
- The model's performance is positively correlated with image resolution, with 512×512 pixel resolution yielding significant enhancements over 256×256 resolution [22][24]

Group 4: Mutual Benefits of Model Types
- The integration of world models enhances action models by providing a deeper understanding of environmental physics, which is crucial for tasks requiring precision [26][27]
- Conversely, action models improve the visual understanding capabilities of world models, leading to more effective action generation [18][31]
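The action-generation attention mask described above, where each action chunk attends to the text and visual context but not to other action chunks so that one action's errors cannot propagate to the next, can be sketched as a boolean mask over token positions. The sequence layout (text, then image, then action chunks) is an assumption for illustration, not the paper's exact design.

```python
import numpy as np

# Hedged sketch of a block-causal attention mask: mask[i, j] = True means
# token i may attend to token j. Action chunks see the full text+image
# context and attend causally within themselves, but never to each other.

def build_mask(n_text: int, n_image: int, action_chunks: list) -> np.ndarray:
    total = n_text + n_image + sum(action_chunks)
    mask = np.zeros((total, total), dtype=bool)

    # Text and image tokens use ordinary causal attention.
    ctx = n_text + n_image
    mask[:ctx, :ctx] = np.tril(np.ones((ctx, ctx), dtype=bool))

    # Each action chunk attends to the whole context and causally to
    # itself, but is blind to the other action chunks.
    start = ctx
    for chunk in action_chunks:
        end = start + chunk
        mask[start:end, :ctx] = True
        mask[start:end, start:end] = np.tril(np.ones((chunk, chunk), dtype=bool))
        start = end
    return mask

m = build_mask(n_text=2, n_image=3, action_chunks=[2, 2])
# Rows 7-8 (second action chunk) can see columns 0-4 (context)
# but not columns 5-6 (first action chunk).
```

Blinding each chunk to its predecessors is what allows the chunks to be generated in parallel: no chunk's logits depend on another chunk's sampled tokens.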
Alibaba's New Research: Unifying VLA and World Models
36Kr· 2025-10-29 10:32
Core Insights
- WorldVLA is a unified framework that integrates vision-language-action models (VLA) with world models, developed collaboratively by Alibaba DAMO Academy, Hupan Lab, and Zhejiang University [1][4]

Group 1: Framework Overview
- The world model predicts future images by understanding actions and images, aiming to learn the underlying physical laws of the environment to enhance action generation accuracy [2]
- The action model generates subsequent actions based on image observations, which not only aids visual understanding but also enhances the visual generation capability of the world model [2]
- Experimental results indicate that WorldVLA significantly outperforms independent action and world models, showcasing a mutual enhancement effect between the two [2][12]

Group 2: Model Architecture
- WorldVLA utilizes three independent tokenizers for encoding images, text, and actions, initialized from the Chameleon model [6]
- The image tokenizer employs a VQ-GAN model with a compression ratio of 16 and a codebook size of 8192, generating 256 tokens for a 256×256 image and 1024 tokens for a 512×512 image [6]
- The action tokenizer discretizes continuous robot actions into 256 intervals, represented by 7 tokens covering relative positions and angles [6]

Group 3: Training and Performance
- WorldVLA employs an autoregressive training approach, in which all text, actions, and images are tokenized and trained in a causal manner [8]
- A novel attention mask for action generation ensures that the current action relies solely on text and visual inputs, preventing errors in previous actions from affecting subsequent ones [10]
- Benchmark results show that even without pre-training, WorldVLA outperforms the discrete OpenVLA model, validating its architectural design [12]
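The action tokenizer described above, which discretizes each continuous robot action into 256 intervals across 7 dimensions, can be sketched as simple uniform binning. The per-dimension range of [-1, 1] is an assumption for illustration; the real tokenizer's ranges and dimension semantics may differ.

```python
import numpy as np

# Hedged sketch of uniform action discretization: a 7-dimensional
# continuous action (e.g. relative positions and angles) is binned into
# 256 intervals per dimension, giving 7 discrete tokens. The [-1, 1]
# range is an illustrative assumption.

def tokenize_action(action: np.ndarray, low: float = -1.0, high: float = 1.0) -> np.ndarray:
    # Map each dimension from [low, high] onto integer bins 0..255.
    scaled = (np.clip(action, low, high) - low) / (high - low)
    return np.minimum((scaled * 256).astype(int), 255)

def detokenize_action(tokens: np.ndarray, low: float = -1.0, high: float = 1.0) -> np.ndarray:
    # Recover bin centers; quantization error is at most half a bin width.
    return low + (tokens + 0.5) / 256 * (high - low)

tokens = tokenize_action(np.array([0.0, -1.0, 1.0, 0.5, -0.5, 0.25, 0.0]))
```

Decoding returns bin centers, so the round-trip error per dimension is bounded by half the bin width, (high - low) / 512, which is what makes a coarse 256-level grid workable for control.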
Group 4: Mutual Benefits of Models
- The introduction of the world model significantly enhances the action model's performance by enabling it to learn the underlying physical laws of the system, which is crucial for tasks requiring precision [15]
- The world model provides predictive capabilities that inform decision-making, optimizing action selection strategies and improving task success rates [18]
- Conversely, the action model improves the quality of the world model's output, particularly when generating longer video sequences [21]

Group 5: Expert Opinions
- Chen Long, Senior Research Director at Xiaomi Auto, emphasizes that VLA and world models need not be mutually exclusive; combining them lets each promote the other, advancing embodied intelligence on the path toward AGI [24]
Alibaba's New Research: Unifying VLA and World Models
量子位· 2025-10-29 09:30
Core Insights
- WorldVLA is a unified framework that integrates vision-language-action models (VLA) with world models, proposed by Alibaba DAMO Academy, Hupan Lab, and Zhejiang University [1][4]
- Experimental results indicate that WorldVLA significantly outperforms independent action models and world models, showcasing a mutual enhancement effect [2]

Model Overview
- The framework combines three independent tokenizers for encoding images, text, and actions, utilizing a VQ-GAN model for image tokenization with a compression ratio of 16 and a codebook size of 8192 [8]
- The action tokenizer discretizes continuous robot actions into 256 intervals, representing each action with 7 tokens [8]

Model Design
- WorldVLA employs an autoregressive action world model to unify action and image understanding and generation [4]
- The model addresses limitations of existing VLA and world models by enhancing action generation accuracy through an understanding of environmental physics [5][14]

Training and Performance
- WorldVLA is jointly trained on data from both the action model and the world model, enhancing action generation capabilities [13]
- The model's performance is positively correlated with image resolution, with 512×512 pixel resolution showing significant improvements over 256×256 [21][23]

Benchmark Results
- WorldVLA demonstrates superior performance compared to the discrete OpenVLA model, even without pre-training, validating its architectural design [19]
- The model generates coherent and physically plausible states across various scenarios, outperforming pure world models [31][32]

Mutual Enhancement
- The world model enhances the action model's performance by predicting the environmental state changes that follow from current actions, which is crucial for tasks requiring precision [25]
- Conversely, the action model improves the visual understanding of the world model, supporting better visual generation [17][30]
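The image-tokenizer figures quoted across these summaries are easy to sanity-check: with a spatial compression ratio of 16, an H×W image maps to an (H/16)×(W/16) grid of codebook indices, one token per grid cell.

```python
# Arithmetic check of the reported image token counts: a compression
# ratio of 16 turns each 16x16 pixel patch into one codebook index.

def image_token_count(height: int, width: int, compression: int = 16) -> int:
    return (height // compression) * (width // compression)

assert image_token_count(256, 256) == 256   # 16 x 16 grid of tokens
assert image_token_count(512, 512) == 1024  # 32 x 32 grid of tokens
```

This also explains the resolution trade-off noted above: doubling the resolution to 512×512 quadruples the token count, improving fidelity at a quadratic cost in sequence length.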
A Humanoid Robot Training Ground of Over 10,000 Square Meters Opens in Beijing
Huan Qiu Wang Zi Xun· 2025-09-25 10:04
Core Insights
- The humanoid robot training facility in Beijing's Shijingshan district has officially commenced operations, marking a significant development in China's humanoid robot industry and providing a model for training facilities nationwide [1][7]
- The facility aims to accelerate the evolution of humanoid robots' "embodied intelligence" and promote their large-scale application in sectors such as automotive manufacturing and logistics, laying a solid foundation for a trillion-dollar industry [1][7]

Group 1: Training Facility Overview
- The training center spans over 10,000 square meters and replicates 16 detailed scenarios across four categories: industrial manufacturing, smart home, elderly care services, and 5G integration [3]
- The humanoid robot "Kuavo," standing 1.66 meters tall, is actively training in various scenarios, achieving a success rate of over 95% in tasks such as empty-box retrieval, material sorting, weighing, packaging, and product boxing [3][4]
- The facility's training data is sourced entirely from real machine operations, addressing industry challenges of poor data quality, high acquisition costs, and migration difficulties [3][4]

Group 2: Data Quality and Standardization
- The facility aims to overcome the bottlenecks in data quality and accessibility that have historically plagued the humanoid robot industry, moving from a "cottage industry" model to standardized, large-scale data production [4][5]
- High-quality, large-scale training data is essential for the performance of vision-language-action models (VLA), which enable robots to achieve cross-platform and cross-scenario capabilities [5]
- Real machine data is crucial for bridging the gap between theoretical models and practical applications, as synthetic data cannot fully replicate real-world interactions and environmental dynamics [5]

Group 3: Ecosystem Development
- The training center has established an innovative ecosystem model that integrates training, application, incubation, and public education, aiming to create a national public data service platform for embodied intelligence [6]
- Collaborations with universities and research institutions are in place to support entrepreneurship and application-scenario development, while also providing high-quality data services [6]
- The facility will host the "First Embodied Intelligence Operational Task Challenge & Startup Camp," fostering innovation through a "competition-incubation" mechanism [6]

Group 4: Future Implications
- The establishment of this facility signals a new phase of large-scale, standardized development in China's humanoid robot industry [7]
- The training center will broaden robots' skill sets, enabling them to perform tasks more effectively across factories, logistics parks, and elderly care institutions [7]
- As more robots "graduate" from the facility, their presence is expected to grow in everyday settings, facilitating the integration of intelligent robots into various industries and households [7]
SJTU's Lu Cewu: How to Crack Robot Generalization and Robustness
Core Insights
- The main focus is on advancements in robotics, particularly embodied intelligence, and the challenges and opportunities within the industry [1][7]

Group 1: Robotics Development
- The key challenges in developing robotic intelligence are not primarily about chip computing power but about iterating embodied model architectures and data loops [1][2]
- The "digital gene" framework proposed by the company aims to enhance robots' understanding and execution capabilities, allowing them to interpret and act upon instructions more effectively [3][4]
- The company has demonstrated significant advances in robotic applications, such as a robot serving in an ice cream shop, showcasing its ability to perform complex tasks autonomously [6]

Group 2: Market Dynamics
- The robotics industry is experiencing a surge in investment, with various companies seeking funding to demonstrate their commercial potential [7][8]
- Despite the increased interest, the financing scale of Chinese embodied-intelligence startups remains far below that of their American counterparts, with a reported gap of nearly 12 times in private AI investment [7]
- The industry's dual focus on talent and funding poses challenges for startups in technology strategy and validation under financial constraints [8]
Lingbao Robotics Team Keeps Breaking Through on the New Track of Embodied Intelligence, Making Robots "Dexterous in Mind and Hand" (Tech Perspective: Frontline Innovation)
Ren Min Ri Bao· 2025-07-27 22:23
Group 1
- The core message emphasizes the importance of technological innovation in advancing China's modernization and competitiveness in the global arena [1]
- The article introduces a series of reports titled "Frontline Innovation," focusing on the experiences and observations of researchers in the field of scientific innovation [1]

Group 2
- Lingbao Robotics, founded in 2023, specializes in developing general-purpose humanoid robots and embodied-intelligence products, with a focus on practical applications [3][4]
- The company uses a vision-language-action model (VLA) to let robots learn skills through imitation, significantly improving the efficiency of skill acquisition [4][5]
- Lingbao's robots can perform precise tasks, such as assembling computer components to a precision of 0.3 mm, showcasing their advanced capabilities [3][4]

Group 3
- Lingbao Robotics is developing flexible automation solutions for the shoe-manufacturing industry, addressing the high costs and low adaptability of traditional production lines [6][7]
- The company has developed a system that allows robots to learn tasks in dynamic environments, cutting the required training time to about one hour [7]
- Lingbao's humanoid robot, CASBOT 01, features a bionic hand capable of executing complex tasks, highlighting the integration of embodied intelligence and precision operation [8]

Group 4
- The domestic development of embodied intelligence is advancing rapidly, with a growing variety of tactile sensors and technologies being integrated into the industry [9]
- Lingbao Robotics emphasizes collaboration between academia and industry, applying the latest research findings to product development while also contributing to academic research [9]
Learning End-to-End Large Models, but Still Unclear on the Difference Between VLM and VLA...
自动驾驶之心· 2025-06-19 11:54
Core Insights
- The article emphasizes the growing importance of large vision-language models (VLM) in intelligent driving, highlighting their potential for practical applications and production deployment [2][4]

Group 1: VLM and VLA
- A VLM (vision-language model) focuses on foundational capabilities such as detection, question answering, spatial understanding, and reasoning [4]
- A VLA (vision-language-action model) is more action-oriented, aimed at trajectory prediction in autonomous driving, and requires deeper human-like reasoning and perception [4]
- It is recommended to learn VLM first before expanding to VLA; a VLA can predict trajectories through diffusion models, enhancing action capabilities in uncertain environments [4]

Group 2: Community and Resources
- The article invites readers to join a knowledge-sharing community offering comprehensive resources, including video courses, hardware, and coding materials related to autonomous driving [4]
- The community aims to build a network of professionals in intelligent driving and embodied intelligence, with a target of 10,000 members within three years [4]

Group 3: Technical Directions
- The article outlines four cutting-edge technical directions: vision-language models, world models, diffusion models, and end-to-end autonomous driving [5]
- It provides links to resources and papers covering advances in these areas, indicating a robust framework for ongoing research and development [6][31]

Group 4: Datasets and Applications
- A variety of datasets crucial for training and evaluating autonomous-driving models are mentioned, covering pedestrian detection, object tracking, and scene understanding [19][20]
- The article discusses language-enhanced systems in autonomous driving, showing how natural language processing can improve vehicle navigation and interaction [20][21]

Group 5: Future Trends
- The article highlights the potential of large models to significantly shape the future of autonomous driving, particularly in decision-making and control systems [24][25]
- It suggests that integrating language models with driving systems could lead to more intuitive, human-like vehicle behavior [24][25]
Embodied Intelligence: A Scientific Expedition That Demands Humility and Patience
Robot猎场备忘录· 2025-05-20 05:01
Core Viewpoints
- Embodied intelligence is injecting new research vitality into the robotics field and has the potential to break through performance limits [1]
- The development of embodied intelligence relies on breakthroughs in specific scientific problems and should not dismiss the contributions of traditional robotics [2]
- General intelligence cannot exist without a focus on specific tasks; expertise in particular areas is what leads to advances in broader capabilities [3]

Group 1: Interdisciplinary Collaboration
- Embodied intelligence is a cross-disciplinary product that requires collaboration with fields such as materials science, biomechanics, and design aesthetics [2]
- Breakthroughs often occur at the intersection of disciplines, highlighting the importance of diverse scientific contributions [2]

Group 2: Technology Evolution
- Technological evolution should not be viewed as a wholesale replacement of old systems; it is a process of sedimentation in which foundational technologies continue to support advances [5]
- The current trend of vision-language-action models may eventually be replaced by more efficient alternatives, underscoring the need for continuous innovation [5]

Group 3: Realistic Expectations for AGI
- Viewing embodied intelligence as the sole path to artificial general intelligence (AGI) is a dangerous oversimplification; AGI development requires many conditions and interdisciplinary knowledge [6]
- The complexity of embodied systems demands a collaborative approach across fields, rather than reliance on a few "genius" individuals [6]

Group 4: Current State of Embodied Intelligence
- The field is still in its early stages, with significant challenges remaining in hardware and algorithm development [7]
- Current humanoid robots are not yet fully autonomous and often require human intervention, indicating that the technology is still evolving [7]

Group 5: VLA Technology Pathway
- The vision-language-action (VLA) pathway may not be the most efficient approach, as operational skills often precede language capabilities in learning processes [9]
- Many current VLA models are resource-intensive and may be replaced by more efficient solutions in the future [9]

Group 6: Balancing Short-term and Long-term Goals
- Combining learning with model-based approaches is seen as more practical in the short term, while pure learning methods may represent the long-term future of robotics [10]
- Successful industrial robotic solutions often rely on model-based methods because of their stability and reliability [10]

Group 7: Humanoid Robots and Practicality
- The design of humanoid robots is driven by emotional projection and environmental adaptability, but specialized non-human forms may be more efficient for many applications [11]
- There is concern about over-investment in humanoid robots at the expense of practical, economically viable solutions [11]

Group 8: Building Technical Barriers
- True competitive advantages arise from extensive practical experience and meticulous attention to detail, not solely from novel algorithms [12]
- Long-term technical barriers are built through consistent effort and iterative improvements in engineering practice [12]

Group 9: Vision and Practicality
- Scientific research requires both grand vision and grounded practice; embodied intelligence embodies both idealistic aspirations and real-world challenges [13]
- Foundational theories, such as control theory, remain critical to the safety and functionality of robotic systems [13]