Vision-Language-Action Models (VLA)
Starry Sea (星海图) partner and CFO Luo Tianqi: embodied intelligence is still in the early stage of a technology race
Mei Ri Jing Ji Xin Wen· 2026-02-12 10:47
Core Insights
- The embodied-intelligence industry sits at a crossroads of capital and industrial focus: financing is increasing and technology demonstrations are frequent, yet stability, scalability, and cost control remain challenges [1]

Group 1: Financing and Valuation
- Starry Sea has completed a Series B financing round of 1 billion yuan, bringing its total financing to nearly 3 billion yuan at a valuation of 10 billion yuan and making it a unicorn in the embodied-intelligence sector [1]
- Starry Sea's CFO emphasizes that success in the AI industry is driven by the scaling law, and that the efficiency of capital utilization matters more than the amount raised [1][2]

Group 2: Industry Dynamics
- The current phase of the embodied-intelligence industry is compared to the "Hundred Groups War," and companies are advised to focus on understanding the essence of the business rather than the technology alone [2]
- The industry is transitioning from early-stage technology exploration to resource-intensive competition, with capital shifting from broad investment to backing the leading companies [2]

Group 3: Commercialization and Technology
- Commercialization of embodied intelligence splits into technology-driven and business-driven tracks, each with specific operational boundaries that must be met for successful deployment [4]
- The CFO believes the industry is still in the early stage of a technology race, so companies must retain sufficient funds to cope with the rising costs of data and model training [2][4]

Group 4: Financial Potential and Business Model
- The ToB (business-to-business) segment has significant revenue potential, with large orders capable of generating substantial income, but the focus should be on revenue-quality metrics [5]
- The long-term business model is likened to selling "tokens of the physical world," with the real barriers being the level of intelligence and the ability to design and manufacture hardware [5]

Group 5: Competitive Advantages
- China's data supply chain is significantly more cost-effective than that of the U.S., allowing greater data collection at lower cost [6]
- The CFO highlights that what is unique to embodied-intelligence companies is building their own foundation models for physical-world execution, and that resources should be concentrated on those capabilities [7]
Huawei enters world models, redefining the interaction boundary between humans and machines
Xuan Gu Bao· 2026-01-20 14:56
Group 1
- The core viewpoint of the article is the recent business-registration change at Beijing Liuxing Space Technology Co., Ltd., which has added Huawei's Shenzhen Hubble Technology Investment Partnership as a shareholder, indicating a strategic investment in world models and embodied intelligence [1]
- Liuxing Space focuses on developing world models, which enable AI to understand and interact with the physical world, moving toward a vision-language-action model that integrates visual perception with action control [1]
- The emergence of world models signals a shift in AI capabilities, potentially redefining the interaction boundaries between humans and machines by constructing AI systems that may surpass human values [1]

Group 2
- The article also mentions other companies in the sector, including Shen Si Electronics and Sytak, indicating a competitive landscape in AI and technology investment [2]
New at AAAI 2026! OC-VLA: solving the perception-action misalignment with an observation-view-centric VLA paradigm
具身智能之心· 2026-01-18 09:33
Background and motivation for OC-VLA

In VLA models, a common practice is to apply pretrained vision-language models or visual encoders to downstream robotic tasks to strengthen generalization. However, these vision models are annotated, trained, and supervised mainly in the camera coordinate frame, so their latent representations are aligned to camera space. By contrast, most robot control signals are defined and collected in the robot base frame. This discrepancy creates a misalignment between the perception space and the action space, which hinders effective learning of robot policies, especially when transferring pretrained vision models to robot control tasks.

Robot data are typically collected under diverse camera viewpoints and heterogeneous hardware configurations. In this setting, the same action, executed in the robot coordinate frame, must be predicted from different third-person camera views. This implicitly requires the model to reconstruct or infer consistent 3D actions from limited 2D observations. The inconsistency is especially harmful during large-scale pretraining, because the training data often contain observations from different camera viewpoints: the same manipulator action captured from different angles ...
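The frame mismatch described above can be made concrete with a few lines of linear algebra. This is a generic illustration of re-expressing a camera-frame point in the robot base frame; the extrinsics R and t (e.g. from a hand-eye calibration) are made-up values, not anything from the paper:

```python
import numpy as np

# Hypothetical extrinsics: the camera's pose relative to the robot base.
R = np.array([[0.0, -1.0,  0.0],
              [0.0,  0.0, -1.0],
              [1.0,  0.0,  0.0]])   # camera-to-base rotation
t = np.array([0.5, 0.0, 0.3])       # camera origin in the base frame (m)

def camera_to_base(p_cam: np.ndarray) -> np.ndarray:
    """Re-express a 3D point from the camera frame in the robot base frame."""
    return R @ p_cam + t

# A target 0.8 m in front of the camera lands at a very different coordinate
# in the base frame -- the gap a camera-aligned visual encoder never models.
p_base = camera_to_base(np.array([0.0, 0.0, 0.8]))
```

Two cameras viewing the same scene have different R and t, so identical base-frame actions correspond to different camera-frame observations, which is exactly the viewpoint inconsistency the paper targets.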
New from He Wang's team: solving VLA models' lack of precise geometric information
具身智能之心· 2026-01-05 01:03
Core Insights
- The article discusses the StereoVLA model, which enhances vision-language-action (VLA) models by integrating stereo vision to address spatial-perception challenges in robotic manipulation [1][4][16]

Group 1: Challenges in Existing VLA Models
- Current VLA models rely mainly on single-view RGB images, which lack precise spatial geometric information and are therefore inadequate for high-precision manipulation tasks [1][4]
- Three core challenges are identified: the limits of single-modality vision, the difficulty of integrating geometric and semantic information, and the complexity of multi-camera setups [4][6][5]

Group 2: StereoVLA Technical Architecture
- StereoVLA has a three-layer technical architecture (feature extraction, auxiliary training, and data support) that collectively enhances geometric perception and semantic understanding [8][10]
- The feature-extraction module efficiently fuses geometric cues from stereo vision with semantic information from single-view images, improving the model's performance [12]

Group 3: Performance Validation
- StereoVLA shows significant gains over existing baselines on three key tasks: general manipulation, bar-shaped object grasping, and small-object manipulation [13][14]
- In comparative tests across camera configurations, StereoVLA was notably robust to camera-pose variation, achieving success rates of 79.3%, 71.9%, and 61.3% under the small, medium, and large settings, respectively [14]

Group 4: Key Findings from Ablation Studies
- Ablations confirmed the necessity of the key design choices: removing semantic features caused a significant drop in success rate, validating the geometric-semantic integration [15][18]
- The model's depth-estimation strategy, which focuses on interaction areas, improved success rates by 18% over uniform sampling across the entire image [18]

Group 5: Limitations and Future Directions
- While StereoVLA is a significant advance in integrating stereo vision with VLA models, open issues remain, such as handling long-term dependencies and improving feature-extraction quality [16][18]
- Future work may extend the model to humanoid robots and explore additional stereo-vision foundation models to further improve geometric feature quality [18]
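The geometric cue a stereo pair contributes reduces to one formula for a rectified pair: depth = focal length × baseline / disparity. The sketch below illustrates that relation; the focal length and baseline values are invented for illustration, not StereoVLA's camera parameters:

```python
import numpy as np

def depth_from_disparity(disparity, focal_px=600.0, baseline_m=0.06):
    """Per-pixel metric depth from the disparity map of a rectified stereo pair."""
    # Clip away zero disparity (points at infinity) to avoid division by zero.
    return focal_px * baseline_m / np.clip(disparity, 1e-6, None)

disp = np.array([[10.0, 20.0],
                 [40.0, 80.0]])        # disparity in pixels
depth = depth_from_disparity(disp)     # nearer points have larger disparity
```

This is the extra signal a single RGB view cannot provide: absolute metric distance to each pixel, which is what makes high-precision grasping tractable.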
This year's VLA+RL papers are lining up, waiting to be accepted...
具身智能之心· 2025-12-24 00:25
Core Insights
- The article emphasizes the importance of reinforcement learning (RL) in enhancing the generalization of vision-language-action (VLA) models, with some experiments showing improvements of up to 42.6% on out-of-distribution tasks [2]

Group 1: VLA and RL Integration
- VLA models currently rely on RL to overcome their limits in real-world out-of-distribution scenarios, where imitation learning alone proves insufficient [2]
- Recent advances in VLA+RL frameworks have led to significant breakthroughs, with several notable papers published this year [2]
- Tooling for VLA+RL frameworks is evolving, with recommendations for resources like Rlinf, which supports a growing number of methods [2]

Group 2: Notable Research Papers
- A summary of representative VLA+RL research papers from the past two years highlights their contributions to the field [5]
- Key papers include "NORA-1.5," a VLA model trained with world-model and action-based preference rewards, and "Balancing Signal and Variance," on adaptive offline RL post-training for VLA flow models [5][10]
- Other significant works include "ReinboT," which enhances robot vision-language manipulation through RL, and "WMPO," which optimizes policies based on world models for VLA [8][10]

Group 3: Future Research Directions
- The article suggests that future research should track the advances in VLA and RL, encouraging collaboration and consultation for those interested in these areas [3]
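The RL post-training idea running through these papers can be sketched with a toy REINFORCE loop over discrete action tokens. Everything here (the three-token action space, the reward, the hyperparameters) is invented for illustration and comes from none of the cited works:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(3)                 # logits over 3 candidate action tokens

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def reward(a):
    return 1.0 if a == 2 else 0.0   # pretend token 2 solves the OOD task

lr = 0.5
for _ in range(200):
    p = softmax(theta)
    a = rng.choice(3, p=p)          # sample an action from the current policy
    grad_logp = -p
    grad_logp[a] += 1.0             # d/dtheta of log pi(a | theta)
    theta += lr * reward(a) * grad_logp   # REINFORCE ascent step

# The policy now concentrates its probability mass on the rewarded token.
```

The point mirrors the article's claim: imitation learning can only reproduce demonstrated behavior, while a reward signal lets the policy shift mass toward actions that actually succeed in scenarios the demonstrations never covered.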
This year most likely produced n VLA+RL papers, right?!
具身智能之心· 2025-12-22 10:23
Core Insights
- The article emphasizes integrating reinforcement learning (RL) with vision-language-action (VLA) models to enhance their generalization, particularly in out-of-distribution (OOD) scenarios, where performance improvements can reach 42.6% [2]

Group 1: Research Directions
- The article suggests that future research should focus on the combination of VLA and RL, encouraging collaboration with research assistants for guidance on starting projects in these areas [3]
- Several notable recent works in VLA+RL are highlighted, showcasing significant advances in the field [5][10]

Group 2: Notable Papers and Projects
- A list of representative papers from the last two years is provided, including "NORA-1.5" and "Balancing Signal and Variance," which address various aspects of VLA+RL integration [5][10]
- Links to project homepages and paper PDFs are shared for further exploration of these works [6][9][12]

Group 3: Tools and Frameworks
- The article mentions tools like Rlinf, which supports a growing number of methods for VLA+RL frameworks, indicating a trend toward more robust and versatile research tooling [2][11]
Microsoft & HKUST compare multiple transfer techniques! How exactly does a VLA effectively inherit the rich visual-semantic priors in a VLM?
具身智能之心· 2025-11-15 16:03
Core Insights
- The article introduces the GrinningFace benchmark, which addresses the challenges of knowledge transfer from vision-language models (VLMs) to vision-language-action (VLA) models by using emoji-based tasks as a testing ground [1][2][4]

Group 1: Challenges in VLA Training
- VLA training relies heavily on VLM initialization but faces three main challenges: unclear transfer effects, the risk of catastrophic forgetting, and the lack of a standardized comparison of transfer techniques [2][4]
- Existing datasets overlap little with VLM pre-training data, making it difficult to isolate the contributions of "robotic action skills" from "VLM prior knowledge" [2]

Group 2: GrinningFace Benchmark Design
- GrinningFace uses emojis as a bridge to decouple action execution from semantic recognition, allowing precise measurement of knowledge-transfer effects [4][5]
- The benchmark includes a standardized task in which a robotic arm must place a cube on the emoji card named by the language instruction [4]

Group 3: Evaluation Metrics
- The evaluation framework has two core metrics: execution success rate (SR) and recognition SR, quantifying the robot's ability to perform the action and to recognize the semantic cue, respectively [5][8]
- Different fine-tuning strategies were found to affect knowledge transfer differently, with the focus on retaining VLM prior knowledge while adapting to the target task [5][11]

Group 4: Key Findings on Transfer Techniques
- Co-training, latent-action prediction, and diverse pre-training data are critical to effective knowledge transfer [7][19]
- Balancing the retention of VLM prior knowledge against the adaptation of robotic actions is identified as a core principle of VLA design [19]

Group 5: Future Directions
- Future work should optimize parameter-efficient fine-tuning techniques, improve knowledge-transfer efficiency, and design complex tasks that better reflect real-world applications [19]
- Exploring multimodal prior fusion, including tactile and auditory information, could improve VLA adaptability across environments [19]
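The two success-rate metrics described above can be sketched on toy trial records. The field names and exact denominators below are my assumptions for illustration, not the benchmark's published schema:

```python
# Made-up trial records: did the arm complete a placement at all, and did it
# land on the emoji card the instruction named?
trials = [
    {"placed": True,  "card_correct": True},   # executed and recognized
    {"placed": True,  "card_correct": False},  # executed, wrong emoji card
    {"placed": False, "card_correct": False},  # placement failed outright
    {"placed": True,  "card_correct": True},
]

def execution_sr(trials):
    """Share of trials in which the arm completed a placement at all."""
    return sum(t["placed"] for t in trials) / len(trials)

def recognition_sr(trials):
    """Share of completed placements that hit the instructed emoji card."""
    done = [t for t in trials if t["placed"]]
    return sum(t["card_correct"] for t in done) / len(done)

ex, rec = execution_sr(trials), recognition_sr(trials)
```

Separating the two denominators is what decouples motor skill from semantics: a policy can score high execution SR (it always places the cube somewhere) while low recognition SR reveals that the VLM's prior knowledge was forgotten.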
New Alibaba research: unifying VLA and world models
自动驾驶之心· 2025-11-06 08:43
Core Insights
- The article discusses the WorldVLA framework, which integrates vision-language-action (VLA) models with world models to enhance AI's understanding of the environment [1][4][36]
- WorldVLA outperforms independent action models and world models, showcasing a synergistic effect between the two [2][18]

Group 1: Framework Overview
- WorldVLA is designed as a unified autoregressive action world model that combines action and image understanding for improved predictive capability [4]
- The framework uses three independent tokenizers to encode images, text, and actions, optimizing the representation of visual and action data [8]

Group 2: Model Performance
- Benchmark results show WorldVLA outperforming discrete action models like OpenVLA even without pre-training, validating its architectural design [19][21]
- Performance improves with image resolution, with 512x512 inputs showing significant gains over 256x256 [22][23]

Group 3: Mutual Enhancement
- The world model improves action generation by understanding physical laws and predicting future states from current actions [14][25]
- Conversely, the action model improves the world model's visual understanding, leading to more contextually relevant actions [17][30]

Group 4: Practical Applications
- WorldVLA's ability to predict the outcomes of candidate actions aids decision-making and increases task success rates [26]
- The framework shows practical advantages in complex scenarios, successfully executing tasks that pure world models struggle with [32]
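The candidate-action scoring described above can be sketched as a one-step lookahead: roll each candidate through a predictive model and act on the best predicted outcome. The one-dimensional dynamics and scoring function are toy stand-ins, not WorldVLA's actual components:

```python
def predict_next(state, action):
    """Toy world model: additive dynamics."""
    return state + action

def score(state):
    """Toy task objective: reach state 10."""
    return -abs(state - 10.0)

def pick_action(state, candidates):
    """Choose the candidate whose predicted next state scores best."""
    return max(candidates, key=lambda a: score(predict_next(state, a)))

best = pick_action(7.0, [-1.0, 1.0, 3.0])   # predicted next states: 6, 8, 10
```

Swapping the toy `predict_next` for a learned world model is what turns an open-loop action model into a planner: candidates are vetted against imagined outcomes before any action is executed.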
New Alibaba research: unifying VLA and world models
具身智能之心· 2025-10-31 00:04
Core Insights
- The article discusses the development of WorldVLA, a unified framework integrating vision-language-action (VLA) models with world models, aimed at enhancing AI's understanding of the world [2][5]

Group 1: Framework and Model Integration
- WorldVLA shows significant performance gains over independent action and world models, demonstrating a mutual enhancement effect [3][20]
- The framework combines the capabilities of action models and world models to predict future images and generate actions, addressing the limitations of each model when used separately [5][6]

Group 2: Model Architecture and Training
- WorldVLA uses three independent tokenizers for encoding images, text, and actions, with a compression ratio of 16 and a codebook size of 8192 [9]
- A novel attention mask for action generation allows multiple actions to be generated in parallel while maintaining the integrity of the generated sequence [12][13]

Group 3: Performance Metrics and Results
- Benchmark tests show WorldVLA outperforming discrete action models even without pre-training, with notable improvements across performance metrics [20][22]
- Performance correlates positively with image resolution, with 512×512 inputs yielding significant gains over 256×256 [22][24]

Group 4: Mutual Benefits of Model Types
- Integrating the world model gives the action model a deeper understanding of environmental physics, which is crucial for precision tasks [26][27]
- Conversely, the action model improves the visual understanding of the world model, leading to more effective action generation [18][31]
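The error-isolating attention mask described above might look like the following in boolean form: action tokens see the full text/image context and are causal within their own chunk, but are blinded to previously generated action chunks. The token layout and chunk sizes are illustrative, not WorldVLA's actual configuration:

```python
import numpy as np

def action_attention_mask(n_ctx, n_chunks, chunk_len):
    """Boolean mask: entry [i, j] is True if position i may attend to j."""
    n = n_ctx + n_chunks * chunk_len
    m = np.zeros((n, n), dtype=bool)
    for i in range(n_ctx):
        m[i, : i + 1] = True                 # causal among context tokens
    for c in range(n_chunks):
        start = n_ctx + c * chunk_len
        for i in range(start, start + chunk_len):
            m[i, :n_ctx] = True              # full view of text/image context
            m[i, start : i + 1] = True       # causal within the current chunk
    return m

# 4 context tokens, then two action chunks of 3 tokens each.
mask = action_attention_mask(n_ctx=4, n_chunks=2, chunk_len=3)
# Token 7 (start of chunk 2) sees context 0-3 but not chunk 1 (tokens 4-6).
```

Because no action chunk attends to an earlier chunk, a bad earlier action cannot corrupt later generations, which is the error-isolation property the article attributes to the design.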
New Alibaba research: unifying VLA and world models
36Ke· 2025-10-29 10:32
Core Insights
- WorldVLA is a unified framework integrating vision-language-action (VLA) models with world models, developed collaboratively by Alibaba DAMO Academy, Hupan Laboratory, and Zhejiang University [1][4]

Group 1: Framework Overview
- The world model predicts future images from actions and images, aiming to learn the environment's underlying physical laws and thereby improve action-generation accuracy [2]
- The action model generates subsequent actions from image observations, which not only aids visual understanding but also enhances the world model's visual generation capability [2]
- Experimental results show WorldVLA significantly outperforming independent action and world models, with a mutual enhancement effect between the two [2][12]

Group 2: Model Architecture
- WorldVLA uses three independent tokenizers for encoding images, text, and actions, initialized from the Chameleon model [6]
- The image tokenizer is a VQ-GAN with a compression ratio of 16 and a codebook of 8192 codes, generating 256 tokens for 256×256 images and 1024 tokens for 512×512 images [6]
- The action tokenizer discretizes continuous robot actions into 256 intervals, representing each action with 7 tokens covering relative positions and angles [6]

Group 3: Training and Performance
- WorldVLA is trained autoregressively: all text, action, and image tokens are modeled in a causal manner [8]
- A novel attention mask for action generation ensures that the current action depends only on text and visual inputs, preventing errors in earlier actions from affecting later ones [10]
- Benchmarks show that even without pre-training, WorldVLA outperforms the discrete OpenVLA model, validating its architectural design [12]

Group 4: Mutual Benefits of Models
- Introducing the world model significantly improves the action model by letting it learn the system's underlying physical laws, which is crucial for precision tasks [15]
- The world model's predictive capability informs decision-making, optimizing action-selection strategies and improving task success rates [18]
- Conversely, the action model improves the quality of the world model's output, particularly when generating longer video sequences [21]

Group 5: Expert Opinions
- Chen Long, senior research director at Xiaomi Auto, emphasizes that VLA and world models need not be mutually exclusive; their combination can promote each other and advance embodied intelligence toward AGI [24]
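The 256-interval action discretization described above can be sketched as uniform quantization of each of the 7 action dimensions. The [-1, 1] normalization range is an assumption for illustration; the article does not state the actual bin bounds:

```python
import numpy as np

LOW, HIGH, BINS = -1.0, 1.0, 256   # assumed normalization range, 256 bins

def tokenize_action(action):
    """Quantize a 7-dim continuous action into 7 integer tokens in [0, 255]."""
    clipped = np.clip(action, LOW, HIGH)
    return np.round((clipped - LOW) / (HIGH - LOW) * (BINS - 1)).astype(int)

def detokenize_action(tokens):
    """Recover the bin-center approximation of the original action."""
    return LOW + tokens / (BINS - 1) * (HIGH - LOW)

# A 7-DoF action: relative position, relative angles, gripper.
a = np.array([0.12, -0.4, 0.0, 0.7, -0.9, 0.3, 1.0])
tokens = tokenize_action(a)          # 7 integer tokens for the autoregressive model
approx = detokenize_action(tokens)   # within half a bin of the original
```

Discretizing this way is what lets continuous robot actions share a single autoregressive vocabulary with image and text tokens, at the cost of a bounded quantization error of half a bin per dimension.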