WorldVLA
DAMO Academy Open-Sources Embodied-Brain Foundation Model RynnBrain, Topping 16 Benchmarks and Surpassing Gemini
Jin Rong Jie· 2026-02-10 02:56
RynnBrain achieves SOTA on 16 embodied benchmarks. RynnBrain also offers strong extensibility: it can be quickly post-trained into a variety of embodied models for navigation, planning, action, and more, positioning it as a potential foundation model for the embodied-AI industry. Take the embodied planning model as an example: such a model requires strong prediction and scene-parsing capabilities, yet with RynnBrain as the base, fine-tuning on only a few hundred samples already surpasses Gemini 3 Pro and easily reaches SOTA.

By releasing the complete inference and training code, DAMO Academy has open-sourced the full RynnBrain series of seven models, covering full-size base models and post-trained specialist models, including the industry's first 30B MoE embodied model, which activates only 3B parameters at inference yet outperforms 72B models in the field, allowing robots to move faster and more smoothly. DAMO Academy has also open-sourced a new evaluation benchmark, RynnBrain-Bench, for fine-grained spatiotemporal embodied tasks, filling a gap in the industry.

On February 10, Alibaba DAMO Academy released RynnBrain, a foundation model for the embodied-intelligence "brain", and open-sourced all seven models in the series, including the 30B MoE variant, in a single release. RynnBrain is the first to give robots spatiotemporal memory and spatial reasoning, a major leap in capability, setting new records (SOTA) on 16 open embodied benchmarks and surpassing top industry models such as Google's Gemini Robotics ER 1.5. ...
VLA and World Models Moving Toward Fusion and Unification...
自动驾驶之心· 2025-12-23 09:29
Core Viewpoint
- The article discusses the integration of two advanced directions in autonomous driving, Vision-Language-Action (VLA) and the World Model, highlighting their complementary nature and the trend toward their fusion for enhanced decision-making capabilities in autonomous systems [2][51].

Summary by Sections

Introduction to VLA and World Model
- VLA, or Vision-Language-Action, is a multimodal model that interprets visual inputs and human language to make driving decisions, aiming for natural human-vehicle interaction [8][10].
- The World Model is a generative spatiotemporal neural network that simulates future scenarios from high-dimensional sensor data, enabling vehicles to predict outcomes and make safer decisions [12][14].

Comparison of VLA and World Model
- VLA focuses on human interaction and interpretable end-to-end autonomous driving, while the World Model emphasizes future-state prediction and simulation for planning [15].
- VLA takes sensor data and explicit language commands as input, whereas the World Model relies on sequential sensor data and vehicle state [13][15].
- VLA outputs direct action control signals, while the World Model produces future scene states without direct driving actions [15] (a minimal interface sketch follows this summary).

Integration and Future Directions
- Both technologies share a common motivation in addressing the limitations of traditional modular systems and aim to enhance autonomous systems' cognitive and decision-making abilities [16][17].
- The ultimate goal of both is to enable machines to understand environments and make robust plans, with a focus on handling corner cases in driving scenarios [18][19].
- The article suggests that the future of autonomous driving may lie in the deep integration of VLA and the World Model, creating a comprehensive system that combines perception, reasoning, simulation, decision-making, and explanation [51].

Examples of Integration
- The article mentions several research papers that explore the fusion of VLA and World Models, such as 3D-VLA, which aims to enhance 3D perception and planning capabilities [24][26].
- Another example is WorldVLA, which combines action generation with environmental understanding, addressing the semantic and functional gaps between the two models [28][31].
- The IRL-VLA framework proposes a closed-loop reinforcement-learning approach for training VLA models without heavy reliance on simulation, enhancing their practical applicability [34][35].

Conclusion
- The article concludes that the integration of VLA and the World Model is a promising direction for the next generation of autonomous driving technologies, with ongoing development from various industry players [51].
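To make the input/output contrast in the comparison above concrete, here is a minimal Python sketch of the two interfaces as the summary describes them: a VLA policy maps sensors plus a language command to a control signal, while a world model rolls sensor history forward into predicted future states. The class and method names (`Observation`, `VLAPolicy`, `WorldModel`) are illustrative assumptions, not APIs from any cited paper.

```python
from dataclasses import dataclass
from typing import List

import numpy as np


@dataclass
class Observation:
    camera_frames: np.ndarray   # (num_cams, H, W, 3) sensor images
    ego_state: np.ndarray       # speed, yaw rate, etc.


class VLAPolicy:
    """VLA as summarized above: sensor data + language command in, control signal out."""

    def act(self, obs: Observation, instruction: str) -> np.ndarray:
        # A real model would fuse vision and language with a multimodal
        # backbone; here we just return a dummy (steer, accel) command.
        return np.zeros(2)


class WorldModel:
    """World model as summarized above: sequential sensor data and vehicle
    state in, predicted future scene states out (no driving action)."""

    def rollout(self, history: List[Observation], horizon: int = 8) -> List[Observation]:
        # A real model would generate future frames or latents; here we
        # simply repeat the last observation as a placeholder.
        return [history[-1]] * horizon
```

A fused system, the integration direction the article points to, would let the world model's rollouts score or constrain the actions proposed by the VLA policy.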
World Models and VLA Are Gradually Converging Toward Unification
自动驾驶之心· 2025-11-10 03:36
Core Viewpoint
- The integration of Vision-Language-Action (VLA) and World Model (WM) technologies is becoming increasingly evident, suggesting a trend toward their unification in the development of autonomous driving systems [2][4][6].

Summary by Sections

VLA and WM Integration
- Recent discussions highlight that VLA and WM should not be seen as opposing technologies but rather as complementary, with recent academic work supporting their combined application [2][3].
- The DriveVLA-W0 project demonstrates the feasibility of integrating VLA with a WM, indicating a path toward more advanced general artificial intelligence (AGI) [3].

Language and World Models
- Language models focus on abstract reasoning and high-level logic, while world models emphasize physical laws and low-level capabilities such as speed perception [3].
- Combining these models is essential for achieving stronger embodied intelligence, with various academic explorations already underway in this area [3].

Industry Trends and Future Directions
- The ongoing debate within the industry regarding VLA and WM is largely a matter of promotional terminology, with both approaches referencing similar technological foundations [6].
- Future autonomous driving training pipelines are expected to incorporate VLA, reinforcement learning (RL), and WM, all of which are crucial components [4][6].

Community and Knowledge Sharing
- The "Autonomous Driving Heart Knowledge Planet" community aims to provide a comprehensive platform for knowledge sharing among industry professionals and academics, facilitating discussions of technological advances and career opportunities [9][22].
- The community has gathered over 4,000 members and aims to grow to nearly 10,000, offering resources such as learning routes, Q&A sessions, and job referrals [9][22].

Educational Resources
- The community offers a variety of educational materials, including video tutorials and detailed learning paths for newcomers and experienced professionals alike, covering topics from end-to-end autonomous driving to multi-sensor fusion [17][23].
- Members can access a wealth of resources, including open-source projects, datasets, and industry insights, to deepen their understanding and skills in the autonomous driving field [23][41].
New Alibaba Research: Unifying VLA and the World Model
自动驾驶之心· 2025-11-06 08:43
Core Insights
- The article discusses the WorldVLA framework, which integrates Vision-Language-Action (VLA) models with world models to enhance AI's understanding of the environment [1][4][36]
- WorldVLA demonstrates superior performance compared to independent action and world models, showcasing a synergistic effect between the two [2][18]

Group 1: Framework Overview
- WorldVLA is designed as a unified autoregressive action world model that combines action and image understanding for improved predictive capability [4]
- The framework utilizes three independent tokenizers for encoding images, text, and actions, optimizing the representation of visual and action data [8]

Group 2: Model Performance
- Benchmark results indicate that WorldVLA outperforms discrete action models like OpenVLA, even without pre-training, validating its architectural design [19][21]
- The model's performance improves with higher image resolution, with 512×512 pixels showing significant gains over 256×256 pixels [22][23]

Group 3: Mutual Enhancement
- The world model enhances action generation by understanding physical laws and predicting future states based on current actions [14][25]
- Conversely, the action model improves the visual understanding of the world model, leading to more contextually relevant actions [17][30]

Group 4: Practical Applications
- WorldVLA's ability to predict the outcomes of candidate actions aids decision-making, thereby increasing task success rates (a minimal selection-loop sketch follows this summary) [26]
- The framework demonstrates practical advantages in complex scenarios, such as successfully executing tasks that pure world models struggle with [32]
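As a rough illustration of how a world model can aid decision-making (Group 4), the sketch below scores each candidate action by imagining its outcome with the world model and keeping the best-scoring one. The `world_model.predict` and `value_fn` interfaces are assumptions for illustration, not WorldVLA's actual API.

```python
import numpy as np


def select_action(candidates, world_model, value_fn, current_obs):
    """Pick the candidate action whose imagined outcome scores highest.

    `world_model.predict(obs, action)` returns a predicted next observation,
    and `value_fn(predicted_obs)` returns a task-specific score; both are
    assumed interfaces for this sketch.
    """
    best_action, best_score = None, -np.inf
    for action in candidates:
        predicted_obs = world_model.predict(current_obs, action)  # imagined next state
        score = value_fn(predicted_obs)                           # how desirable that state is
        if score > best_score:
            best_action, best_score = action, score
    return best_action
```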
New Alibaba Research Unifies VLA and the World Model
具身智能之心· 2025-10-31 00:04
Core Insights
- The article discusses the development of WorldVLA, a unified framework that integrates Vision-Language-Action (VLA) models with world models, aimed at enhancing AI's understanding of the world [2][5].

Group 1: Framework and Model Integration
- WorldVLA demonstrates significant performance improvements over independent action and world models, showcasing a mutual-enhancement effect [3][20].
- The framework combines the capabilities of action models and world models to predict future images and generate actions, addressing the limitations of each model when used separately [5][6].

Group 2: Model Architecture and Training
- WorldVLA utilizes three independent tokenizers for encoding images, text, and actions, with an image compression ratio of 16 and a codebook size of 8192 [9].
- The model employs a novel attention mask for action generation, allowing multiple actions to be generated in parallel while maintaining the integrity of the generated sequence (see the mask sketch after this summary) [12][13].

Group 3: Performance Metrics and Results
- Benchmark tests indicate that WorldVLA outperforms discrete action models even without pre-training, with notable improvements across performance metrics [20][22].
- The model's performance is positively correlated with image resolution, with 512×512 pixel resolution yielding significant gains over 256×256 resolution [22][24].

Group 4: Mutual Benefits of Model Types
- The integration of world models enhances action models by providing a deeper understanding of environmental physics, which is crucial for tasks requiring precision [26][27].
- Conversely, action models improve the visual understanding capabilities of world models, leading to more effective action generation [18][31].
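As a rough illustration of the action-generation attention mask described in Group 2, the sketch below builds a block mask in which every token may attend to the text-and-image prefix, and each action's tokens attend causally within that action but not to earlier actions, so mistakes in one generated action cannot leak into the next. The layout, token counts, and function name are assumptions based on the summary, not the paper's actual implementation.

```python
import torch


def build_action_chunk_mask(num_prefix: int, num_actions: int,
                            tokens_per_action: int) -> torch.Tensor:
    """Boolean attention mask (True = may attend) for a sequence laid out as
    [prefix tokens (text + image)] + [action_1 tokens] + ... + [action_k tokens]."""
    total = num_prefix + num_actions * tokens_per_action
    mask = torch.zeros(total, total, dtype=torch.bool)

    # Prefix attends causally over itself.
    mask[:num_prefix, :num_prefix] = torch.tril(
        torch.ones(num_prefix, num_prefix, dtype=torch.bool))

    for a in range(num_actions):
        start = num_prefix + a * tokens_per_action
        end = start + tokens_per_action
        mask[start:end, :num_prefix] = True              # every action sees text + image
        mask[start:end, start:end] = torch.tril(         # causal within its own action only
            torch.ones(tokens_per_action, tokens_per_action, dtype=torch.bool))
    return mask
```

Such a mask can be passed to a standard Transformer attention layer in place of the usual causal mask, which is what allows several actions of a chunk to be produced without conditioning on one another.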
New Alibaba Research: Unifying VLA and the World Model
36Ke· 2025-10-29 10:32
Core Insights
- WorldVLA is a unified framework that integrates Vision-Language-Action (VLA) models with world models, developed jointly by Alibaba DAMO Academy, Hupan Lab, and Zhejiang University [1][4].

Group 1: Framework Overview
- The world model predicts future images by understanding actions and images, aiming to learn the underlying physical laws of the environment to enhance action-generation accuracy [2].
- The action model generates subsequent actions based on image observations, which not only aids visual understanding but also enhances the visual generation capability of the world model [2].
- Experimental results indicate that WorldVLA significantly outperforms independent action and world models, showcasing a mutual-enhancement effect between the two [2][12].

Group 2: Model Architecture
- WorldVLA utilizes three independent tokenizers for encoding images, text, and actions, initialized from the Chameleon model [6].
- The image tokenizer employs a VQ-GAN model with a compression ratio of 16 and a codebook size of 8192, generating 256 tokens for 256×256 images and 1024 tokens for 512×512 images [6].
- The action tokenizer discretizes continuous robot actions into 256 intervals, representing each action with 7 tokens covering relative positions and angles (a minimal discretization sketch follows this summary) [6].

Group 3: Training and Performance
- WorldVLA employs an autoregressive training approach in which all text, action, and image tokens are modeled causally [8].
- A novel attention mask for action generation ensures that the current action depends solely on text and visual inputs, preventing errors from previous actions from affecting subsequent ones [10].
- Benchmark results show that even without pre-training, WorldVLA outperforms the discrete OpenVLA model, validating its architectural design [12].

Group 4: Mutual Benefits of Models
- Introducing the world model significantly enhances the action model by enabling it to learn the underlying physical laws of the system, which is crucial for tasks requiring precision [15].
- The world model's predictive capability informs decision-making, optimizing action-selection strategies and improving task success rates [18].
- Conversely, the action model improves the quality of the world model's output, particularly when generating longer video sequences [21].

Group 5: Expert Opinions
- Chen Long, Senior Research Director at Xiaomi Auto, emphasizes that VLA and world models need not be mutually exclusive; their combination can be mutually reinforcing, advancing embodied intelligence toward AGI [24].
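The action tokenizer described in Group 2 can be illustrated with a few lines of NumPy: each of the 7 action dimensions (relative position, rotation, gripper) is clipped to per-dimension bounds and mapped into one of 256 bins, giving 7 discrete tokens per action. Only the 256-bin, 7-token layout comes from the summary above; the bound arrays and function names are assumptions for this sketch.

```python
import numpy as np


def tokenize_action(action: np.ndarray, low: np.ndarray, high: np.ndarray,
                    num_bins: int = 256) -> np.ndarray:
    """Discretize a continuous 7-D robot action into 7 integer tokens,
    one bin index per dimension (assumed per-dimension bounds low/high)."""
    action = np.clip(action, low, high)
    scaled = (action - low) / (high - low + 1e-8)          # map to [0, 1]
    return np.floor(scaled * (num_bins - 1)).astype(np.int64)  # shape (7,)


def detokenize_action(tokens: np.ndarray, low: np.ndarray, high: np.ndarray,
                      num_bins: int = 256) -> np.ndarray:
    """Map the 7 discrete tokens back to approximate continuous values
    (the inverse used when executing generated actions)."""
    centers = (tokens.astype(np.float64) + 0.5) / num_bins
    return low + centers * (high - low)
```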
New Alibaba Research: Unifying VLA and the World Model
量子位· 2025-10-29 09:30
Core Insights
- WorldVLA is a unified framework that integrates Vision-Language-Action (VLA) models with world models, proposed by Alibaba DAMO Academy, Hupan Lab, and Zhejiang University [1][4]
- Experimental results indicate that WorldVLA significantly outperforms independent action models and world models, showcasing a mutual-enhancement effect [2]

Model Overview
- The framework combines three independent tokenizers for encoding images, text, and actions, utilizing a VQ-GAN model for image tokenization with a compression ratio of 16 and a codebook size of 8192 [8]
- The action tokenizer discretizes continuous robot actions into 256 intervals, representing each action with 7 tokens [8]

Model Design
- WorldVLA employs an autoregressive action world model to unify action and image understanding and generation (a sketch of the unified training objective follows this summary) [4]
- The model addresses limitations of existing VLA and world models by grounding action generation in an understanding of environmental physics [5][14]

Training and Performance
- WorldVLA is trained jointly on data from both the action model and the world model, enhancing its action-generation capability [13]
- The model's performance is positively correlated with image resolution, with 512×512 pixel resolution showing significant improvements over 256×256 [21][23]

Benchmark Results
- WorldVLA outperforms the discrete OpenVLA model even without pre-training, validating its architectural design [19]
- The model generates coherent and physically plausible states across a range of scenarios, outperforming pure world models [31][32]

Mutual Enhancement
- The world model enhances the action model by predicting how the environment changes in response to current actions, which is crucial for tasks requiring precision [25]
- Conversely, the action model improves the visual understanding of the world model, supporting better visual generation [17][30]
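A minimal sketch of the unified autoregressive objective described under Model Design: whatever the modality, the interleaved image, text, and action tokens are trained with plain next-token cross-entropy. The decoder-only model interface assumed here (logits of shape batch × sequence × vocab) is illustrative, not the paper's exact code.

```python
import torch
import torch.nn.functional as F


def autoregressive_step(model, tokens: torch.Tensor) -> torch.Tensor:
    """One training step of the unified objective: predict token t from all
    tokens before t, regardless of whether t is an image, text, or action token.

    `tokens`: (batch, seq_len) integer IDs from the shared vocabulary.
    `model`:  assumed decoder-only Transformer returning (batch, seq_len, vocab) logits.
    """
    logits = model(tokens[:, :-1])               # causal prediction of the next token
    targets = tokens[:, 1:]
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),     # (B*(T-1), V)
        targets.reshape(-1))                     # (B*(T-1),)
```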
VLA or World Model: Which Autonomous Driving Route Will Win?
自动驾驶之心· 2025-09-04 23:33
Core Viewpoint
- The article discusses the advances in and differences between Vision-Language-Action (VLA) models and World Models in the context of autonomous driving, emphasizing that while VLA is currently dominant, World Models possess inherent advantages in understanding and predicting physical reality [3][4][30].

Group 1: VLA vs. World Models
- VLA currently dominates the market, with over 95% of global models generating videos for autonomous driving training rather than being applied directly [3].
- World Models are considered to have a significant theoretical advantage, as they enable end-to-end learning without relying on language, directly linking perception to action [3][4].
- Proponents of World Models argue that they can understand the physical world and infer causal relationships, unlike VLA, which primarily mimics learned patterns [4][6].

Group 2: Development and Architecture
- The World Model framework consists of three main modules, a Vision model (V), a Memory RNN (M), and a Controller (C), which work together to learn visual representations and predict future states (a minimal sketch of this loop follows this summary) [11].
- The architecture of World Models has evolved, with notable developments like RSSM and JEPA, which combine deterministic and stochastic elements to enhance performance [15][17].
- JEPA, introduced in 2023, emphasizes predicting abstract representations rather than pixel-level details, significantly reducing computational requirements [17][19].

Group 3: Advantages and Challenges
- World Models have two main advantages: they require less computational power than VLA and can utilize unlabelled data from the internet for training [19].
- However, challenges remain, such as the need for diverse, high-quality data to accurately model physical environments, and the limitations of current sensors in capturing all necessary information [19][20].
- Issues like representation collapse and error accumulation in long-horizon prediction pose significant hurdles to the effective deployment of World Models [21][22].

Group 4: Future Directions
- The integration of VLA and World Models is seen as a promising direction, with frameworks like IRL-VLA combining the strengths of both approaches for enhanced performance in autonomous driving [22][28].
- The article suggests that while VLA is likely to prevail in the near term, the combination of VLA with World Model enhancements could lead to superior outcomes in the long run [30].
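As a minimal sketch of the classic V-M-C world-model loop mentioned in Group 2, the Python below wires a vision encoder, a memory RNN, and a small controller into a perceive-decide-remember loop. The class names, latent sizes, and the gym-style `env` interface (reset/step returning a 4-tuple) are assumptions for illustration, not code from any cited work.

```python
import numpy as np


class VisionModel:
    """V: encodes a raw observation into a compact latent vector."""
    def encode(self, frame: np.ndarray) -> np.ndarray:
        return np.zeros(32)          # placeholder latent


class MemoryRNN:
    """M: keeps a recurrent state used to predict how the latent evolves."""
    def __init__(self):
        self.hidden = np.zeros(64)

    def step(self, z: np.ndarray, action: np.ndarray) -> np.ndarray:
        # A real implementation would update an RNN/RSSM state; placeholder here.
        self.hidden = 0.9 * self.hidden
        return self.hidden


class Controller:
    """C: a small policy acting on the latent and the memory state."""
    def act(self, z: np.ndarray, h: np.ndarray) -> np.ndarray:
        return np.zeros(2)           # e.g. (steer, accel)


def control_loop(env, v: VisionModel, m: MemoryRNN, c: Controller, steps: int = 100):
    frame = env.reset()
    for _ in range(steps):
        z = v.encode(frame)          # perceive
        a = c.act(z, m.hidden)       # decide
        m.step(z, a)                 # remember / predict
        frame, _, done, _ = env.step(a)
        if done:
            break
```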
FlowVLA: Cracking the "Physical Distortion" Problem of VLA Models, Another Upgrade for Robot World Modeling
具身智能之心· 2025-08-29 00:03
Core Viewpoint
- The article discusses the limitations of traditional Vision-Language-Action (VLA) models and introduces FlowVLA, a new framework that addresses these issues by implementing a Visual Chain of Thought (Visual CoT) principle, enhancing the model's ability to predict future frames through structured physical reasoning rather than mere pixel replication [5][8][36].

Group 1: Background and Current State
- VLA models, particularly those pre-trained as world models, show significant potential for general-purpose robotics, primarily through large autoregressive Transformers that learn environmental dynamics from vast amounts of video data [6][7].
- Existing models have critical flaws, including task confusion leading to prediction failures, inefficient knowledge transfer between passive observation and active control, and entangled learning of dynamics and appearance [7].

Group 2: Contributions of FlowVLA
- FlowVLA introduces a new learning framework that emphasizes structured physical reasoning by requiring the model to infer motion dynamics before predicting future frames [8][10].
- The model unifies appearance and motion reasoning within a single autoregressive Transformer, maintaining parameter efficiency and architectural simplicity [9][10].
- Experimental results validate FlowVLA's superior performance across various robotic manipulation benchmarks, demonstrating enhanced sample efficiency and bridging the gap between pre-training and policy fine-tuning [10][20].

Group 3: Research Content
- The Visual CoT reasoning process decomposes frame prediction into a causal chain of "current frame → optical flow → future frame", allowing the model to separate dynamics from appearance (see the sketch after this summary) [12][14].
- The two-phase training paradigm consists of a pre-training phase focused on world-model learning and a fine-tuning phase that adapts the model to control tasks [15][16].

Group 4: Experimental Analysis
- FlowVLA outperforms existing methods on all task suites of the LIBERO dataset, particularly excelling on long-horizon tasks, showcasing a robust understanding of physical dynamics [20][21].
- On the SimplerEnv dataset, FlowVLA demonstrates strong adaptability to visual domain shifts, achieving significant performance improvements on tasks where other models struggle [22][23].
- The model's sample efficiency is validated: it requires only one-third of the training steps to reach peak performance compared with baseline models, with a 55% higher peak success rate in low-data scenarios [30][32].

Group 5: Key Component Validation
- Ablation studies on the LIBERO-10 benchmark highlight the importance of the Visual CoT structure, the flow loss, and the interleaved sequence format, confirming their critical roles in the model's performance [33][34].

Group 6: Comparison with Related Work
- FlowVLA distinguishes itself from traditional VLA models by prioritizing dynamic understanding and establishing a robust world model before adapting to control tasks, laying a solid foundation of physical knowledge [35].
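To make the Visual CoT idea in Group 3 concrete, the sketch below interleaves tokens in the causal order current frame → optical flow → future frame, so the model must commit to the motion before rendering appearance, and applies a next-token loss with a separate weight on the flow tokens as a stand-in for the flow loss mentioned in the ablations. The delimiter IDs and the weighting scheme are assumptions, not details taken from the paper.

```python
import torch
import torch.nn.functional as F


def visual_cot_sequence(frame_toks, flow_toks, next_frame_toks, BOF=5, EOF=6):
    """Lay out one training sample in the assumed Visual CoT order:
    current-frame tokens, then flow tokens (delimited), then future-frame tokens."""
    return frame_toks + [BOF] + flow_toks + [EOF] + next_frame_toks


def cot_loss(logits: torch.Tensor, targets: torch.Tensor,
             flow_mask: torch.Tensor, flow_weight: float = 1.0) -> torch.Tensor:
    """Next-token cross-entropy with a separate weight on flow tokens.

    logits: (T, V) predictions, targets: (T,) token IDs,
    flow_mask: (T,) bool marking which target positions are flow tokens.
    """
    per_token = F.cross_entropy(logits, targets, reduction="none")
    weights = torch.where(flow_mask,
                          torch.full_like(per_token, flow_weight),
                          torch.ones_like(per_token))
    return (weights * per_token).mean()
```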
A First: World Model and Action Model Fused, the Fully Autoregressive Model WorldVLA Is Here
机器之心· 2025-07-03 08:01
Core Viewpoint
- Alibaba's DAMO Academy has introduced WorldVLA, a model that integrates a World Model and an Action Model into a unified autoregressive framework, enhancing understanding and generation across text, images, and actions [1][4].

Summary by Sections

Research Overview
- The development of Vision-Language-Action (VLA) models has become a significant focus in robotic action modeling; such models are typically built on large-scale pretrained multimodal large language models (MLLMs) with added action-output capabilities [4].
- Existing VLA models often lack a deep understanding of actions, treating them merely as outputs rather than analyzing them as inputs [5].

Model Description
- WorldVLA addresses the limitations of both VLA models and World Models by using a unified autoregressive mechanism for action and image understanding and generation [5][10].
- It employs three independent encoders for processing image, text, and action data, sharing the same vocabulary to facilitate cross-modal tasks [12].

Mechanism and Strategy
- The World Model component generates visual representations based on input actions, learning the physical dynamics of the environment, while the Action Model enhances visual understanding [7].
- An action attention-masking strategy is introduced to mitigate error accumulation when generating multiple actions, significantly improving performance on action-chunking tasks [8][14].

Experimental Results
- On the LIBERO benchmark, WorldVLA achieved a 4% improvement in grasp success rate over traditional action models and a 10% reduction in Fréchet Video Distance (FVD) compared with traditional world models [8].
- The attention-mask strategy improved grasp success rates by 4% to 23% on action-chunking tasks [8].

Comparative Analysis
- WorldVLA outperformed other models across various metrics, demonstrating its effectiveness in integrating action and world modeling (a sketch of the two joint-training data formats follows this summary) [18].
- The model's ability to generate the next frame based on actions and images showcases its advanced visual-prediction capability [24].
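The joint training described above mixes two sample formats in one autoregressive model: action-model samples that map (text, image) to an action, and world-model samples that map (image, action) to the next image. The sketch below lays out both formats over a shared vocabulary and samples a mixed batch; the special-token IDs and the 50/50 mixing ratio are assumptions, not reported hyperparameters.

```python
import random

BOA, EOA, BOI, EOI = "<boa>", "<eoa>", "<boi>", "<eoi>"  # placeholder special tokens


def action_model_sample(text_toks, image_toks, action_toks):
    """(text, image) -> action: the action-model format described above."""
    return text_toks + [BOI] + image_toks + [EOI] + [BOA] + action_toks + [EOA]


def world_model_sample(image_toks, action_toks, next_image_toks):
    """(image, action) -> next image: the world-model format described above."""
    return ([BOI] + image_toks + [EOI] + [BOA] + action_toks + [EOA]
            + [BOI] + next_image_toks + [EOI])


def mixed_batch(action_data, world_data, batch_size=8, world_ratio=0.5):
    """Sample a joint-training batch mixing both formats (assumed 50/50 ratio).

    `action_data` holds (text, image, action) token tuples; `world_data` holds
    (image, action, next_image) token tuples.
    """
    batch = []
    for _ in range(batch_size):
        if random.random() < world_ratio:
            batch.append(world_model_sample(*random.choice(world_data)))
        else:
            batch.append(action_model_sample(*random.choice(action_data)))
    return batch
```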