World Models
A survey of the latest world models in embodied AI: 250 papers covering the mainstream frameworks and tasks
具身智能之心· 2025-10-30 00:03
Core Insights
- The article surveys world models in embodied AI, emphasizing their role as internal simulators that help agents perceive environments, take actions, and predict future states [1][2].

Group 1: World Models Overview
- Research on world models has grown at an unprecedented pace with the explosion of generative models, producing a complex array of architectures and techniques that lack a unified framework [2].
- A novel three-axis classification is proposed to categorize existing world models by functionality, temporal modeling, and spatial representation [6].

Group 2: Mathematical Principles
- World models are typically formulated as partially observable Markov decision processes (POMDPs), focusing on learning compact latent states from partial observations and the transition dynamics between states [4].
- Training commonly follows a "reconstruction-regularization" paradigm: the model is encouraged to reconstruct observations from latent states while aligning posterior inference with prior predictions [9].

Group 3: Functional Positioning
- World models divide into decision-coupled and general-purpose types: the former are optimized for specific decision tasks, the latter serve as task-agnostic simulators [6][15][16].
- Decision-coupled models, like the Dreamer series, excel at task performance but may struggle to generalize because of their task-specific representations [15].
- General-purpose models aim for broader predictive capability and transferability across tasks, though they face challenges in computational complexity and real-time inference [16].

Group 4: Temporal Modeling
- Temporal modeling divides into sequential reasoning and global prediction: the former simulates step by step, the latter predicts entire future sequences in parallel [20][23].
- Sequential reasoning suits closed-loop control but can accumulate error over long prediction horizons [20].
- Global prediction improves computational efficiency and reduces error accumulation but may miss detailed local dynamics [23].

Group 5: Spatial Representation
- Strategies for spatial representation include global latent vectors, token feature sequences, spatial latent grids, and decomposed rendering representations [25][28][34][35].
- Global latent vectors compress scene state into low-dimensional variables, enabling real-time control but potentially losing fine-grained spatial information [28].
- Token feature sequences represent complex scenes in detail but demand extensive data and compute [29].
- Spatial latent grids preserve local topology and are prevalent in autonomous driving; decomposed rendering supports high-fidelity image generation but struggles with dynamic scenes [34][35].

Group 6: Data Resources and Evaluation Metrics
- Data resources for embodied AI fall into simulation platforms, interactive benchmarks, offline datasets, and real robot platforms, each serving distinct purposes in training and evaluating world models [37].
- Evaluation metrics cover pixel-level generation quality, state/semantic consistency, and task performance, with recent trends emphasizing physical compliance and causal consistency [40].
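The "reconstruction-regularization" objective in Group 2 is commonly written as a variational (ELBO-style) loss over a latent dynamics model. The formulation below uses standard notation ($z_t$ latent state, $o_t$ observation, $a_t$ action) and is an illustrative sketch of that paradigm, not the survey's exact equation:

```latex
\mathcal{L}_t =
\underbrace{\mathbb{E}_{q_\phi}\!\left[-\log p_\theta(o_t \mid z_t)\right]}_{\text{reconstruction}}
+ \beta\,
\underbrace{D_{\mathrm{KL}}\!\left(
  q_\phi(z_t \mid z_{t-1}, a_{t-1}, o_t)\,\|\,
  p_\theta(z_t \mid z_{t-1}, a_{t-1})
\right)}_{\text{regularization}}
```

The reconstruction term forces $z_t$ to retain observation content, while the KL term pulls the posterior (which sees $o_t$) toward the prior transition model (which does not), so that the prior alone can be rolled forward to predict future states.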
New Alibaba research unifies VLA and world models
36Kr· 2025-10-29 10:32
Core Insights
- WorldVLA is a unified framework that integrates vision-language-action models (VLA) with world models, developed jointly by Alibaba DAMO Academy, Hupan Lab, and Zhejiang University [1][4].

Group 1: Framework Overview
- The world model predicts future images by understanding actions and images, aiming to learn the environment's underlying physical laws and thereby improve action generation accuracy [2].
- The action model generates subsequent actions from image observations; this not only aids visual understanding but also strengthens the world model's visual generation capability [2].
- Experimental results indicate that WorldVLA significantly outperforms standalone action and world models, showing a mutual enhancement effect between the two [2][12].

Group 2: Model Architecture
- WorldVLA uses three independent tokenizers to encode images, text, and actions, initialized from the Chameleon model [6].
- The image tokenizer is a VQ-GAN with a compression ratio of 16 and a codebook size of 8192, producing 256 tokens for 256×256 images and 1024 tokens for 512×512 images [6].
- The action tokenizer discretizes continuous robot actions into 256 intervals, representing each action with 7 tokens covering relative positions and angles [6].

Group 3: Training and Performance
- WorldVLA is trained autoregressively: all text, action, and image tokens are modeled in a causal sequence [8].
- A novel attention mask for action generation ensures the current action depends only on text and visual inputs, preventing errors in earlier actions from propagating to later ones [10].
- Benchmarks show that even without pre-training, WorldVLA outperforms the discrete OpenVLA model, validating its architectural design [12].
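The action attention mask in Group 3 can be sketched as follows. Starting from a standard causal mask, positions that generate action tokens are additionally blocked from attending to earlier action tokens, so each action is conditioned only on text and visual context. The function name and mask layout are illustrative assumptions, not WorldVLA's actual code:

```python
def action_attention_mask(token_types):
    """Build an attention mask for a token sequence.

    token_types: list of "text", "image", or "action", one per position.
    Returns mask[i][j] = True if position i may attend to position j.
    Sketch only: real chunked action tokens would need finer-grained rules.
    """
    n = len(token_types)
    # Standard causal mask: each position sees itself and everything before it.
    mask = [[j <= i for j in range(n)] for i in range(n)]
    for i, t in enumerate(token_types):
        if t == "action":
            for j in range(i):
                # Block attention to *earlier* action tokens, so errors in
                # past actions cannot propagate into the current action.
                if token_types[j] == "action":
                    mask[i][j] = False
    return mask
```

For example, with types `["text", "image", "action", "image", "action"]`, the second action token still sees the text and both images but not the first action token.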
Group 4: Mutual Benefits of the Models
- Introducing the world model significantly improves the action model by letting it learn the system's underlying physical laws, which is crucial for tasks requiring precision [15].
- The world model's predictive capability informs decision-making, optimizing action selection strategies and improving task success rates [18].
- Conversely, the action model improves the quality of the world model's output, particularly when generating longer video sequences [21].

Group 5: Expert Opinions
- Chen Long, Senior Research Director at Xiaomi Auto, argues that VLA and world models need not be mutually exclusive; combining them lets each promote the other, advancing embodied intelligence toward AGI [24].
New Alibaba research unifies VLA and world models
量子位· 2025-10-29 09:30
Core Insights
- WorldVLA is a unified framework integrating vision-language-action models (VLA) with world models, proposed by Alibaba DAMO Academy, Hupan Lab, and Zhejiang University [1][4]
- Experimental results indicate that WorldVLA significantly outperforms standalone action models and world models, showing a mutual enhancement effect [2]

Model Overview
- The framework combines three independent tokenizers for images, text, and actions, using a VQ-GAN for image tokenization with a compression ratio of 16 and a codebook size of 8192 [8]
- The action tokenizer discretizes continuous robot actions into 256 intervals, representing each action with 7 tokens [8]

Model Design
- WorldVLA employs an autoregressive action world model to unify the understanding and generation of actions and images [4]
- The model addresses limitations of existing VLA and world models by grounding action generation in an understanding of environmental physics [5][14]

Training and Performance
- WorldVLA is trained jointly on data from both action models and world models, strengthening its action generation capabilities [13]
- Performance correlates positively with image resolution: 512×512 shows significant improvements over 256×256 [21][23]

Benchmark Results
- WorldVLA outperforms discrete OpenVLA models even without pre-training, validating its architectural design [19]
- The model generates coherent, physically plausible states across scenarios, outperforming pure world models [31][32]

Mutual Enhancement
- The world model improves the action model by predicting environmental state changes conditioned on current actions, crucial for tasks requiring precision [25]
- Conversely, the action model improves the world model's visual understanding, supporting better visual generation [17][30]
GigaVision and the Hubei Humanoid Robot Innovation Center will jointly build an embodied intelligence data factory
Xin Lang Cai Jing· 2025-10-28 15:33
Core Insights
- GigaVision and the Hubei Humanoid Robot Innovation Center have established a strategic partnership to build a "world-model-driven, virtual-physical integrated embodied intelligent data factory" [1]
- The collaboration includes the launch of GigaBrain-0, a foundational model that uses world-model-generated data for real-robot generalization of vision-language-action (VLA) capabilities [1]

Group 1
- The strategic cooperation aims to accelerate the development of embodied intelligence technologies [1]
- The GigaBrain-0 model represents a significant advance in integrating visual, language, and action capabilities [1]
- The partnership reflects the growing trend of combining AI and robotics in industrial applications [1]
World's first world-model embodied intelligence data factory lands in Wuhan
Zhong Guo Xin Wen Wang· 2025-10-28 09:10
Group 1
- The world's first "world-model-driven, virtual-physical integrated embodied intelligent data factory" project was signed in Wuhan, marking a significant advance for the humanoid robotics industry [1][3]
- The Hubei Humanoid Robot Innovation Center has become China's largest and most diverse professional training platform for humanoid robots, with over 80 key enterprises in the sector [1][3]
- The factory will use world model technology to generate large-scale, diverse synthetic data, supporting a comprehensive data system for embodied intelligence [3]

Group 2
- The factory aims to supply robots with ample "learning materials," enabling them to learn and grow beyond standard programmed behavior [3]
- World model technology acts as an "imagination engine," allowing robots to perceive environmental changes and predict future actions without pre-programming [3]
- The factory is expected to help Hubei become a globally recognized hub for the humanoid robotics industry [3]
Qualcomm launches AI chips to compete with NVIDIA; Meituan rider social security subsidies go live | Tech Trends
21 Shi Ji Jing Ji Bao Dao· 2025-10-28 03:49
Group 1: Technology Developments
- Meituan launched LongCat-Video, a 13.6-billion-parameter model capable of generating 5-minute videos, aiming to deepen AI's understanding of the world through video generation tasks [2]
- Qualcomm introduced new AI chips, the AI200 and AI250, designed for data center AI inference with optimized performance and lower total cost of ownership [11]
- Changjiang Storage announced mass production of fourth-generation DDR5 RCD chips with data transfer rates up to 7200 MT/s, a 12.5% improvement over the previous generation [12]

Group 2: Business Initiatives
- JD.com launched the "National Good Car" delivery center recruitment plan, aiming to build a nationwide sales and service network by integrating automotive service providers [3]
- Meituan announced nationwide social security subsidies for delivery riders, who can choose their insurance payment locations starting in November [4]
- Yingyi Intelligent Manufacturing secured over 100 assembly orders from leading clients, deepening its collaboration in hardware manufacturing and AI model development [7]

Group 3: Financial Activities
- Junsheng Electronics plans to issue approximately 155 million shares in Hong Kong at a maximum price of HKD 23.60 per share to fund R&D and global expansion [13]
- Eagle Semiconductor completed a B+ round of over 700 million yuan, a record for VCSEL startups in China [15]
- Guoyi Quantum raised 131 million yuan in strategic financing for R&D and market expansion [19]

Group 4: Market Expansion
- Didi launched 500 electric vehicles in Mexico, its first standardized ride-hailing service in Latin America [5]
- Hengtong Optic-Electric won marine energy contracts totaling 1.868 billion yuan, including a 1-million-kW offshore wind project [8]
- Zhenyu Technology plans to invest 2.11 billion yuan in precision components and humanoid robot modules to expand its production capabilities [10]
World model == VQA? Robots don't need to imagine images; predicting semantics is enough
机器之心· 2025-10-28 00:41
Core Insights
- The article questions whether precise, detailed visual prediction is truly necessary for decision-making in world models for AI [1][6]
- It introduces the Semantic World Model (SWM), which predicts semantic information about future outcomes rather than generating visual frames [9][18]

Summary by Sections

World Models and Their Limitations
- World models let AI learn the dynamics of the world and predict future events from current states [6]
- Traditional models often generate realistic images yet miss the critical semantic details needed for decision-making [7][8]

Semantic World Model (SWM)
- SWM reframes world modeling as a visual question-answering (VQA) problem, focusing on task-relevant interactions rather than raw visual data [8][9]
- SWM uses a vision-language model (VLM) to answer questions about future actions and their semantic effects [9][11]

Training and Data Generation
- SWM can be trained on low-quality sequence data, including both expert and non-expert trajectories, making it versatile [15]
- A dataset called SAQA (State-Action-Question-Answer) is generated to train the model effectively [22]

Experimental Results
- SWM answers questions about future outcomes with high accuracy and generalizes to new scenarios [17]
- In multi-task simulations, SWM significantly outperforms baseline models, reaching success rates of 81.6% on LangTable and 76% on OGBench [30][34]

Generalization and Robustness
- SWM retains the underlying VLM's generalization capabilities, improving performance even under new object combinations and background changes [39][41]
- Its attention mechanisms focus on task-relevant information, indicating an ability to generalize across scenarios [41]
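A minimal sketch of the VQA framing described above: instead of rendering a future frame, a world-model query becomes a question posed to a VLM about the semantic outcome of an action. The tuple fields, prompt format, and the `vlm` callable are illustrative assumptions, not SWM's actual API:

```python
from dataclasses import dataclass
from typing import Any, Callable, Sequence

@dataclass
class SAQAExample:
    """One State-Action-Question-Answer tuple (field names assumed)."""
    state: Any           # current observation, e.g. a camera frame
    action: Sequence     # candidate action or action sequence
    question: str        # query about the semantic outcome
    answer: str          # ground-truth label used for training

def semantic_rollout(vlm: Callable[[Any, str], str], ex: SAQAExample) -> str:
    """Predict the semantic effect of ex.action by asking a question
    about the future, rather than generating future pixels."""
    prompt = f"After executing action {list(ex.action)}, {ex.question}"
    return vlm(ex.state, prompt)
```

During policy selection, a planner can score candidate actions by whether the VLM's answer matches the desired outcome, without ever synthesizing an image.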
Zheng Zhihua apologizes for his "crawling" remark; Spring Airlines recruits married-with-children "air sisters"; Zong Fuli confidante Zhu Lidan departs; Anhui becomes the top auto-producing province; a Changan Auto 4S store catches fire | Bang Morning Report
创业邦· 2025-10-28 00:10
Group 1
- Zhu Lidan, the legal representative of the Hongsheng Group controlled by Zong Fuli, has left the company; her office has been reassigned to Kou Jing [3][4]
- Zhu Lidan was a core member of Hongsheng Group with a long-standing working relationship with Zong Fuli [3]
- Reports say Zhu Lidan was summoned by the relevant authorities twice since September, and her position was previously marked "to be determined" [4]

Group 2
- Changan Automobile confirmed a fire at a 4S store in Anhui but has released no information on the cause [6]
- Meituan announced a nationwide rollout of pension insurance subsidies for delivery riders starting in November, the first such scheme open to all riders [12][13]
- Spring Airlines launched a recruitment campaign for "air sisters," targeting married women with children and raising the age limit to 40 [13]

Group 3
- JD.com has been granted an insurance brokerage license in Hong Kong, marking its entry into that financial market [13]
- Tesla's board chair warned that if Elon Musk's $1 trillion compensation plan is not approved, the company may face significant value loss [13]
- Education company Gaotu is under investigation for allegedly organizing illegal offline subject tutoring in Beijing [13]

Group 4
- Amazon plans to invest over €1.4 billion in the Netherlands over the next three years, its largest commitment since entering that market [14]
- Porsche responded to reports of multiple gasoline-vehicle discontinuations, clarifying that the fuel version of the Macan is not affected [15]
- AI startup Mercor raised $350 million at a $10 billion valuation, with participation from notable investors [15][16]

Group 5
- Global mobile-game in-app purchase revenue is expected to grow 6% to $85.4 billion by 2025 [20]
- China is projected to discard over 400 million mobile phones annually, with low recycling prices and privacy concerns hindering recovery efforts [20]
- Anhui has become the top province in automobile production, with 15 provinces expected to produce over one million vehicles this year [20]
What autonomous driving directions are still worth pursuing at this year's CVPR?
自动驾驶之心· 2025-10-28 00:03
Core Viewpoint
- The article stresses the value of targeted guidance and mentorship for students aiming to publish high-quality papers in top conferences such as CVPR and ICLR, highlighting the need for strategic effort in the final stages of submission [1][2][4]

Group 1: Submission Guidance
- Most accepted papers at past conferences deliver localized breakthroughs and verifiable improvements closely aligned with each year's main themes [1]
- The article suggests the main theme for CVPR 2026 is likely to be "world models," indicating a strategic direction for potential submissions [1]
- Students are encouraged to draw on predecessors' experience to raise submission quality, particularly in the final stages of preparation [2]

Group 2: Mentorship and Support
- The organization, "Autonomous Driving Heart," is described as China's largest AI technology media platform, with extensive academic resources and deep familiarity with the challenges of interdisciplinary fields like autonomous driving and robotics [3]
- Its mentorship program reports a 96% acceptance rate for students over the past three years, indicating the effectiveness of its guidance [5]
- Personalized support includes help with research thinking, familiarization with research processes, and practical application of theoretical models [7][13]

Group 3: Program Structure and Offerings
- Structured support includes personalized paper guidance, real-time interaction with mentors, and unlimited access to recorded sessions for review [13]
- The program serves various academic levels and goals, from foundational courses for beginners to advanced mentorship for experienced researchers [17][19]
- Outstanding students may receive recommendations to prestigious institutions and direct referrals to leading tech companies [19]
TeraSim World: rebuilding a "Tesla-style" world model the open-source way
自动驾驶之心· 2025-10-28 00:03
Core Viewpoint
- Tesla has showcased its internal World Model, a neural-network-driven virtual world generator that synthesizes high-resolution video from eight camera perspectives conditioned on vehicle state and control inputs, enabling real-time environmental prediction and closed-loop validation [2][6]

Group 1: Tesla's World Model
- The model can replay historical problem scenarios and inject new adversarial events in a virtual environment for testing and reinforcement learning [2]
- It learns a general "perception-action-world change" mapping applicable to other platforms such as robotics, forming a basis for general physical intelligence [2]

Group 2: TeraSim World Framework
- A research team from the University of Michigan, SaferDrive AI, the University of Hong Kong, and Tsinghua University has developed TeraSim World, an open-source framework that delivers generation and evaluation capabilities similar to Tesla's World Model without requiring real maps or sensor backgrounds [5][6]
- TeraSim World automatically generates city environments and traffic behaviors with AI, creating a fully data-driven, reproducible, and scalable world-model platform [5]

Group 3: System Features
- TeraSim World provides a modular, fully automated data-synthesis pipeline for generating realistic and safety-critical data for end-to-end autonomous driving [7]
- The system retrieves real-world road maps and converts them into simulation-ready formats, automatically generating digital maps from user input [10][11]
- It simulates realistic traffic conditions by automatically pulling real-time traffic data that reflects local traffic patterns [13]

Group 4: Agent and Sensor Simulation
- The agent simulation component makes virtual vehicles, pedestrians, and cyclists behave like their real-world counterparts, incorporating human driving characteristics [16]
- TeraSim World introduces safety-critical scenarios grounded in real-world accident probabilities, ensuring generated events are both risky and realistic [17]
- Sensor simulation generates realistic camera inputs and can extend to other sensor types, using NVIDIA's open-source Cosmos models for high-resolution, time-synchronized multi-view video generation [19][22][25]

Group 5: Automated Stress Testing
- TeraSim World supports automated full-stack stress testing, generating and validating risk scenarios to probe the stability and safety boundaries of autonomous driving systems [30]
- The framework can inject dynamic and static risks, such as sudden stops or environmental changes, to evaluate system responses under diverse conditions [30]

Group 6: Conclusion and Future Plans
- By combining agent and sensor simulation, TeraSim World provides an end-to-end data generation pipeline for training and testing autonomous driving systems without real-world data collection [31]
- The project aims to build a large-scale synthetic driving dataset and expand to multi-modal sensor simulation, establishing an open virtual testing ground for researchers and developers [32]
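The map-to-sensors pipeline described in Groups 3 through 5 can be sketched end to end as a sequence of stages. Every function below is a runnable stand-in for the corresponding stage, not TeraSim World's real API:

```python
def fetch_map(location: str) -> dict:
    """Stand-in for retrieving a real-world road map (e.g. OpenStreetMap data)."""
    return {"location": location, "roads": ["main_st", "state_st"]}

def import_traffic(scene: dict) -> dict:
    """Stand-in for pulling real-time traffic data to reflect local patterns."""
    scene["traffic"] = {"density": "rush_hour"}
    return scene

def spawn_agents(scene: dict) -> dict:
    """Stand-in for agent simulation: human-like vehicles, pedestrians, cyclists."""
    scene["agents"] = ["vehicle", "pedestrian", "cyclist"]
    return scene

def inject_risk(scene: dict, risk: str) -> dict:
    """Stand-in for accident-probability-based safety-critical event injection."""
    scene["events"] = [risk]
    return scene

def render_sensors(scene: dict) -> dict:
    """Stand-in for sensor simulation (e.g. Cosmos-based multi-view video)."""
    scene["video"] = f"multiview:{scene['location']}"
    return scene

def build_scenario(location: str, risk: str = "sudden_stop") -> dict:
    """Run the full map -> traffic -> agents -> risk -> sensors pipeline."""
    scene = spawn_agents(import_traffic(fetch_map(location)))
    return render_sensors(inject_risk(scene, risk))
```

Because each stage only consumes and enriches a scene description, stress testing amounts to sweeping `risk` over a catalog of dynamic and static events and replaying the rendered output against the driving stack.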