World Models
World models get an open-source foundation: Emu3.5 achieves multimodal SOTA, outperforming Nano Banana
量子位· 2025-10-30 10:31
Core Insights
- The article discusses the launch of Emu3.5, the latest open-source native multimodal world model, developed by the Beijing Academy of Artificial Intelligence (BAAI) [1]
- Emu3.5 is designed to enhance the understanding of dynamic physical worlds, moving beyond mere visual realism to a deeper comprehension of context and interactions [8][10]

Group 1: Model Capabilities
- Emu3.5 can perform high-precision tasks such as erasing handwritten marks and generating dynamic 3D environments from a first-person perspective [2][3]
- The model excels at generating coherent, logical outputs, simulating dynamic physical worlds, and maintaining spatial consistency during user interactions [11][20]
- It can execute complex tasks such as organizing a desktop by following a series of instructions, showcasing its ability to understand long-horizon sequences and spatial relationships [23][24][28]

Group 2: Technical Innovations
- Emu3.5 is a 34-billion-parameter model built on a standard decoder-only Transformer architecture, handling tasks from visual storytelling to image editing [31]
- The model was pre-trained on over 10 trillion tokens of multimodal data, primarily sourced from internet videos, allowing it to learn temporal continuity and causal relationships effectively [32]
- A powerful visual tokenizer with a vocabulary of 130,000 visual tokens enables high-fidelity image reconstruction at resolutions up to 2K [33]

Group 3: Performance and Comparisons
- Emu3.5 matches or surpasses Gemini-2.5-Flash-Image on several authoritative benchmarks, particularly in text rendering and multimodal generation tasks [18]
- Its ability to maintain consistency and style across multiple images and instructions is rated at the industry's top level [29]

Group 4: Future Implications
- The open-source release lets global developers and researchers build on Emu3.5's capabilities without starting from scratch, potentially transforming various industries [36]
- Its advances in generating realistic videos and intelligent agents open up broad possibilities for practical applications across sectors [37]
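The token budget of a discrete visual tokenizer like Emu3.5's follows directly from its spatial downsampling factor. The article gives only the codebook size (~130,000 entries) and the maximum resolution (2K); the downsampling factor of 16 below is an assumption for illustration, not a confirmed Emu3.5 detail:

```python
# Hedged sketch: token count of a VQ-style visual tokenizer, assuming
# each token covers a downsample x downsample pixel patch. The factor
# of 16 is an assumption; the article does not state Emu3.5's value.

def visual_token_count(height: int, width: int, downsample: int = 16) -> int:
    """Number of discrete tokens emitted for an image of the given size."""
    assert height % downsample == 0 and width % downsample == 0
    return (height // downsample) * (width // downsample)

# Under this assumption, a 2048x2048 ("2K") image becomes a 128x128
# grid of tokens, each an index into the ~130k-entry codebook.
print(visual_token_count(2048, 2048))  # 16384
```

The quadratic growth in token count with resolution is why high-resolution reconstruction is a nontrivial property for an autoregressive multimodal model.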
Tsinghua's Chen Jianyu team and Stanford's Chelsea Finn lab release Ctrl-World, a controllable world model that lets robots iterate in imagination
机器人大讲堂· 2025-10-30 10:18
Core Insights
- The article discusses "Ctrl-World," a controllable generative world model for robot manipulation developed by Chelsea Finn's team at Stanford University and Chen Jianyu's team at Tsinghua University, which significantly improves robot training efficiency and effectiveness [1][9][28]

Group 1: Research Background
- Current challenges in robot training include the high cost of policy evaluation and insufficient data for policy iteration, particularly in open-world scenarios [7][8]
- Traditional world models have limitations such as single-view predictions leading to hallucinations, imprecise action control, and poor long-term consistency [9][8]

Group 2: Ctrl-World Innovations
- Ctrl-World introduces three key innovations that address the limitations of traditional models: multi-view joint prediction, frame-level action control, and pose-conditioned memory retrieval [9][11][15]
- Multi-view inputs reduce hallucination rates and improve accuracy in predicting robot-object interactions [13][14]
- Frame-level action control keeps visual predictions tightly aligned with the robot's actions, allowing centimeter-level precision [15][16]
- Pose-conditioned memory retrieval stabilizes long-term predictions, enabling coherent trajectory generation over extended periods [17][18]

Group 3: Experimental Validation
- Experiments on the DROID robot platform showed Ctrl-World outperforming traditional models on multiple metrics, including PSNR, SSIM, and FVD, indicating superior visual fidelity and temporal coherence [20][21]
- The model's ability to adapt to unseen camera layouts showcases its generalization capability [22]
- Virtual evaluations of policy performance closely track real-world outcomes, cutting evaluation time from weeks to hours [24][26]

Group 4: Policy Optimization
- Ctrl-World generates virtual trajectories that improve real-world policy performance, raising the average success rate from 38.7% to 83.4% without consuming physical resources [27][26]
- The optimization pipeline combines virtual exploration, data selection, and supervised fine-tuning, yielding substantial improvements in task success rates across scenarios [26][27]

Group 5: Future Directions
- Ctrl-World still has room for improvement, particularly in adapting to complex physical scenarios and reducing sensitivity to initial observations [28]
- Future plans include integrating video generation with reinforcement learning and expanding the training dataset to improve adaptability to extreme environments [28]
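The core loop behind "evaluation in imagination" is simple: roll a policy out inside the learned dynamics model and score task success, spending no robot time. A minimal sketch of that loop, using toy placeholder classes rather than Ctrl-World's actual API:

```python
# Hedged sketch of policy evaluation inside a learned world model.
# ToyWorldModel, ToyPolicy, and the success check are illustrative
# placeholders, not Ctrl-World interfaces.
import random

class ToyWorldModel:
    """Stand-in dynamics: the state is a scalar 'distance to goal'."""
    def step(self, state: float, action: float) -> float:
        return state - action  # the action moves the robot toward the goal

class ToyPolicy:
    """Stand-in policy: step toward the goal while short of it."""
    def act(self, state: float) -> float:
        return 0.1 if state > 0 else 0.0

def imagined_success_rate(model, policy, n_rollouts=200, horizon=50) -> float:
    """Roll the policy out in the model and score success, with no
    physical robot time consumed."""
    successes = 0
    for _ in range(n_rollouts):
        state = random.uniform(0.5, 2.0)  # sampled initial observation
        for _ in range(horizon):
            state = model.step(state, policy.act(state))
        if abs(state) < 0.05:  # "reached the goal" in imagination
            successes += 1
    return successes / n_rollouts

rate = imagined_success_rate(ToyWorldModel(), ToyPolicy())
```

The real system's contribution is making such imagined rollouts trustworthy over long horizons and multiple views, so that `rate` predicts real-world success, which is exactly what the reported weeks-to-hours evaluation speedup depends on.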
A model that lets robots learn the world in "imagination" is here! Jointly produced by a PI co-founder's lab and Tsinghua's Chen Jianyu team
量子位· 2025-10-30 08:39
Core Insights
- The article discusses Ctrl-World, a controllable generative world model for robot manipulation developed in collaboration between Stanford University and Tsinghua University, which significantly enhances robot task performance in simulated environments [4][12]

Group 1: Model Overview
- Ctrl-World lets robots perform task simulations, policy evaluations, and self-iteration in an "imagination space" [5]
- Using zero real-machine data, the model improves instruction-following success rates from 38.7% to 83.4%, an average improvement of 44.7 percentage points [5][49]
- The related paper, "CTRL-WORLD: A CONTROLLABLE GENERATIVE WORLD MODEL FOR ROBOT MANIPULATION," has been published on arXiv [5]

Group 2: Challenges Addressed
- The model addresses two main challenges in robot training: costly, inefficient policy evaluation and the inadequacy of real-world data for policy iteration [7][9]
- Traditional methods require extensive real-world testing, which is costly and time-consuming and often leads to mechanical failures and high operational costs [8][9]
- Existing models struggle in open-world scenarios, particularly in active interaction with advanced policies [10]

Group 3: Innovations in Ctrl-World
- Ctrl-World introduces three key innovations: multi-view joint prediction, frame-level action control, and pose-conditioned memory retrieval [13][20]
- Multi-view joint prediction reduces hallucination rates by combining third-person and wrist views, enhancing the accuracy of future trajectory generation [16][23]
- Frame-level action control establishes a strong causal relationship between actions and visual outcomes, allowing centimeter-level precision in simulations [24][29]
- Pose-conditioned memory retrieval ensures long-term consistency in simulations, maintaining coherence over extended periods [31][36]

Group 4: Experimental Validation
- Experiments on the DROID robot platform showed Ctrl-World outperforming traditional models in generation quality, evaluation accuracy, and policy optimization [38][39]
- Virtual performance metrics correlated strongly with real-world outcomes, with a correlation coefficient of 0.87 for instruction-following rates [41][44]
- Adaptation to unseen camera layouts and coherent multi-view trajectory generation showcase the model's generalization capability [39]

Group 5: Future Directions
- Despite its successes, Ctrl-World has room for improvement, particularly in adapting to complex physical scenarios and reducing sensitivity to initial observations [51][52]
- Future plans include integrating video generation with reinforcement learning for autonomous exploration of optimal policies and expanding the training dataset to include more complex environments [53]
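A correlation coefficient like the reported 0.87 is the standard Pearson statistic over paired (virtual, real) success rates, one pair per evaluated policy. A self-contained sketch with illustrative numbers (the rates below are made up, not the paper's data):

```python
# Hedged sketch: computing a virtual-vs-real Pearson correlation.
# The paired per-policy success rates are illustrative placeholders.

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

virtual = [0.20, 0.35, 0.50, 0.65, 0.80]  # per-policy rates in the world model
real    = [0.25, 0.30, 0.55, 0.60, 0.85]  # same policies on the real robot

r = pearson(virtual, real)  # near 1.0: virtual evaluation tracks reality
```

A high `r` is what licenses replacing weeks of physical evaluation with hours of imagined rollouts: rank policies in the model, deploy only the winners.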
Unitree goes left, Zhiyuan goes right, and Leju gathers momentum
Robot猎场备忘录· 2025-10-30 03:02
Core Viewpoint
- The article discusses recent developments in China's humanoid robot industry, focusing on three leading companies: Leju Robotics, Unitree, and Zhiyuan Robotics, highlighting their IPO progress and differing strategies in technology, ecosystem, and commercialization [2][15]

Technology Route
- Domestic humanoid robot startups fall into two main camps: a "hardware faction" represented by Unitree, emphasizing motion capability, and a "software faction" represented by Zhiyuan and Galaxy General, focusing on strong AI capabilities [5]
- Leju Robotics, one of the earliest developers in the humanoid robot sector, has achieved full-stack technology capability covering both hardware and software [5][6]

Ecosystem Strategy
- Leju Robotics takes a more cautious, pragmatic approach than Zhiyuan Robotics, which employs an aggressive internet-style operating model [7]
- Leju has invested in various companies to create a collaborative innovation ecosystem, strengthening its technological barriers [7]
- The company has formed strategic partnerships with leading manufacturers to ensure stable supply chains and cost control [7][14]

Commercialization Path
- The article sizes the humanoid robot market, noting that the ToC (consumer) market is larger than the ToB (business) and ToG (government) markets, with varying degrees of implementation difficulty [8]
- Leju Robotics has deployed its humanoid robot "Kua Fu" in industrial manufacturing, commercial services, and academic research, showcasing its versatility [12][15]

Competitive Landscape
- Leju, Unitree, and Zhiyuan represent different development paths in the humanoid robot sector, with Unitree leveraging its first-mover advantage and price competitiveness in the research market [15]
- Zhiyuan Robotics has built a comprehensive product lineup and is pursuing varied commercialization scenarios, while Leju emphasizes a more pragmatic route to market [15]
The latest survey of world models in embodied AI: 250 papers mapping the mainstream frameworks and tasks
具身智能之心· 2025-10-30 00:03
Core Insights
- The article discusses world models in embodied AI, emphasizing their role as internal simulators that help agents perceive environments, take actions, and predict future states [1][2]

Group 1: World Models Overview
- Research on world models has grown at an unprecedented pace with the explosion of generative models, producing a complex array of architectures and techniques that lack a unified framework [2]
- A novel three-axis taxonomy is proposed to categorize existing world models by functionality, temporal modeling, and spatial representation [6]

Group 2: Mathematical Principles
- World models are typically formulated as partially observable Markov decision processes (POMDPs), focusing on learning compact latent states from partial observations and the transition dynamics between states [4]
- Training commonly follows a "reconstruction-regularization" paradigm: the model reconstructs observations from latent states while aligning posterior inference with prior predictions [9]

Group 3: Functional Positioning
- World models divide into decision-coupled and general-purpose types: the former are optimized for specific decision tasks, the latter serve as task-agnostic simulators [6][15][16]
- Decision-coupled models such as the Dreamer series excel at task performance but may generalize poorly due to task-specific representations [15]
- General-purpose models aim for broader predictive capability and transferability across tasks, though they face challenges in computational complexity and real-time inference [16]

Group 4: Temporal Modeling
- Temporal modeling splits into sequential reasoning and global prediction: the former simulates step by step, the latter predicts entire future sequences in parallel [20][23]
- Sequential reasoning suits closed-loop control but can accumulate error over long predictions [20]
- Global prediction improves computational efficiency and reduces error accumulation but may miss detailed local dynamics [23]

Group 5: Spatial Representation
- Strategies for spatial representation include global latent vectors, token feature sequences, spatial latent grids, and decomposed rendering representations [25][28][34][35]
- Global latent vectors compress scene state into low-dimensional variables, enabling real-time control but potentially losing fine-grained spatial information [28]
- Token feature sequences represent complex scenes in detail but demand extensive data and compute [29]
- Spatial latent grids preserve local topology and are prevalent in autonomous driving, while decomposed rendering supports high-fidelity image generation but struggles with dynamic scenes [34][35]

Group 6: Data Resources and Evaluation Metrics
- Data resources for embodied AI span simulation platforms, interactive benchmarks, offline datasets, and real robot platforms, each serving distinct roles in training and evaluating world models [37]
- Evaluation metrics cover pixel-level generation quality, state/semantic consistency, and task performance, with recent trends emphasizing physical compliance and causal consistency [40]
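The "reconstruction-regularization" training paradigm described above is usually written as a per-step variational objective in the Dreamer/RSSM style. The notation below is assumed for illustration, not taken from the survey: $o_t$ is the observation, $a_t$ the action, $z_t$ the latent state, $q_\phi$ the posterior encoder, and $p_\theta$ the decoder and prior dynamics:

```latex
\mathcal{L}_t \;=\; \mathbb{E}_{q_\phi(z_t \mid o_{\le t},\, a_{<t})}
  \Big[ \underbrace{-\log p_\theta(o_t \mid z_t)}_{\text{reconstruction}} \Big]
  \;+\; \beta\,
  \underbrace{D_{\mathrm{KL}}\!\Big( q_\phi(z_t \mid o_{\le t},\, a_{<t})
    \,\Big\|\, p_\theta(z_t \mid z_{t-1},\, a_{t-1}) \Big)}_{\text{regularization: posterior vs.\ prior dynamics}}
```

The reconstruction term forces the latent state to carry enough information to explain the current observation, while the KL term pulls the inferred posterior toward what the learned transition prior predicts, so the model can later be rolled forward from the prior alone, without observations.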
New Alibaba research unifies VLA and world models
36Kr· 2025-10-29 10:32
Core Insights
- WorldVLA is a unified framework integrating vision-language-action (VLA) models with world models, developed collaboratively by Alibaba DAMO Academy, Hupan Lab, and Zhejiang University [1][4]

Group 1: Framework Overview
- The world model predicts future images from actions and images, aiming to learn the underlying physical laws of the environment and thereby improve action generation accuracy [2]
- The action model generates subsequent actions from image observations, which aids visual understanding and in turn enhances the world model's visual generation capability [2]
- Experimental results show WorldVLA significantly outperforms independent action and world models, demonstrating mutual enhancement between the two [2][12]

Group 2: Model Architecture
- WorldVLA uses three independent tokenizers for encoding images, text, and actions, initialized from the Chameleon model [6]
- The image tokenizer is a VQ-GAN with a compression ratio of 16 and a codebook of 8192 entries, producing 256 tokens for 256×256 images and 1024 tokens for 512×512 images [6]
- The action tokenizer discretizes continuous robot actions into 256 intervals, representing each action with 7 tokens covering relative positions and angles [6]

Group 3: Training and Performance
- WorldVLA trains autoregressively: all text, action, and image tokens are modeled causally [8]
- A novel attention mask for action generation ensures the current action depends only on text and visual inputs, preventing errors in previous actions from affecting subsequent ones [10]
- Benchmarks show that even without pre-training, WorldVLA outperforms the discrete OpenVLA model, validating its architectural design [12]

Group 4: Mutual Benefits of the Models
- The world model significantly boosts the action model by teaching it the system's underlying physical laws, which is crucial for tasks requiring precision [15]
- Its predictive capability informs decision-making, optimizing action selection strategies and improving task success rates [18]
- Conversely, the action model improves the quality of the world model's output, particularly in generating longer video sequences [21]

Group 5: Expert Opinions
- Chen Long, Senior Research Director at Xiaomi Auto, argues that VLA and world models need not be mutually exclusive; combining them lets each reinforce the other, advancing embodied intelligence toward AGI [24]
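The action tokenizer described above is a per-dimension uniform binning: each 7-dimensional continuous action becomes 7 integer tokens, one per dimension, with 256 bins each. The bin layout and the [-1, 1] action bounds below are assumptions for illustration; the article states only the bin count (256) and token count (7):

```python
# Hedged sketch of WorldVLA-style action tokenization. Bounds and bin
# layout are assumptions; only the 256-bin / 7-token shape is sourced.
import numpy as np

NUM_BINS = 256

def tokenize_action(action: np.ndarray, low: np.ndarray, high: np.ndarray) -> np.ndarray:
    """Map a continuous 7-dim action into 7 integer tokens in [0, 255]."""
    norm = (action - low) / (high - low)          # scale each dim to [0, 1]
    tokens = np.floor(norm * NUM_BINS).astype(int)
    return np.clip(tokens, 0, NUM_BINS - 1)      # guard the upper edge

def detokenize_action(tokens: np.ndarray, low: np.ndarray, high: np.ndarray) -> np.ndarray:
    """Recover an approximate action from bin centers."""
    centers = (tokens + 0.5) / NUM_BINS
    return low + centers * (high - low)

low, high = np.full(7, -1.0), np.full(7, 1.0)    # assumed action bounds
a = np.array([0.0, 0.5, -0.5, 0.25, -0.25, 1.0, -1.0])
toks = tokenize_action(a, low, high)             # 7 tokens, e.g. 0.0 -> bin 128
recon = detokenize_action(toks, low, high)       # within one bin width of a
```

Discretizing this way lets actions share the same next-token prediction machinery as image and text tokens, at the cost of a bounded quantization error (half a bin width per dimension).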
New Alibaba research unifies VLA and world models
量子位· 2025-10-29 09:30
Core Insights
- WorldVLA is a unified framework integrating vision-language-action (VLA) models with world models, proposed by Alibaba DAMO Academy, Hupan Lab, and Zhejiang University [1][4]
- Experimental results show WorldVLA significantly outperforms independent action models and world models, demonstrating a mutual enhancement effect [2]

Model Overview
- The framework combines three independent tokenizers for encoding images, text, and actions, using a VQ-GAN for image tokenization with a compression ratio of 16 and a codebook of 8192 entries [8]
- The action tokenizer discretizes continuous robot actions into 256 intervals, representing each action with 7 tokens [8]

Model Design
- WorldVLA employs an autoregressive action world model to unify action and image understanding and generation [4]
- It addresses limitations of existing VLA and world models by grounding action generation in an understanding of environmental physics [5][14]

Training and Performance
- WorldVLA is jointly trained on data from both action models and world models, strengthening its action generation capability [13]
- Performance correlates positively with image resolution: 512×512 shows significant improvements over 256×256 [21][23]

Benchmark Results
- WorldVLA outperforms discrete OpenVLA models even without pre-training, validating its architectural design [19]
- The model generates coherent, physically plausible states across varied scenarios, outperforming pure world models [31][32]

Mutual Enhancement
- The world model improves the action model by predicting how the environment changes under the current action, crucial for tasks requiring precision [25]
- Conversely, the action model improves the world model's visual understanding, supporting better visual generation [17][30]
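The attention mask that keeps action errors from propagating can be sketched as a standard causal mask with one extra rule: action tokens may not attend to earlier action tokens. The token layout below is an illustrative assumption, not WorldVLA's exact sequence format:

```python
# Hedged sketch of an action-generation attention mask: causal overall,
# but action tokens are blocked from attending to past action tokens,
# so mistakes in earlier actions cannot contaminate later ones.
import numpy as np

def action_attention_mask(kinds: list) -> np.ndarray:
    """kinds[i] in {'text', 'image', 'action'}; mask[i, j] = True means
    token i may attend to token j (causal: only j <= i is ever allowed)."""
    n = len(kinds)
    mask = np.tril(np.ones((n, n), dtype=bool))  # standard causal mask
    for i, ki in enumerate(kinds):
        if ki == 'action':
            for j in range(i):                   # strictly earlier tokens
                if kinds[j] == 'action':
                    mask[i, j] = False           # block action -> past action
    return mask

kinds = ['text', 'text', 'image', 'action', 'action']
m = action_attention_mask(kinds)
# Row 4 (the second action token) sees the text and image tokens and
# itself, but not the first action token at position 3.
```

Text and image tokens keep the ordinary causal pattern; only the action rows are sparsified, which matches the stated goal of conditioning each action solely on text and visual context.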
GigaVision and the Hubei Humanoid Robot Innovation Center to jointly build an embodied intelligence data factory
Sina Finance· 2025-10-28 15:33
Core Insights
- A strategic partnership has been established between GigaVision and the Hubei Humanoid Robot Innovation Center to create a "world-model-driven, virtual-physical integrated embodied intelligence data factory" [1]
- The collaboration includes the launch of GigaBrain-0, a foundational model that uses world-model-generated data for real-machine generalization in vision-language-action (VLA) [1]

Group 1
- The strategic cooperation aims to advance the development of embodied intelligence technologies [1]
- The GigaBrain-0 model represents a significant step in integrating visual, language, and action capabilities [1]
- The partnership highlights the growing trend of combining AI and robotics in industrial applications [1]
World's first world-model embodied intelligence data factory lands in Wuhan
China News Network· 2025-10-28 09:10
Wuhan, Oct. 28 (China News Service, reporter Wu Yili) — The world's first "world-model-driven, virtual-physical integrated embodied intelligence data factory" project was signed on the 28th in Wuhan's East Lake High-tech Zone. Jointly built by the Hubei Humanoid Robot Innovation Center and tech company GigaVision, the factory is intended to become a "super classroom" where humanoid robots learn to handle complex real-world situations autonomously.

A representative of the Hubei Humanoid Robot Innovation Center said the factory will use world-model technology to generate large-scale, diverse synthetic data from high-fidelity world models, building a comprehensive data system for embodied intelligence. The data will support the development of "one brain, many forms" embodied foundation models, empowering robot bodies of different shapes and tasks, and will give robot companies a shared "library." The project is also expected to help Hubei build a globally recognized humanoid robot industry hub.

[Photo: On October 28, a robot tidies tableware at the Hubei Humanoid Robot Innovation Center. Photo by China News Service reporter Wu Yili.]

Ye Yun, algorithm lead at GigaVision, explained that mainstream industrial-arm robots on the market generally rely on standardized programming and can only perform specific actions in specific environments. Making robots smarter requires extensive learning and continual growth, and this factory will supply them with ample "study material."

"A world model, as an advanced technology that can simulate how the physical world works, is like installing an 'imagination engine' in a robot, with no pre-programming required. For example, if a bottle accidentally falls, the robot can sense in real time ...

(Source: China News Network)
Qualcomm launches AI chips to rival NVIDIA; Meituan's rider social-insurance subsidies go live | Tech Barometer
Group 1: Technology Developments
- Meituan launched the LongCat-Video model, with 13.6 billion parameters and capable of generating 5-minute videos, aiming to deepen AI's understanding of the world through video generation tasks [2]
- Qualcomm introduced new AI chips, the AI200 and AI250, designed for data-center AI inference with optimized performance and lower total cost of ownership [11]
- Changjiang Storage announced mass production of fourth-generation DDR5 RCD chips, reaching data transfer rates of up to 7200 MT/s, a 12.5% improvement over the previous generation [12]

Group 2: Business Initiatives
- JD.com initiated its "National Good Car" delivery-center recruitment plan, aiming to create a nationwide sales and service network by integrating various automotive service providers [3]
- Meituan announced nationwide social-insurance subsidies for delivery riders, who can choose where to pay their insurance starting in November [4]
- Yingyi Intelligent Manufacturing secured over 100 assembly orders from leading clients, deepening its collaboration in hardware manufacturing and AI model development [7]

Group 3: Financial Activities
- Junsheng Electronics plans to issue approximately 155 million shares in Hong Kong at a maximum price of HKD 23.60 per share to fund R&D and global expansion [13]
- Eagle Semiconductor completed a B+ round of over 700 million yuan, a record for VCSEL startups in China [15]
- Guoyi Quantum raised 131 million yuan in strategic financing to enhance R&D and market expansion [19]

Group 4: Market Expansion
- Didi launched 500 electric vehicles in Mexico, marking its first standardized ride-hailing service in Latin America [5]
- Hengtong Optic-Electric won marine energy contracts totaling 1.868 billion yuan, including a 1-million-kW offshore wind project [8]
- Zhenyu Technology plans to invest 2.11 billion yuan in precision components and humanoid robot modules to expand its production capabilities [10]