World Models
New Alibaba Research: Unifying VLA and World Models
36Ke· 2025-10-29 10:32
Core Insights
- WorldVLA is a unified framework that integrates Vision-Language-Action (VLA) models with world models, developed jointly by Alibaba DAMO Academy, Hupan Lab, and Zhejiang University [1][4]

Group 1: Framework Overview
- The world model predicts future images from actions and image observations, aiming to learn the underlying physical laws of the environment and thereby improve action generation accuracy [2]
- The action model generates subsequent actions from image observations, which not only aids visual understanding but also strengthens the visual generation capability of the world model [2]
- Experimental results show that WorldVLA significantly outperforms standalone action and world models, demonstrating a mutual enhancement effect between the two [2][12]

Group 2: Model Architecture
- WorldVLA uses three separate tokenizers for images, text, and actions, initialized from the Chameleon model [6]
- The image tokenizer is a VQ-GAN with a compression ratio of 16 and a codebook of 8192 entries, producing 256 tokens for 256×256 images and 1024 tokens for 512×512 images [6]
- The action tokenizer discretizes continuous robot actions into 256 intervals, representing each action with 7 tokens covering relative positions and angles (a sketch of this discretization follows this summary) [6]

Group 3: Training and Performance
- WorldVLA is trained autoregressively: all text, action, and image tokens are modeled in a single causal sequence [8]
- A novel attention mask for action generation ensures that the current action depends only on text and visual inputs, preventing errors in earlier actions from propagating to later ones [10]
- Benchmark results show that even without pre-training, WorldVLA outperforms the discrete OpenVLA model, validating its architectural design [12]

Group 4: Mutual Benefits of the Two Models
- Introducing the world model significantly improves the action model by letting it learn the underlying physical laws of the system, which is crucial for tasks requiring precision [15]
- The world model's predictive capability informs decision-making, optimizing action selection strategies and improving task success rates [18]
- Conversely, the action model improves the quality of the world model's output, particularly when generating longer video sequences [21]

Group 5: Expert Opinions
- Chen Long, Senior Research Director at Xiaomi Auto, emphasizes that VLA and world models need not be mutually exclusive; combining them lets each reinforce the other, driving progress toward embodied intelligence (AGI) [24]
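To make the action tokenization concrete, here is a minimal sketch of discretizing a 7-dimensional continuous robot action into 256 bins per dimension, yielding 7 discrete tokens per action as described above. The per-dimension ranges, the variable names, and the assumption that the 7 dimensions are relative end-effector position, orientation, and gripper state are illustrative; the summary only specifies 256 intervals and 7 tokens.

```python
import numpy as np

# Assumed ranges for a 7-D action (dx, dy, dz, droll, dpitch, dyaw, gripper);
# the article only states that actions are split into 256 bins and 7 tokens.
ACTION_LOW = np.array([-0.05, -0.05, -0.05, -0.5, -0.5, -0.5, 0.0])
ACTION_HIGH = np.array([0.05, 0.05, 0.05, 0.5, 0.5, 0.5, 1.0])
NUM_BINS = 256

def encode_action(action: np.ndarray) -> np.ndarray:
    """Map a continuous 7-D action to 7 discrete token ids in [0, NUM_BINS - 1]."""
    normalized = (action - ACTION_LOW) / (ACTION_HIGH - ACTION_LOW)
    return np.clip((normalized * NUM_BINS).astype(int), 0, NUM_BINS - 1)

def decode_action(tokens: np.ndarray) -> np.ndarray:
    """Recover an approximate continuous action from the bin centers."""
    centers = (tokens + 0.5) / NUM_BINS
    return ACTION_LOW + centers * (ACTION_HIGH - ACTION_LOW)

if __name__ == "__main__":
    a = np.array([0.01, -0.02, 0.0, 0.1, 0.0, -0.2, 1.0])
    tok = encode_action(a)
    print(tok)                 # 7 token ids, e.g. [153  76 128 153 128  76 255]
    print(decode_action(tok))  # close to the original continuous action
```

In an autoregressive model of this kind, the 7 token ids would simply be appended to the text and image token sequence and predicted causally; the round-trip decode shows the quantization error introduced by 256 bins.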
New Alibaba Research: Unifying VLA and World Models
量子位· 2025-10-29 09:30
Core Insights
- WorldVLA is a unified framework that integrates Vision-Language-Action (VLA) models with world models, proposed by Alibaba DAMO Academy, Hupan Lab, and Zhejiang University [1][4]
- Experimental results indicate that WorldVLA significantly outperforms standalone action models and world models, showcasing a mutual enhancement effect [2]

Model Overview
- The framework combines three independent tokenizers for encoding images, text, and actions, using a VQ-GAN model for image tokenization with a compression ratio of 16 and a codebook size of 8192 [8]
- The action tokenizer discretizes continuous robot actions into 256 intervals, representing each action with 7 tokens [8]

Model Design
- WorldVLA employs an autoregressive action world model that unifies understanding and generation of both actions and images [4]
- The model addresses limitations of existing VLA and world models by grounding action generation in an understanding of environmental physics, improving accuracy [5][14]

Training and Performance
- WorldVLA is trained jointly on data from both the action model and the world model, strengthening its action generation capability [13]
- The model's performance correlates positively with image resolution: 512×512 inputs show significant improvements over 256×256 [21][23]

Benchmark Results
- WorldVLA outperforms the discrete OpenVLA model even without pre-training, validating its architectural design [19]
- The model generates coherent and physically plausible states across various scenarios, outperforming pure world models [31][32]

Mutual Enhancement
- The world model improves the action model by predicting how the environment changes in response to the current action, which is crucial for tasks requiring precision [25]
- Conversely, the action model improves the world model's visual understanding, supporting better visual generation [17][30]
GigaVision and the Hubei Humanoid Robot Innovation Center to Jointly Build an Embodied Intelligence Data Factory
Xin Lang Cai Jing· 2025-10-28 15:33
Core Insights
- GigaVision and the Hubei Humanoid Robot Innovation Center have established a strategic partnership to build a "world-model-driven, virtual-physical integrated embodied intelligence data factory" [1]
- The collaboration includes the launch of GigaBrain-0, a foundation model that uses world-model-generated data to achieve real-robot generalization for vision-language-action (VLA) tasks [1]

Group 1
- The strategic cooperation aims to accelerate the development of embodied intelligence technologies [1]
- The GigaBrain-0 model represents a significant step in integrating vision, language, and action capabilities [1]
- The partnership reflects the growing trend of combining AI and robotics in industrial applications [1]
World's First World-Model Embodied Intelligence Data Factory Lands in Wuhan
Zhong Guo Xin Wen Wang· 2025-10-28 09:10
China News Network, Wuhan, October 28 (Reporter Wu Yili): The world's first "world-model-driven, virtual-physical integrated embodied intelligence data factory" project was signed on October 28 in Wuhan's East Lake High-Tech Development Zone. The factory will be jointly built by the Hubei Humanoid Robot Innovation Center and the technology company GigaVision, and is intended to become a "super classroom" where humanoid robots learn to handle complex real-world situations autonomously.

Ye Yun, head of algorithms at GigaVision, explained that today's mainstream industrial-arm robots generally rely on standardized programming systems and can only perform specific actions in specific environments. For robots to become smarter, they need large amounts of learning and continuous growth, and this factory can provide them with ample "learning material."

"As an advanced technology capable of simulating how the physical world operates, the world model is like installing an 'imagination engine' in a robot, with no need for pre-programming. For example, if a bottle accidentally falls, the robot can sense it in real time ..."

A representative of the Hubei Humanoid Robot Innovation Center said the factory will use world-model technology to generate large-scale, diverse synthetic data from high-fidelity world models, building a comprehensive data system for embodied intelligence. These data will support the development of "one brain, many bodies" embodied foundation models, empower robot bodies of different forms and for different tasks, and give robotics companies a shared "data library." The factory is also expected to help Hubei build a globally recognized humanoid robotics industry hub.

[Photo: On October 28, a robot tidies tableware at the Hubei Humanoid Robot Innovation Center. Photo by China News Service reporter Wu Yili]

(Source: China News Network)
Qualcomm Launches AI Chips to Compete with NVIDIA; Meituan's Social Insurance Subsidies for Riders Go Live | Tech Trends
21 Shi Ji Jing Ji Bao Dao· 2025-10-28 03:49
Group 1: Technology Developments
- Meituan released the LongCat-Video model, a 13.6-billion-parameter model capable of generating 5-minute videos, aiming to deepen AI's understanding of the world through video generation tasks [2]
- Qualcomm introduced new AI chips, the AI200 and AI250, designed for data center AI inference with optimized performance and lower total cost of ownership [11]
- Changjiang Storage announced mass production of fourth-generation DDR5 RCD chips, reaching data transfer rates of up to 7200 MT/s, a 12.5% improvement over the previous generation [12]

Group 2: Business Initiatives
- JD.com launched the "National Good Car" delivery center recruitment plan, aiming to build a nationwide sales and service network by integrating various automotive service providers [3]
- Meituan announced nationwide social insurance subsidies for delivery riders, who will be able to choose where to pay their insurance starting in November [4]
- Yingyi Intelligent Manufacturing secured over 100 assembly orders from leading clients, deepening its collaboration in hardware manufacturing and AI model development [7]

Group 3: Financial Activities
- Junsheng Electronics plans to issue approximately 155 million shares in Hong Kong at a maximum price of HKD 23.60 per share to fund R&D and global expansion [13]
- Eagle Semiconductor completed a B+ round of financing exceeding 700 million yuan, a record for VCSEL startups in China [15]
- Guoyi Quantum raised 131 million yuan in strategic financing to strengthen R&D and market expansion [19]

Group 4: Market Expansion
- Didi launched 500 electric vehicles in Mexico, its first standardized ride-hailing service in Latin America [5]
- Hengtong Optic-Electric won marine energy project contracts totaling 1.868 billion yuan, including a 1-million-kW offshore wind project [8]
- Zhenyu Technology plans to invest 2.11 billion yuan in precision components and humanoid robot modules to expand its production capabilities [10]
World Model == VQA? Robots Don't Need to Imagine Images; Predicting Semantics Is Enough
机器之心· 2025-10-28 00:41
Core Insights
- The article questions whether world models for AI really need precise, pixel-level predictions of the future, and whether detailed visual representations are essential for decision-making [1][6]
- It introduces the Semantic World Model (SWM), which predicts semantic information about future outcomes instead of generating visual frames [9][18]

Summary by Sections

World Models and Their Limitations
- World models let AI learn the dynamics of the world and predict future events from the current state [6]
- Traditional models often generate realistic images but can miss the semantic details that actually matter for decision-making [7][8]

Semantic World Model (SWM)
- SWM recasts world modeling as a visual question-answering (VQA) problem, focusing on task-relevant interactions rather than raw visual prediction (see the sketch after this summary) [8][9]
- SWM uses a vision-language model (VLM) to answer questions about future actions and their semantic effects [9][11]

Training and Data Generation
- SWM can be trained on low-quality sequence data, including both expert and non-expert trajectories, making it versatile [15]
- A dataset called SAQA (State-Action-Question-Answer) is generated to train the model [22]

Experimental Results
- SWM answers questions about future outcomes with high accuracy and generalizes to new scenarios [17]
- In multi-task simulations, SWM significantly outperforms baseline models, reaching success rates of 81.6% on LangTable and 76% on OGBench [30][34]

Generalization and Robustness
- SWM retains the generalization ability of the underlying VLM, improving performance even under new object combinations and background changes [39][41]
- The model's attention concentrates on task-relevant information, indicating that it generalizes across different scenarios [41]
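To illustrate the VQA formulation, here is a minimal sketch of how a semantic world model query might be posed: instead of rendering future frames, the model answers a natural-language question about the outcome of a planned action sequence. The SAQA field names, the prompt format, and the `generate` interface are assumptions for illustration, not the paper's actual API.

```python
from dataclasses import dataclass
from typing import List, Sequence

@dataclass
class SAQAExample:
    """One State-Action-Question-Answer tuple, mirroring the SAQA dataset idea."""
    observation: str            # placeholder for the current image observation
    actions: List[List[float]]  # planned future action sequence
    question: str               # semantic question about the resulting state
    answer: str                 # ground-truth answer used for training

class StubVLM:
    """Stand-in for a vision-language model; returns a canned answer."""
    def generate(self, observation: str, prompt: str) -> str:
        return "yes"

def query_semantic_world_model(vlm, observation, actions, question) -> str:
    """Ask the VLM what will be true after executing `actions` from `observation`."""
    prompt = (f"Given the current observation and the planned actions {actions}, "
              f"answer this question about the resulting state: {question}")
    return vlm.generate(observation=observation, prompt=prompt)

def select_action(vlm, observation, candidates: Sequence, question: str, desired: str):
    """Toy planner: pick the first candidate whose predicted outcome matches the goal."""
    for actions in candidates:
        if query_semantic_world_model(vlm, observation, actions, question).strip() == desired:
            return actions
    return None

if __name__ == "__main__":
    vlm = StubVLM()
    plan = select_action(vlm, "cam_frame_0", [[[0.1, 0.0]], [[0.0, 0.1]]],
                         "Is the red block on the blue plate?", "yes")
    print(plan)
```

The point of the formulation is visible in `select_action`: candidate action sequences are scored by whether their predicted semantic outcome matches the goal, so no future images ever need to be generated.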
Zheng Zhihua Apologizes for His "Crawling and Rolling" Remark; Spring Airlines Recruits Married Mothers as "Air Sisters"; Zong Fuli's Confidante Zhu Lidan Resigns; Anhui Becomes the Top Auto-Producing Province; A Changan Automobile 4S Store Catches Fire | Bang Morning Report
创业邦· 2025-10-28 00:10
Group 1
- Zhu Lidan, the legal representative of Hongsheng Group, which is controlled by Zong Fuli, has left the company; her office has been reassigned to Kou Jing [3][4]
- Zhu Lidan was a core member of Hongsheng Group and had a long-standing working relationship with Zong Fuli [3]
- Reports say Zhu Lidan was summoned by relevant authorities twice since September, and her position had previously been marked as "to be determined" [4]

Group 2
- Changan Automobile confirmed a fire at a 4S store in Anhui but has not provided information on the cause [6]
- Meituan announced a nationwide rollout of pension insurance subsidies for delivery riders starting in November, the first such scheme open to all riders [12][13]
- Spring Airlines launched a recruitment campaign for "air sisters," targeting married women with children and raising the age limit to 40 [13]

Group 3
- JD.com has been granted an insurance brokerage license in Hong Kong, marking its entry into that financial market [13]
- Tesla's board chair warned that if Elon Musk's $1 trillion compensation plan is not approved, the company could face significant value loss [13]
- Education company Gaotu is under investigation for allegedly organizing illegal offline subject tutoring in Beijing [13]

Group 4
- Amazon plans to invest over EUR 1.4 billion in the Netherlands over the next three years, its largest commitment since entering that market [14]
- Porsche responded to reports of multiple gasoline-vehicle discontinuations, clarifying that the fuel version of the Macan is not affected [15]
- AI startup Mercor raised $350 million at a valuation of $10 billion, with participation from notable investors [15][16]

Group 5
- Global mobile game in-app purchase revenue is expected to grow 6% to $85.4 billion by 2025 [20]
- China is projected to discard over 400 million mobile phones annually, with low recycling prices and privacy concerns hindering recovery efforts [20]
- Anhui has become the top province for automobile production, and 15 provinces are expected to produce over one million vehicles this year [20]
Which Autonomous Driving Directions Are Still Worth Pursuing at This Year's CVPR?
自动驾驶之心· 2025-10-28 00:03
Core Viewpoint
- The article emphasizes the importance of targeted guidance and mentorship for students aiming to publish high-quality papers in top conferences such as CVPR and ICLR, highlighting the need for strategic effort in the final stages of the submission process [1][2][4]

Group 1: Submission Guidance
- Most accepted papers at past conferences focus on localized breakthroughs and verifiable improvements that align closely with each year's main themes [1]
- The article suggests that the main theme for CVPR 2026 is likely to be "world models," indicating a strategic direction for potential submissions [1]
- Students are encouraged to draw on the experience of predecessors to improve submission quality, particularly during final preparation [2]

Group 2: Mentorship and Support
- The organization, "Automated Driving Heart," is described as the largest AI technology media platform in China, with extensive academic resources and a deep understanding of the challenges in interdisciplinary fields such as autonomous driving and robotics [3]
- The mentorship program reports a 96% acceptance rate for students over the past three years, indicating the effectiveness of its guidance [5]
- Personalized support includes help with research thinking, familiarization with the research process, and practical application of theoretical models [7][13]

Group 3: Program Structure and Offerings
- Structured support includes personalized paper guidance, real-time interaction with mentors, and unlimited access to recorded sessions for review [13]
- The program caters to various academic levels and goals, from foundational courses for beginners to advanced mentorship for experienced researchers [17][19]
- Outstanding students may receive recommendations to prestigious institutions and direct referrals to leading tech companies [19]
TeraSim World: Rebuilding a "Tesla-Style" World Model with Open Source
自动驾驶之心· 2025-10-28 00:03
Core Viewpoint
- Tesla has showcased its internal World Model, a neural-network-driven virtual world generator that synthesizes high-resolution videos from eight camera perspectives based on vehicle states and control inputs, enabling real-time environmental prediction and closed-loop validation [2][6]

Group 1: Tesla's World Model
- Tesla's World Model can replay historical problem scenarios and inject new adversarial events in a virtual environment for testing and reinforcement learning [2]
- The model learns a general mapping of "perception-action-world change," making it applicable to other platforms such as robots and forming a basis for general physical intelligence [2]

Group 2: TeraSim World Framework
- A research team from the University of Michigan, SaferDrive AI, the University of Hong Kong, and Tsinghua University has developed TeraSim World, an open-source framework that achieves generation and evaluation capabilities similar to Tesla's World Model without requiring real maps or sensor backgrounds [5][6]
- TeraSim World automatically generates city environments and traffic behaviors with AI, creating a fully data-driven, reproducible, and scalable world model platform [5]

Group 3: System Features
- TeraSim World features a modular, fully automated data synthesis pipeline for generating realistic and safety-critical data for end-to-end autonomous driving (see the sketch after this summary) [7]
- The system retrieves real-world road maps and converts them into simulation-ready formats, automatically generating digital maps from user input [10][11]
- It simulates realistic traffic conditions by automatically obtaining real-time traffic data that reflects local traffic patterns [13]

Group 4: Agent and Sensor Simulation
- The agent simulation component makes virtual vehicles, pedestrians, and cyclists behave like their real-world counterparts, incorporating human driving characteristics [16]
- TeraSim World introduces safety-critical scenarios based on real-world accident probabilities, ensuring the generated events are both risky and realistic [17]
- The sensor simulation generates realistic camera inputs and can be extended to other sensor types, using NVIDIA's open-source Cosmos models for high-resolution, time-synchronized multi-view video generation [19][22][25]

Group 5: Automated Stress Testing
- TeraSim World supports automated full-stack stress testing, generating and validating diverse risk scenarios to assess the stability and safety boundaries of autonomous driving systems [30]
- The framework can inject dynamic and static risks, such as sudden stops or environmental changes, to evaluate system responses under varied conditions [30]

Group 6: Conclusion and Future Plans
- TeraSim World combines agent and sensor simulation to provide a complete data generation pipeline for training and testing autonomous driving systems without real-world data collection [31]
- The team plans to build a large-scale synthetic driving dataset and expand to multi-modal sensor simulation, establishing an open virtual proving ground for researchers and developers [32]
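The summary above describes TeraSim World's pipeline at a high level: fetch a real-world map, populate it with traffic agents, inject safety-critical events, and render sensor data. The sketch below only mirrors that flow with stub functions; the stage names, signatures, and data structures are assumptions for illustration and are not the framework's actual API.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Scenario:
    """Container for one synthetic driving scenario (illustrative only)."""
    location: str
    road_map: Dict = field(default_factory=dict)
    agents: List[Dict] = field(default_factory=list)
    events: List[Dict] = field(default_factory=list)
    sensor_frames: List[bytes] = field(default_factory=list)

def fetch_road_map(location: str) -> Dict:
    """Stand-in for retrieving a real-world road network and converting it
    into a simulation-ready map for the given location."""
    return {"location": location, "lanes": []}

def populate_agents(road_map: Dict) -> List[Dict]:
    """Stand-in for traffic/agent simulation: vehicles, pedestrians, cyclists."""
    return [{"type": "vehicle", "behavior": "human-like"}]

def inject_safety_critical_event(agents: List[Dict]) -> Dict:
    """Stand-in for adversarial event injection (e.g. a sudden hard brake ahead)."""
    return {"type": "hard_brake", "trigger_time_s": 3.0}

def render_sensors(road_map: Dict, agents: List[Dict], events: List[Dict]) -> List[bytes]:
    """Stand-in for multi-view video generation (e.g. via a Cosmos-style model)."""
    return [b"frame_0", b"frame_1"]

def build_scenario(location: str) -> Scenario:
    """Run the four stages in order and collect their outputs."""
    s = Scenario(location=location)
    s.road_map = fetch_road_map(location)
    s.agents = populate_agents(s.road_map)
    s.events = [inject_safety_critical_event(s.agents)]
    s.sensor_frames = render_sensors(s.road_map, s.agents, s.events)
    return s

if __name__ == "__main__":
    print(build_scenario("Ann Arbor, MI"))
```

The value of such a modular layout is that each stage can be swapped independently, for example replacing the map stub with a real map retriever or the sensor stub with a generative video model, without changing the rest of the pipeline.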
Efficiency Law: A New Paradigm for Embodied Intelligence Learning Driven by a World Model Engine
具身智能之心· 2025-10-28 00:02
Core Insights
- The article emphasizes the importance of addressing data generation issues in embodied intelligence, arguing that previously overlooked data problems are fundamental to successfully deploying the technology [2][5]

Group 1: Efficiency Law and Scaling Law
- The article introduces an "Efficiency Law," motivated by the limitations of the Scaling Law in embodied intelligence: the performance of embodied models is strongly influenced by the rate of high-quality data generation (r_D) within a limited time budget (an illustrative formalization follows this summary) [5][6]
- A higher data generation rate r_D improves learning efficiency, while a lower rate traps the model in a "data scarcity zone" that limits performance [6][20]

Group 2: World Models and Physical Accuracy
- Embodied intelligence requires world models with strict physical accuracy, since agents must understand real-world physics to execute actions effectively; models must obey physical laws to support reliable learning and decision-making [9][12]
- Current video-based world models are criticized for lacking physical correctness, as they prioritize visual realism over accurately simulating physical dynamics [8][12]

Group 3: GS-World and Its Applications
- The GS-World model integrates generative models with physics simulation engines, enabling the generation of physically accurate environments and interactions and addressing the shortcomings of purely video-based models [11][13]
- GS-World is positioned as a transformative engine for embodied intelligence, autonomously generating training data and enabling high-fidelity policy validation in simulated environments [15][20]

Group 4: Engine-Driven Learning Paradigm
- The article describes a shift from data-driven to engine-driven learning in embodied intelligence, in which the GS-World engine supports continuous interaction and feedback, creating a self-evolving learning system [24][25]
- The new paradigm emphasizes generating and simulating physical worlds so that agents learn and adapt through real-time interaction rather than relying solely on historical data [24][28]

Group 5: Robustness and Generalization
- Embodied intelligence systems must reach product-level success rates and robustness against environmental disturbances; the engine-driven learning paradigm is considered essential for building reliable, trustworthy intelligent products [27][29]
- GS-World is described as a critical platform for evolving robotic skills, allowing skills to emerge naturally through interaction within a physically accurate simulated environment [31][32]
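The article does not give an explicit formula for the Efficiency Law; the following is only an illustrative formalization of the claim above, under the assumption that usable data accumulates linearly with the generation rate over a fixed time budget.

```latex
% Illustrative formalization only; not an equation from the source article.
% T      : fixed wall-clock training/data-collection budget
% r_D    : rate at which high-quality embodied data is generated
% D(T)   : high-quality data accumulated within the budget
% f      : monotone, saturating performance curve (Scaling-Law-like)
\[
  D(T) = r_D \cdot T, \qquad
  \mathrm{Perf} \approx f\!\big(D(T)\big), \qquad
  r_D < r_{\min} \;\Longrightarrow\; \text{``data scarcity zone''}.
\]
```

Read this way, raising r_D (for example, by generating synthetic interactions in a world-model engine rather than collecting only real-robot data) shifts the model out of the scarcity regime within the same time budget, which is the article's central argument for engine-driven learning.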