Video World Models
Saining Xie Plays Minecraft Too? A New Open-Source World Model Generates Consistent Multiplayer Game Views
机器之心· 2026-03-07 04:20
Core Insights
- The article examines the role video games play in advancing AI, particularly in training models to understand physical interaction and in building world models [1]
- A research team led by Saining Xie is exploring new directions for world models using the game Minecraft [3]

Group 1: Video World Model Development
- Solaris is the first multiplayer video world model, generating consistent first-person views for multiple players simultaneously [5]
- The team observed that existing video world models handle only a single player's perspective, which fails to capture the inherently multi-agent nature of real-world interaction [7]
- SolarisEngine, a custom-built data collection system, supports coordinated multi-agent interaction and visual capture in games like Minecraft [7][14]

Group 2: Data Collection and Model Training
- The team collected 12.6 million frames across 9,240 task rounds, covering tasks such as building, combat, and exploration [16]
- It is the first dataset of its kind with action annotations suitable for training world models [17]
- The model combines flow matching with diffusion forcing to predict future observations conditioned on the players' action histories (a minimal sketch follows this summary) [19]

Group 3: Model Architecture and Improvements
- The architecture adds an expanded action space and multi-player self-attention layers that let information flow between players' views (see the attention sketch below) [20][22]
- These changes allow the model to generalize to any number of players, even though it has so far been trained only on two-player data [20]

Group 4: Evaluation Metrics and Results
- The Solaris Eval dataset tests five collaborative abilities: movement, positioning, consistency, memory, and building [24][28]
- Solaris outperforms previous methods in both visual quality and quantitative metrics across the evaluation categories [27][29]
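The summary only names the training objective, so here is a minimal sketch of how flow matching can be combined with diffusion-forcing-style per-frame noise levels and action conditioning. The `model` signature, tensor shapes, and loss are assumptions for illustration, not Solaris's actual implementation:

```python
import torch

def flow_matching_diffusion_forcing_step(model, frames, actions):
    """One training step: flow matching with per-frame noise levels
    (diffusion forcing), conditioned on per-player action histories.

    frames:  (B, T, C, H, W) clean video latents
    actions: (B, T, A)       action vectors for the controlling players
    """
    B, T = frames.shape[:2]
    # Diffusion forcing: each frame gets its own noise level t in [0, 1],
    # so the model learns to denoise arbitrary mixes of past/future corruption.
    t = torch.rand(B, T, device=frames.device)
    noise = torch.randn_like(frames)
    t_ = t.view(B, T, 1, 1, 1)
    # Linear interpolation path used in flow matching: x_t = (1-t)*x0 + t*x1.
    x_t = (1.0 - t_) * frames + t_ * noise
    # Flow matching regresses the velocity field along the path: x1 - x0.
    target_velocity = noise - frames
    pred_velocity = model(x_t, t, actions)  # hypothetical signature
    return torch.mean((pred_velocity - target_velocity) ** 2)
```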
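The multi-player self-attention layers are described only at a high level; one common way to realize cross-view information exchange is to flatten the player axis into the attention sequence, which is also what makes the layer indifferent to the player count. Class name and shapes below are a hypothetical sketch, not the released architecture:

```python
import torch
import torch.nn as nn

class MultiPlayerAttention(nn.Module):
    """Self-attention over tokens pooled across all players' views.

    Because attention is permutation-invariant over the sequence, the same
    layer handles any number of players at inference, even if trained on two.
    """
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, P, N, D) -- batch, players, tokens per view, channels
        B, P, N, D = x.shape
        tokens = x.reshape(B, P * N, D)  # merge player and token axes
        h = self.norm(tokens)
        out, _ = self.attn(h, h, h)      # every view attends to every view
        return (tokens + out).reshape(B, P, N, D)

# Usage: trained on two players, run with three -- same weights.
layer = MultiPlayerAttention(dim=256)
print(layer(torch.randn(1, 3, 64, 256)).shape)  # torch.Size([1, 3, 64, 256])
```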
VerseCrafter: A 4D Steering Wheel for Video World Models, with Precise Camera and Object Control
机器之心· 2026-01-18 04:05
Core Insights
- The article presents VerseCrafter, a dynamic, realistic video world model that uses explicit 4D geometric control to enhance video generation [2][3]

Group 1: VerseCrafter Overview
- VerseCrafter was developed by researchers from Fudan University, Shanghai Chuangzhi Academy, the University of Hong Kong, and Tencent PCG ARC Lab to address a core limitation of existing video models: they operate as 2D playback, while the real world is inherently 4D [2][3]
- Its core idea is to drive video generation from a unified 4D geometric world state, enabling decoupled yet coordinated control of camera and object motion [5][31]

Group 2: Technical Innovations
- VerseCrafter introduces a unified 4D geometric control representation based on 3D Gaussians, moving beyond traditional 2D control signals to a flexible representation of object motion (a projection sketch follows this summary) [9][11]
- The framework freezes the video prior of the powerful open-source generation model Wan2.1 and adds a lightweight GeoAdapter, preserving generation quality while injecting precise 4D control (see the adapter sketch below) [12][13]

Group 3: Data Collection and Training
- The VerseControl4D dataset addresses the difficulty of obtaining large volumes of real-world video with precise 4D annotations, filling a significant gap in training data for such models [15][19]
- The dataset comprises 35,000 training video clips, with automated annotation tools extracting 4D geometric information from high-quality video corpora [24]

Group 4: Experimental Results
- VerseCrafter outperforms existing state-of-the-art methods on multiple metrics, showing notably stable control of both camera movement and object dynamics in complex scenes [21][22]
- In static scenes it also excels as a "scene roaming" tool, preserving structural integrity and texture clarity under large camera movements [27][28]
- The model supports multi-view generation, producing consistent videos of the same dynamic event from different perspectives [29]

Group 5: Implications and Future Applications
- VerseCrafter marks an advance for video generation toward controllable 4D world simulation, opening possibilities for game development, film pre-visualization, and embodied-intelligence simulation [31]
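The summary does not spell out how 3D Gaussian control signals become conditioning images, so the following sketch shows one plausible reduction: animate Gaussian centers with an object pose, project them through the camera, and splat into a 2D control map. The function name, arguments, and the occupancy-style splat (ignoring covariances) are assumptions:

```python
import torch

def render_control_map(centers, obj_pose, cam_extrinsic, cam_intrinsic, hw):
    """Project 3D Gaussian centers into a 2D occupancy-style control map.

    centers:       (N, 3) Gaussian centers for one object (world frame)
    obj_pose:      (4, 4) rigid transform animating the object this frame
    cam_extrinsic: (4, 4) world-to-camera transform for this frame
    cam_intrinsic: (3, 3) pinhole intrinsics
    """
    H, W = hw
    ones = torch.ones(len(centers), 1)
    pts = torch.cat([centers, ones], dim=1)            # homogeneous (N, 4)
    pts = (cam_extrinsic @ obj_pose @ pts.T).T[:, :3]  # move object, then to camera
    pts = pts[pts[:, 2] > 1e-3]                        # keep points in front of camera
    uv = (cam_intrinsic @ pts.T).T
    uv = (uv[:, :2] / uv[:, 2:3]).long()               # perspective divide
    keep = (uv[:, 0] >= 0) & (uv[:, 0] < W) & (uv[:, 1] >= 0) & (uv[:, 1] < H)
    uv = uv[keep]
    ctrl = torch.zeros(H, W)
    ctrl[uv[:, 1], uv[:, 0]] = 1.0                     # splat (no covariance, for brevity)
    return ctrl
```

Decoupled control falls out naturally in this framing: camera motion only changes `cam_extrinsic`, while object motion only changes `obj_pose`.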
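The GeoAdapter's internals are not given in the summary; the sketch below shows the generic frozen-backbone adapter pattern (zero-initialized output projection, in the ControlNet style) that such designs commonly use. All names and shapes are hypothetical:

```python
import torch
import torch.nn as nn

class GeoAdapterBlock(nn.Module):
    """Lightweight adapter adding control features into a frozen backbone.

    The zero-initialized output projection makes training start as an
    identity over the frozen video prior, so generation quality is
    preserved while 4D control is gradually learned.
    """
    def __init__(self, dim: int, ctrl_dim: int):
        super().__init__()
        self.proj_in = nn.Linear(ctrl_dim, dim)
        self.proj_out = nn.Linear(dim, dim)
        nn.init.zeros_(self.proj_out.weight)
        nn.init.zeros_(self.proj_out.bias)

    def forward(self, hidden: torch.Tensor, ctrl: torch.Tensor) -> torch.Tensor:
        # hidden: (B, N, dim)      frozen-backbone activations (not trained)
        # ctrl:   (B, N, ctrl_dim) tokens from the rendered 4D control maps
        return hidden + self.proj_out(torch.relu(self.proj_in(ctrl)))
```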
New Breakthrough for "Video World Models": AI Generates 5 Minutes of Continuous Video Without Visual Collapse
机器之心· 2025-12-31 09:31
Core Insights
- The article discusses AI-generated video and the challenge of producing videos that not only look realistic but also obey the laws of the physical world, the focus of the "video world model" [2]
- The LongVie 2 framework is introduced to generate high-fidelity, controllable videos up to 5 minutes long, addressing the limitations of existing models [2][6]

Group 1: Challenges in Current Video Models
- Current video world models share a common failure mode: as generation length grows, controllability, visual fidelity, and temporal consistency all degrade [6]
- This quality degradation in long video generation has been nearly unavoidable, with visual drift and logical inconsistencies becoming the key bottlenecks [2][12]

Group 2: LongVie 2 Framework
- LongVie 2 uses a three-stage progressive training strategy to strengthen controllability, stability, and temporal consistency [9][14]
- Stage 1 focuses on Dense & Sparse multimodal control, pairing dense signals (such as depth maps) with sparse signals (such as keypoint trajectories) to impose stable, interpretable world constraints (a fusion sketch follows this summary) [9]
- Stage 2 introduces degradation-aware training: the model learns to generate stably from imperfect inputs, which markedly improves long-horizon visual fidelity (see the degradation sketch below) [13]
- Stage 3 adds historical-context modeling, explicitly integrating information from previous segments to smooth transitions and reduce semantic breaks (see the autoregressive sketch below) [14]

Group 3: Performance Metrics
- LongVie 2 demonstrates better controllability than existing methods, reaching state-of-the-art (SOTA) levels across multiple metrics [21][29]
- Ablation studies validate the three-stage training design, showing gains in quality, controllability, and temporal consistency across multiple indicators [26]

Group 4: LongVGenBench
- The article introduces LongVGenBench, the first standardized benchmark for controllable long video generation, containing 100 high-resolution videos each over one minute long [28]
- The benchmark is meant to enable systematic research and fair evaluation of long video generation [28]
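Stage 1's pairing of dense and sparse signals can be pictured as building a single multi-channel conditioning tensor; the sketch below rasterizes keypoint trajectories alongside depth maps. The shapes and channel layout are assumptions, not LongVie 2's actual encoding:

```python
import torch

def build_control_tensor(depth, keypoints):
    """Fuse dense and sparse control signals into one conditioning tensor.

    depth:     (T, 1, H, W) dense depth maps, one per frame
    keypoints: (T, K, 2)    pixel (u, v) trajectories of K tracked points
    """
    T, _, H, W = depth.shape
    sparse = torch.zeros(T, 1, H, W)
    for t in range(T):
        uv = keypoints[t].round().long()
        keep = (uv[:, 0] >= 0) & (uv[:, 0] < W) & (uv[:, 1] >= 0) & (uv[:, 1] < H)
        uv = uv[keep]
        sparse[t, 0, uv[:, 1], uv[:, 0]] = 1.0  # rasterize trajectory points
    # Dense constraint (depth) + sparse constraint (trajectories) per frame.
    return torch.cat([depth, sparse], dim=1)    # (T, 2, H, W)
```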
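Stage 2's degradation-aware training amounts to corrupting the conditioning inputs during training so the model already sees, at train time, the kinds of artifacts it will produce at inference. A minimal sketch, with the specific degradations (down/up-sampling blur, additive noise) and the mixing schedule chosen purely for illustration:

```python
import torch
import torch.nn.functional as F

def degrade(frames: torch.Tensor, strength: float = 0.5) -> torch.Tensor:
    """Synthetically degrade conditioning frames (B, C, H, W, values in [0, 1]).

    Mimics the artifacts that accumulate during autoregressive long-video
    generation, so the model learns to generate cleanly from imperfect
    context instead of amplifying its own errors.
    """
    B, C, H, W = frames.shape
    # Blur via down/up-sampling (loss of high-frequency detail).
    low = F.interpolate(frames, scale_factor=0.5, mode="bilinear", align_corners=False)
    blurred = F.interpolate(low, size=(H, W), mode="bilinear", align_corners=False)
    # Additive noise (generation artifacts).
    noisy = blurred + strength * 0.05 * torch.randn_like(frames)
    # Random per-sample mix between clean and degraded context.
    alpha = strength * torch.rand(B, 1, 1, 1, device=frames.device)
    return ((1 - alpha) * frames + alpha * noisy).clamp(0, 1)
```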
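Stage 3's historical-context modeling can be read as autoregressive segment generation where each new segment is explicitly conditioned on the tail of the video generated so far. The `model` signature and the segment/context lengths below are hypothetical:

```python
import torch

def generate_long_video(model, controls, seg_len=16, ctx_len=4):
    """Autoregressive long-video generation with explicit historical context.

    controls: (T, ...) per-frame control signals (dense + sparse, premixed)
    Each segment is conditioned on the last `ctx_len` generated frames,
    which smooths transitions and reduces semantic breaks across segments.
    """
    video = []
    T = controls.shape[0]
    for start in range(0, T, seg_len):
        history = torch.stack(video[-ctx_len:]) if video else None
        segment = model(controls[start:start + seg_len], history)  # hypothetical signature
        video.extend(segment.unbind(0))
    return torch.stack(video)
```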
From NVIDIA's Robotics Lead: An Annual Review of Embodied AI Robotics
具身智能之心· 2025-12-29 12:50
Core Insights
- According to Jim Fan, NVIDIA's robotics lead, the robotics field is still in its early stages, marked by a lack of standardized evaluation metrics and a widening gap between hardware progress and software reliability [1][8][11]

Group 1: Hardware and Software Disparity
- Advances in robotics hardware, such as Optimus and e-Atlas, are outpacing software development, leaving hardware capabilities underutilized [14][15]
- Robots still require large operations teams: they do not self-repair and routinely suffer issues such as overheating and motor failures [16][17]
- Hardware reliability is critical because errors can have irreversible consequences, eroding patience with, and limiting the scalability of, robot deployments [18][19]

Group 2: Benchmarking Challenges
- The lack of benchmarking consensus is a significant problem: with no standardized hardware platforms or task definitions, everyone can claim state-of-the-art (SOTA) results [20][21]
- The field must stop treating reproducibility and scientific rigor as secondary concerns [23]

Group 3: VLA Model Insights
- The Vision-Language-Action (VLA) model is the dominant paradigm in robotics today, but its reliance on pre-trained Vision-Language Models (VLMs) is problematic because VLM pre-training is misaligned with physical-world tasks [25][49]
- VLA performance does not scale linearly with VLM parameter count, since the pre-training objectives do not match the requirements of physical interaction [26][51]
- Future VLA models should integrate physics-driven world models to better understand and act in the physical environment (a skeleton of the VLA pattern follows this summary) [50]

Group 4: Data Importance
- Data is central to shaping model capabilities, and diverse data sources and collection methods are needed [31][43]
- New hardware and data-collection efforts, such as Generalist and Egocentric-10K, underline the growing importance of data in robotics [36][42]
- Data collection strategy remains an open question, with multiple approaches still being explored [43]

Group 5: Industry Trends
- The robotics industry is projected to grow from $91 billion today to $25 trillion by 2050, indicating strong long-term potential [57]
- Major tech companies, with the exceptions of Microsoft and Anthropic, are increasingly investing in robotics software and hardware [59]
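For readers unfamiliar with the paradigm the review critiques, here is a skeleton of the VLA pattern: a vision-language backbone fuses an observation with instruction tokens, and a small head decodes continuous actions. This is a toy, randomly initialized stand-in (in practice the backbone is a pre-trained VLM); all dimensions and names are made up:

```python
import torch
import torch.nn as nn

class MinimalVLA(nn.Module):
    """Toy vision-language-action policy: image + instruction -> action."""
    def __init__(self, vocab=1000, dim=256, action_dim=7):
        super().__init__()
        self.vision = nn.Sequential(nn.Conv2d(3, dim, 16, 16), nn.Flatten(2))
        self.text = nn.Embedding(vocab, dim)
        self.fuse = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, 8, batch_first=True), num_layers=2)
        self.action_head = nn.Linear(dim, action_dim)  # e.g. 6-DoF pose + gripper

    def forward(self, image, tokens):
        vis = self.vision(image).transpose(1, 2)  # (B, patches, dim)
        txt = self.text(tokens)                   # (B, words, dim)
        h = self.fuse(torch.cat([vis, txt], dim=1))
        return self.action_head(h.mean(dim=1))    # (B, action_dim)

policy = MinimalVLA()
action = policy(torch.randn(1, 3, 224, 224), torch.randint(0, 1000, (1, 12)))
print(action.shape)  # torch.Size([1, 7])
```

The review's core criticism maps directly onto this skeleton: scaling up the `fuse` backbone helps little if its pre-training objective never involved physical interaction, which is why it argues for coupling VLAs with physics-driven world models.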