A 70x Boost in World-Model Inference Efficiency: Shanghai AI Lab Cracks the Long-Term Memory and Interaction Bottlenecks with "Constant Compute"
量子位 · 2026-01-09 04:09
Core Insights

- The article traces generative AI's shift from static images to dynamic video, arguing that a "world model" that understands physical laws, maintains long-term memory, and supports real-time interaction is a pathway toward Artificial General Intelligence (AGI) [3].

Group 1: Yume Project Overview

- The Yume project, developed by Shanghai AI Lab in collaboration with several top institutions, has released Yume1.0 and Yume1.5, described as the first fully open-source world models aimed at real-world applications [3][4].
- Yume1.5 introduces a core architectural innovation, Time-Space Channel Modeling (TSCM), which targets the memory bottleneck of long video generation [4][11].

Group 2: Technical Innovations

- TSCM pairs unified context compression with a linear attention mechanism to keep the memory cost of long video generation bounded [5]; a linear-attention sketch appears at the end of this summary.
- The framework integrates long-term memory, real-time reasoning, and "text + keyboard" interactive control into a single system, demonstrating a feasible engineering path for world models [2].

Group 3: Data Utilization

- Yume is trained on the Sekai dataset, 5,000 hours of high-quality first-person (POV) video covering 750 cities [8].
- Yume1.5 additionally incorporates a high-quality text-to-video (T2V) synthesis dataset and a specialized event dataset for generating events such as "sudden ghost appearances" [10].

Group 4: TSCM Mechanism

- TSCM's compression runs two parallel streams, time-space compression and channel compression, which together cut the number and size of the tokens the model must attend over [16]; see the compression sketch after this summary.
- Time-space compression downsamples historical frames spatially while retaining coarse visual detail, and channel compression shrinks the channel dimension to raise processing efficiency [19][23].

Group 5: Performance Evaluation

- Yume1.5 achieved an instruction-following (IF) score of 0.836, validating its control methods, and cut generation time from 572 seconds in Yume1.0 to 8 seconds, roughly the 70x speed-up in the headline [29].
- An ablation study that replaced TSCM with simple spatial compression dropped the instruction-following score from 0.836 to 0.767, underlining TSCM's contribution [30][32].

Group 6: Future Prospects

- Open-sourcing Yume and its datasets is expected to accelerate research on world models, and the distinction between "real" and "generated" content may blur further in the near future [38].
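Below is a minimal sketch of the dual-stream compression TSCM is described as using: one stream spatially downsamples historical frames, the other shrinks their channel dimension, and the two run in parallel. Everything here, the module name TSCMCompressor, the pooling factors, and the tensor shapes, is an illustrative assumption based only on the article's description, not Yume1.5's released code.

```python
# A hypothetical sketch of TSCM-style dual-stream context compression.
# Assumed, not from Yume1.5: module name, ratios, and latent layout (B, T, C, H, W).
import torch
import torch.nn as nn
import torch.nn.functional as F


class TSCMCompressor(nn.Module):
    def __init__(self, channels: int, channel_ratio: int = 4, spatial_ratio: int = 2):
        super().__init__()
        # Channel stream: a learned projection that shrinks every token's channels.
        self.channel_proj = nn.Linear(channels, channels // channel_ratio)
        self.spatial_ratio = spatial_ratio

    def forward(self, history: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        """history: (B, T, C, H, W) latent frames from earlier in the video."""
        b, t, c, h, w = history.shape

        # Time-space stream: downsample each historical frame spatially, keeping
        # the coarse layout while cutting token count by spatial_ratio ** 2.
        ts = F.avg_pool2d(history.flatten(0, 1), self.spatial_ratio)  # (B*T, C, h', w')
        ts_tokens = ts.flatten(2).transpose(1, 2).reshape(b, -1, c)   # (B, T*h'*w', C)

        # Channel stream: keep full spatial resolution but shrink the channels.
        ch_tokens = history.flatten(3).permute(0, 1, 3, 2)            # (B, T, H*W, C)
        ch_tokens = self.channel_proj(ch_tokens).reshape(b, t * h * w, -1)

        return ts_tokens, ch_tokens
```

With 8 history frames of 32x32 latents at 64 channels, the time-space stream returns 2,048 tokens instead of 8,192, and the channel stream keeps all 8,192 positions but at 16 channels, so each stream hands the downstream attention layers roughly a quarter of the original data.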
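The article also credits a linear attention mechanism for keeping per-step compute roughly constant as the video grows. The standard kernelized formulation below (the elu feature-map variant of Katharopoulos et al.) replaces softmax(QK^T)V, whose cost grows quadratically with the token count N, by a product that is linear in N; the article does not state which variant Yume1.5 uses, so treat this as a representative sketch rather than the paper's method.

```python
# Standard non-causal linear attention (elu feature map), O(N * D^2) instead of O(N^2 * D).
import torch
import torch.nn.functional as F


def linear_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                     eps: float = 1e-6) -> torch.Tensor:
    """q, k, v: (B, N, D). Returns attention output of shape (B, N, D)."""
    q = F.elu(q) + 1.0                          # positive feature map phi(q)
    k = F.elu(k) + 1.0                          # positive feature map phi(k)
    kv = torch.einsum("bnd,bne->bde", k, v)     # (B, D, D) summary of all keys/values
    z = k.sum(dim=1)                            # (B, D) normalizer term
    num = torch.einsum("bnd,bde->bne", q, kv)   # numerator: phi(q) applied to summary
    den = torch.einsum("bnd,bd->bn", q, z).unsqueeze(-1) + eps
    return num / den
```

Because the (D, D) summary kv is independent of N, compressed history tokens can be folded into the context without the quadratic blow-up that makes long-video attention memory-bound, which is consistent with the "constant compute" framing in the headline.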