World Models
Efficiency Law: A New World-Model-Engine-Driven Learning Paradigm for Embodied Intelligence
具身智能之心· 2025-10-28 00:02
Core Insights
- The article emphasizes the importance of addressing data generation issues in embodied intelligence, highlighting that previously overlooked data problems are fundamental to the successful implementation of this technology [2][5]

Group 1: Efficiency Law and Scaling Law
- The article introduces the "Efficiency Law," derived from the limitations of the "Scaling Law" in embodied intelligence. The Efficiency Law posits that the performance of embodied models is significantly influenced by the rate of high-quality data generation (r_D) within a limited timeframe [5][6]
- A higher data generation rate (r_D) enhances learning efficiency, while a lower rate pushes models into a "data scarcity zone" that hinders performance [6][20]

Group 2: World Models and Physical Accuracy
- The necessity of absolute physical accuracy in world models is discussed, as embodied intelligence relies on understanding real-world physics to execute actions effectively; models must adhere to physical laws to ensure reliable learning and decision-making [9][12]
- Current video-based world models are criticized for lacking physical correctness, as they focus primarily on visual realism rather than on accurately simulating physical dynamics [8][12]

Group 3: GS-World and Its Applications
- The GS-World model is presented as a novel approach that integrates generative models with physical simulation engines, allowing the generation of physically accurate environments and interactions and addressing the shortcomings of traditional video-based models [11][13]
- GS-World is positioned as a transformative engine for embodied intelligence, enabling autonomous generation of training data and high-fidelity strategy validation in simulated environments [15][20]
Group 4: Engine-Driven Learning Paradigm
- The article outlines a shift from data-driven to engine-driven learning paradigms in embodied intelligence, where the GS-World engine allows continuous interaction and feedback, fostering a self-evolving learning system [24][25]
- The new paradigm emphasizes generating and simulating physical worlds, enabling agents to learn and adapt through real-time interaction rather than relying solely on historical data [24][28]

Group 5: Robustness and Generalization
- Embodied intelligence systems must achieve product-level success rates and robustness against environmental disturbances; the engine-driven learning paradigm is deemed essential for developing reliable and trustworthy intelligent products [27][29]
- GS-World is described as a critical platform for evolving robotic skills, allowing skills to emerge naturally through interaction within a physically accurate simulated environment [31][32]
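The Efficiency Law described above can be made concrete with a hedged formalization. This equation is an illustration, not one printed in the article: only the rate symbol r_D appears in the source, while the scaling exponent α and the time budget T are assumptions introduced here.

```latex
% Usable data under a fixed wall-clock budget T, at generation rate r_D:
D(T) = r_D \, T
% Plugging this into a standard power-law scaling of performance with data:
P \;\propto\; D(T)^{\alpha} \;=\; (r_D \, T)^{\alpha}, \qquad \alpha > 0
```

Read this way, the claim is that within a fixed time budget the attainable performance is capped by r_D: a low rate leaves the model in the "data scarcity zone" regardless of compute, while raising r_D shifts the whole scaling curve upward.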
Releasing and Open-Sourcing a Video Generation Model, Meituan Advances Quietly in the AI Race
Bei Jing Shang Bao· 2025-10-27 12:33
Core Insights
- Meituan is advancing in the large-model sector while facing fierce competition in the food delivery market, recently releasing and open-sourcing the LongCat-Video model, which can stably generate long videos of up to 5 minutes [2][4]
- The company has made significant progress in large models, having released three major models since September, including LongCat-Flash-Chat and LongCat-Flash-Thinking, both achieving state-of-the-art (SOTA) performance across various tasks [3][8]
- Meituan's strategic shift from "Food + Platform" to "Retail + Technology" makes AI, robotics, and autonomous driving core future directions, integrating these technologies into its business operations [7][8]

Model Developments
- The LongCat-Flash-Chat model features a mixture-of-experts architecture with 560 billion parameters, optimizing both computational efficiency and performance [3]
- LongCat-Flash-Thinking has achieved SOTA in reasoning tasks across multiple domains, showcasing the company's commitment to advancing AI capabilities [3]
- LongCat-Video is designed for coherent long-video generation, demonstrating significant advantages over competitors in video generation tasks [4][5]

Industry Perspective
- Industry peers have mixed reactions to Meituan's advances in video generation, with some expressing skepticism about the significance of achieving SOTA in a largely closed-source field [5][6]
- The LongCat models are seen as a response to Meituan's internal content needs and potential applications in embodied intelligence [5][6]

Strategic Vision
- Meituan's LongCat team views the video generation model as a step toward exploring "world models," aiming to bridge the digital and physical worlds through advanced AI technologies [7]
- The company's AI strategy includes enhancing employee efficiency, transforming existing products with AI, and developing proprietary large models, with a notable increase in API usage from 10% to 68% [8]
Meituan Releases and Open-Sources a Video Generation Model: Some Metrics On Par with Google's Most Advanced Model, Veo3
Guan Cha Zhe Wang· 2025-10-27 10:52
Core Insights
- Meituan's LongCat team has released and open-sourced the LongCat-Video model, achieving state-of-the-art (SOTA) performance in video generation tasks based on text and images [1][3]

Group 1: Model Features
- LongCat-Video can generate coherent videos up to 5 minutes long, addressing common issues such as frame drift and color inconsistency found in other models [3][6]
- The model supports 720p resolution at 30 frames per second, using mechanisms such as video-continuation pre-training and block sparse attention to maintain temporal consistency and visual stability [6][9]
- LongCat-Video's inference speed has been increased 10.1x through a combination of two-stage coarse-to-fine generation, block sparse attention, and model distillation [6][8]

Group 2: Evaluation and Performance
- In internal evaluations, LongCat-Video was assessed on text alignment, visual quality, motion quality, and overall quality, with a correlation of 0.92 between human and automated evaluations [8][12]
- The model's visual quality score is nearly on par with Google's Veo3, and it surpasses models such as PixVerse-V5 and Wan2.2 in overall quality [8][12]
- LongCat-Video scored 70.94% in commonsense understanding, ranking first among open-source models, with an overall score of 62.11%, trailing only proprietary models such as Veo3 and Vidu Q1 [12]

Group 3: Future Applications
- The release of LongCat-Video is a significant step for Meituan toward building "world models," which are essential for simulating physical laws and scene logic in AI [3][13]
- Future applications may include autonomous driving simulation and embodied intelligence, where long-sequence modeling is crucial [13]
Meituan Open-Sources Its First Large Video Model, with a 900% Speed Surge
36Ke · 2025-10-27 09:13
Core Insights
- Meituan has launched its first video generation model, LongCat-Video, designed for multi-task video generation and supporting text-to-video, image-to-video, and video continuation [1][2]
- LongCat-Video addresses the challenge of generating long videos, natively supporting outputs of up to 5 minutes while maintaining high temporal consistency and visual stability [1]
- The model significantly improves inference efficiency, achieving a speed increase of over 900% through a two-stage generation strategy and block sparse attention mechanisms [1][10][13]

Model Features
- LongCat-Video uses a unified task framework that handles three types of video generation tasks within a single model, reducing complexity and enhancing performance [9][10]
- The architecture is based on a Diffusion Transformer, combining the strengths of diffusion models with long-sequence modeling [7]
- A three-stage training process progressively learns from low- to high-resolution video tasks and incorporates reinforcement learning to optimize performance across diverse tasks [9][10]

Performance Evaluation
- In the VBench public benchmark, LongCat-Video ranked second overall, taking first place in "commonsense understanding" at 70.94% and outperforming several closed-source models [2][20]
- The model shows strong visual quality and motion fluidity, though there is room for improvement in text alignment and image consistency [19][20]
- LongCat-Video's visual quality score is nearly on par with Google's Veo3, indicating competitive capabilities in the video generation landscape [17][20]

Future Implications
- Meituan views LongCat-Video as a foundational step toward developing "world models," which could enhance its capabilities in robotics and autonomous driving [22]
- The model's ability to generate realistic video content may enable better modeling of physical knowledge and integration with large language models in future applications [22]
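Block sparse attention, one of the speed-up mechanisms named in the summaries above, can be illustrated with a minimal sketch. This is a hypothetical NumPy illustration of the general technique, not LongCat-Video's actual implementation: the block size and the choice to keep only each query block's own and preceding key block are assumptions made here for clarity.

```python
import numpy as np

def block_sparse_attention(q, k, v, block=4):
    """Attention in which each query block attends only to its own key block
    and the immediately preceding one; all other block pairs are masked out,
    so most of the score matrix never contributes to the softmax."""
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    mask = np.full((n, n), -np.inf)           # start with everything masked
    for qb in range(0, n, block):
        for kb in range(max(0, qb - block), qb + block, block):
            mask[qb:qb + block, kb:kb + block] = 0.0   # unmask kept block pair
    scores = scores + mask
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
q = rng.normal(size=(8, 16))
k = rng.normal(size=(8, 16))
v = rng.normal(size=(8, 16))
out, weights = block_sparse_attention(q, k, v, block=4)
```

In a real video model the same idea is applied over spatio-temporal token blocks, which is where the inference savings on long sequences come from.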
AI Video Generation from the "World Understanding" Perspective: How Do Veo3 and Sora2 Fare? A New Benchmark Arrives
量子位· 2025-10-27 08:26
Core Insights
- The article discusses significant advances in text-to-video (T2V) models, highlighting the recent success of Sora2 and questioning whether T2V models have achieved true "world model" capabilities [1]
- A new evaluation framework called VideoVerse has been proposed to assess T2V models on their understanding of event causality, physical laws, and common sense, which are essential for a "world model" [1][3]

Evaluation Framework
- VideoVerse evaluates T2V models from two perspectives: dynamic aspects (event following, mechanics, interaction, material properties, camera control) and static aspects (natural constraints, common sense, attribution correctness, 2D layout, 3D depth) [3]
- Each prompt corresponds to several binary evaluation questions, with event following measured through sequence consistency using the Longest Common Subsequence (LCS) [4][16]

Prompt Construction
- The team uses a multi-stage process to ensure the authenticity, diversity, and evaluability of prompts, sourcing material from daily life, scientific experiments, and science fiction [8][9]
- Event and causal structures are extracted with advanced language models that convert natural-language descriptions into event-level structures, laying the groundwork for evaluating "event following" [10][11]

Evaluation Methodology
- The evaluation combines QA and LCS scoring, covering event following, dimension-specific questions, and an overall score that reflects both logical sequence and physical detail [5][18]
- Hidden semantics are introduced to assess whether models can generate implicit consequences that are not explicitly stated in prompts [20][22]

Experimental Findings
- Evaluating a range of open-source and closed-source models, the team found that open-source models perform comparably on basic dimensions but lag significantly in world-model capabilities [28]
- Even the strongest closed-source model, Sora2, shows notable deficiencies in "hidden semantics following" and certain physical/material inferences [29]

Conclusion and Future Directions
- VideoVerse provides a comprehensive evaluation framework aimed at shifting the focus from merely generating realistic visuals to understanding and simulating the world [40]
- The team has open-sourced the data, evaluation code, and a leaderboard, encouraging further research into world-model capabilities [41]
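The LCS-based "event following" score mentioned above can be sketched as follows. This is a minimal illustration of the general technique (dynamic-programming LCS over event sequences); the exact event matching and normalization VideoVerse uses are assumptions here, with the prompt length chosen as the denominator.

```python
def lcs_length(a, b):
    """Classic dynamic-programming longest common subsequence length."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1   # events match: extend LCS
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]

def event_following_score(prompt_events, detected_events):
    """Fraction of the prompt's event sequence reproduced, in order,
    in the events detected from the generated video."""
    if not prompt_events:
        return 1.0
    return lcs_length(prompt_events, detected_events) / len(prompt_events)

# The video reproduces two of three prompted events in the right order.
score = event_following_score(
    ["pour water", "cup fills", "cup overflows"],
    ["pour water", "cup overflows"],
)
```

Because LCS only rewards events that appear in the prompted order, a video that shows all events but shuffles their causal sequence scores lower than one that follows the script.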
Tesla's World Simulator Debuts at ICCV, with Its VP Demystifying the End-to-End Autonomous Driving Roadmap
36Ke · 2025-10-27 08:11
Core Insights
- Tesla has unveiled a world simulator for generating realistic driving scenarios, presented by Ashok Elluswamy at the ICCV conference, emphasizing that the future of intelligent driving lies in end-to-end AI [1][5][24]

Group 1: World Simulator Features
- The world simulator can create new, challenging scenarios for autonomous driving tasks, such as vehicles suddenly changing lanes or the AI navigating around pedestrians and obstacles [2]
- The generated scenario videos serve dual purposes: training autonomous driving models and providing a gaming experience for human users [2][4]

Group 2: End-to-End AI Approach
- Elluswamy emphasized that end-to-end AI is the future of autonomous driving, using data from various sensors to generate control commands for vehicles [5][8]
- The end-to-end approach contrasts with modular systems, which are easier to develop initially but lack the optimization and scalability of end-to-end systems [8][10]

Group 3: Challenges and Solutions
- A major challenge for end-to-end autonomous driving is evaluation, which the world simulator addresses by using a vast dataset to synthesize future states from current conditions [11]
- The complexity of real-world data, such as high frame rates and multiple sensor inputs, leads to a "curse of dimensionality," which Tesla mitigates by collecting extensive driving data to improve model generalization [13][15]

Group 4: Industry Perspectives
- The industry is divided between two main approaches to end-to-end autonomous driving, VLA (Vision-Language-Action) and world models, with companies adopting different strategies [24]
- Tesla's choice of the end-to-end approach has drawn attention thanks to its track record in autonomous driving, raising questions about the technology's future direction [24]
Meituan Officially Releases and Open-Sources LongCat-Video, Raising Video Inference Speed to 10.1x
Zheng Quan Ri Bao Wang· 2025-10-27 08:06
Core Insights
- The LongCat team of Meituan has released and open-sourced the LongCat-Video video generation model, achieving state-of-the-art (SOTA) performance in the foundational tasks of text-to-video and image-to-video generation, with significant advantages in long-video generation [1][2]
- The model is seen as a crucial step toward building "world models," which are essential for the next generation of artificial intelligence, allowing AI to understand and simulate the real world [1]

Technical Features
- LongCat-Video is based on a Diffusion Transformer architecture and supports three core tasks: text-to-video with no conditional frames, image-to-video with one reference frame, and video continuation using multiple preceding frames, forming a complete task loop [2]
- The model can generate stable 5-minute videos without quality loss, addressing industry pain points such as color drift and motion discontinuity while ensuring temporal consistency and physically realistic motion [2]
- LongCat-Video employs a three-tier optimization strategy (coarse-to-fine generation, block sparse attention, and model distillation) that raises video inference speed 10.1x, striking a balance between efficiency and quality [2]

Performance Evaluation
- The evaluation covers both internal and public benchmarks across text-to-video and image-to-video tasks, focusing on dimensions such as text alignment, image alignment, visual quality, motion quality, and overall quality [3]
- With 13.6 billion parameters, LongCat-Video has achieved SOTA performance among open-source models for both text-to-video and image-to-video tasks, showing significant advantages in key metrics such as text alignment and motion coherence [3]
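The "complete task loop" described above, one model whose three tasks differ only in how many conditioning frames are supplied, can be sketched schematically. This is a hypothetical illustration of the framing, not Meituan's code; the function name `generation_task` is an assumption.

```python
def generation_task(condition_frames):
    """Map the number of conditioning frames to the generation task,
    mirroring the unified framing described for LongCat-Video:
    zero frames, one reference frame, or multiple preceding frames."""
    n = len(condition_frames)
    if n == 0:
        return "text-to-video"       # no conditional frames at all
    if n == 1:
        return "image-to-video"      # a single reference frame
    return "video-continuation"      # several preceding frames to extend
```

Under this framing, long-video generation reduces to repeatedly invoking the video-continuation case on the model's own previous output.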
Musk's "World Simulator" Revealed for the First Time: Distilling 500 Years of Human Driving Experience in a Day, with Optimus Evolving on the Same Brain
36Ke · 2025-10-27 07:34
Core Insights
- Tesla has unveiled its "World Simulator," a neural network that ingests the equivalent of 500 years of human driving experience daily to evolve in an unbounded virtual environment, and which can also be used by its humanoid robot, Optimus [1][2][3]

Group 1: Technology Overview
- The "World Simulator" generates varied driving scenarios, including rare situations such as pedestrians crossing the road and vehicles cutting in, allowing the AI to simulate and test responses in a controlled environment [2][3]
- Tesla uses an "end-to-end" neural network for autonomous driving, processing raw data from multiple cameras and other inputs to generate driving commands directly, without separate modules for perception, prediction, and planning [6][7][9]
- This approach lets the AI learn human-like decision-making and reduces information loss between modules, improving overall system performance [13][16]

Group 2: Data Utilization
- Tesla's fleet generates a vast amount of data, equivalent to 500 years of human driving experience per day, which is filtered to extract high-quality learning samples for the AI [25][27]
- The AI's ability to generalize to complex scenarios, such as predicting vehicle behavior in adverse weather, is attributed to exposure to diverse driving conditions [30]

Group 3: Simulation Capabilities
- The "World Simulator" can evaluate new AI models in a closed-loop environment, recreate real-world dangerous scenarios for testing, and generate extreme situations that probe the AI's limits [46]
- The simulator serves as a foundational AI engine extending beyond automotive applications, also applying to Tesla's humanoid robot project, Optimus [47][48]
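The modular-versus-end-to-end contrast above can be sketched abstractly. This is a toy illustration of the architectural difference only; every function name and stand-in below is hypothetical and bears no relation to Tesla's actual system.

```python
def modular_pipeline(frames, perceive, predict, plan):
    """Three hand-engineered stages; each hand-off passes along only what
    the previous stage chose to represent, so information can be lost."""
    return plan(predict(perceive(frames)))

def end_to_end_pipeline(frames, policy):
    """One learned mapping from raw sensor frames to a control command."""
    return policy(frames)

# Toy stand-ins: perception keeps only detected objects and drops the
# "rain" signal, while the end-to-end policy still sees the raw frames.
frames = [{"pixels": [0.1, 0.9], "rain": True}]
perceive = lambda fs: [{"objects": ["car"]} for f in fs]
predict = lambda objs: objs
plan = lambda objs: "slow-down" if objs else "cruise"
policy = lambda fs: "slow-down" if fs[0]["rain"] else "cruise"

modular_cmd = modular_pipeline(frames, perceive, predict, plan)
e2e_cmd = end_to_end_pipeline(frames, policy)
```

The point of the sketch is structural: in the modular pipeline the planner can only react to what perception emitted, whereas the end-to-end policy is free to learn which raw cues matter.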
Meituan Releases the LongCat-Video Video Generation Model: Capable of Outputting 5-Minute Long Videos
Feng Huang Wang· 2025-10-27 07:32
Core Insights
- Meituan officially announced the release of the LongCat-Video video generation model, which is based on the Diffusion Transformer architecture and supports three core tasks: text-to-video, image-to-video, and video continuation [1]

Model Features
- LongCat-Video can generate high-definition video at 720p resolution and 30 frames per second, producing coherent content up to 5 minutes long [1]
- The model addresses common issues in long-video generation, such as frame breaks and quality degradation, maintaining temporal consistency and motion plausibility through video-continuation pre-training and block sparse attention mechanisms [1]

Efficiency and Performance
- The model employs two-stage generation, block sparse attention, and model distillation, reportedly achieving more than a 10x improvement in inference speed [1]
- With a parameter count of 13.6 billion, LongCat-Video has shown strong text alignment and motion continuity in public benchmarks such as VBench [1]

Future Applications
- As part of the effort to build a "world model," LongCat-Video may find use in scenarios requiring long-sequence modeling, such as autonomous driving simulation and embodied intelligence [1]
- The release marks a significant advance for Meituan in video generation and physical-world simulation [1]
Meituan's Video Generation Model Is Here, Debuting as Open-Source SOTA
量子位· 2025-10-27 05:37
Core Viewpoint
- Meituan has launched an open-source video model named LongCat-Video, which supports text-to-video and image-to-video generation, marking significant progress in video generation technology [1][39]

Group 1: Model Features
- LongCat-Video has 13.6 billion parameters and can generate videos up to five minutes long, demonstrating a strong grasp of real-world physics and semantics [1][12][39]
- The model excels at generating 720p, 30 fps video with strong semantic understanding and visual presentation, ranking among the best open-source models [18][62]
- It maintains consistency across generated videos, handling challenges such as detail capture and complex lighting effects [19][24]

Group 2: Technical Innovations
- LongCat-Video integrates three main tasks (text-to-video, image-to-video, and video continuation) within a Diffusion Transformer framework [41]
- The model pre-trains directly on video-continuation tasks, mitigating the cumulative errors that plague long-video generation [46][48]
- It uses techniques such as block sparse attention and a coarse-to-fine generation paradigm to improve generation efficiency [52][53]

Group 3: Performance Evaluation
- In internal benchmarks, LongCat-Video outperformed models such as PixVerse-V5 and Wan2.2-T2V-A14B in overall quality, with strong visual quality and motion quality [62][63]
- The model achieved the top score on commonsense dimensions, indicating a superior ability to model the physical world [64]

Group 4: Broader Context
- This is not Meituan's first venture into AI; the company has previously released models including LongCat-Flash-Chat and LongCat-Flash-Thinking, showing its commitment to AI innovation [65][68]