World Models
Releasing and Open-Sourcing a Video Generation Model, Meituan Quietly Advances on the AI Track
Bei Jing Shang Bao· 2025-10-27 12:33
Core Insights
- Meituan is advancing in the large model sector while facing fierce competition in the food delivery market, recently releasing and open-sourcing the LongCat-Video model, which can stably generate long videos of up to 5 minutes [2][4]
- The company has made significant progress in large models, having released three major models since September, including LongCat-Flash-Chat and LongCat-Flash-Thinking, both achieving state-of-the-art (SOTA) performance in various tasks [3][8]
- Meituan's strategic shift from "Food+Platform" to "Retail+Technology" emphasizes AI, robotics, and autonomous driving as core future directions, integrating these technologies into its business operations [7][8]

Model Developments
- The LongCat-Flash-Chat model features a mixture-of-experts architecture with 560 billion parameters, optimizing both computational efficiency and performance (a routing sketch follows this summary) [3]
- LongCat-Flash-Thinking has achieved SOTA in reasoning tasks across multiple domains, showcasing the company's commitment to advancing AI capabilities [3]
- LongCat-Video is designed for coherent long video generation, demonstrating significant advantages in video generation tasks compared to competitors [4][5]

Industry Perspective
- Industry peers have mixed reactions to Meituan's advancements in video generation, with some expressing skepticism about the significance of achieving SOTA in a largely closed-source field [5][6]
- The LongCat models are seen as a response to Meituan's internal content needs and potential applications in embodied intelligence [5][6]

Strategic Vision
- Meituan's LongCat team views the video generation model as a step towards exploring "world models," aiming to bridge the digital and physical worlds through advanced AI technologies [7]
- The company's AI strategy includes enhancing employee efficiency, transforming existing products with AI, and developing proprietary large models, with a notable increase in API usage from 10% to 68% [8]
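For context on the mixture-of-experts design mentioned above, here is a minimal sketch of top-k expert routing, the general mechanism such architectures use so that only a small fraction of parameters is active per token. The layer sizes, expert count, and routing details below are illustrative assumptions, not LongCat-Flash-Chat's published configuration.

```python
import torch
import torch.nn as nn

class TopKMoELayer(nn.Module):
    """Minimal mixture-of-experts layer: a router scores experts per token,
    and each token is processed by only its top-k experts."""

    def __init__(self, d_model=64, d_ff=256, n_experts=8, k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        self.k = k

    def forward(self, x):
        # x: (tokens, d_model)
        gate = self.router(x).softmax(dim=-1)            # (tokens, n_experts)
        weights, idx = torch.topk(gate, self.k, dim=-1)  # keep top-k experts per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):        # dispatch tokens to their experts
            for slot in range(self.k):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

layer = TopKMoELayer()
print(layer(torch.randn(10, 64)).shape)   # (10, 64) -- only 2 of 8 experts run per token
```

The point of the sketch is the efficiency claim in the summary: total parameter count can be very large while per-token compute stays proportional to the few experts actually selected.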
Meituan Releases and Open-Sources a Video Generation Model: Some Metrics Rival Google's Most Advanced Model Veo3
Guan Cha Zhe Wang· 2025-10-27 10:52
Core Insights
- Meituan's LongCat team has released and open-sourced the LongCat-Video model, achieving state-of-the-art (SOTA) performance in video generation tasks based on text and images [1][3].

Group 1: Model Features
- LongCat-Video can generate coherent videos up to 5 minutes long, addressing common issues like frame drift and color inconsistency found in other models [3][6].
- The model supports 720p resolution and 30 frames per second, utilizing mechanisms like video continuation pre-training and block sparse attention to maintain temporal consistency and visual stability [6][9].
- LongCat-Video's inference speed has been enhanced by 10.1 times through a combination of two-stage coarse-to-fine generation, block sparse attention, and model distillation [6][8].

Group 2: Evaluation and Performance
- In internal evaluations, LongCat-Video was assessed on text alignment, visual quality, motion quality, and overall performance, with a high correlation of 0.92 between human and automated evaluations (see the sketch after this summary) [8][12].
- The model's visual quality score is nearly on par with Google's Veo3, surpassing other models like PixVerse-V5 and Wan2.2 in overall quality [8][12].
- LongCat-Video scored 70.94% in commonsense understanding, ranking first among open-source models, with an overall score of 62.11%, trailing only proprietary models like Veo3 and Vidu Q1 [12].

Group 3: Future Applications
- The release of LongCat-Video is a significant step for Meituan towards building "world models," which are essential for simulating physical laws and scene logic in AI [3][13].
- Future applications may include autonomous driving simulations and embodied intelligence, where long-sequence modeling is crucial [13].
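The 0.92 agreement figure reported above is a correlation between human and automated scores. As a small illustration, the snippet below computes a Pearson correlation on made-up example scores; the numbers are not from the LongCat evaluation.

```python
import numpy as np

def pearson_corr(human_scores, auto_scores):
    """Pearson correlation between two score lists."""
    h = np.asarray(human_scores, dtype=float)
    a = np.asarray(auto_scores, dtype=float)
    return float(np.corrcoef(h, a)[0, 1])

# Hypothetical per-prompt scores from a human panel and an automated judge.
human = [4.5, 3.0, 4.0, 2.5, 5.0, 3.5]
auto = [4.2, 3.1, 3.8, 2.8, 4.9, 3.3]
print(f"human-vs-auto correlation: {pearson_corr(human, auto):.2f}")
```

A high value on held-out prompts is what justifies substituting the automated judge for human raters at scale.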
Meituan Open-Sources Its First Large Video Model, With a 900% Speed Boost
36Kr· 2025-10-27 09:13
Core Insights
- Meituan has launched its first video generation model, LongCat-Video, designed for multi-task video generation, supporting text-to-video, image-to-video, and video continuation capabilities [1][2]
- LongCat-Video addresses the challenge of generating long videos, natively supporting outputs of up to 5 minutes while maintaining high temporal consistency and visual stability [1]
- The model significantly enhances inference efficiency, achieving a speed increase of over 900% by employing a two-stage generation strategy and block sparse attention mechanisms (a coarse-to-fine sketch follows this summary) [1][10][13]

Model Features
- LongCat-Video utilizes a unified task framework that allows it to handle three types of video generation tasks within a single model, reducing complexity and enhancing performance [9][10]
- The model architecture is based on a Diffusion Transformer structure, integrating diffusion model capabilities with long-sequence modeling advantages [7]
- A three-stage training process is implemented, progressively learning from low- to high-resolution video tasks and incorporating reinforcement learning to optimize performance across diverse tasks [9][10]

Performance Evaluation
- In the VBench public benchmark test, LongCat-Video scored second overall, with a notable first place in "common sense understanding" at 70.94%, outperforming several closed-source models [2][20]
- The model demonstrates strong performance in visual quality and motion fluidity, although there is room for improvement in text alignment and image consistency [19][20]
- LongCat-Video's visual quality score is nearly on par with Google's Veo3, indicating competitive capabilities in the video generation landscape [17][20]

Future Implications
- Meituan views LongCat-Video as a foundational step towards developing "world models," which could enhance its capabilities in robotics and autonomous driving [22]
- The model's ability to generate realistic video content may facilitate better modeling of physical knowledge and integration with large language models in future applications [22]
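To make the two-stage strategy concrete, here is a minimal sketch of coarse-to-fine video generation: a cheap low-resolution pass settles global layout and motion, then a short refinement pass adds resolution and detail while conditioned on the coarse result. The shapes, step counts, and denoiser signatures are assumptions for illustration, not LongCat-Video's actual implementation.

```python
import torch
import torch.nn.functional as F

def coarse_to_fine_generate(coarse_denoiser, fine_denoiser, prompt_emb,
                            coarse_shape=(8, 45, 80), fine_shape=(16, 180, 320),
                            coarse_steps=30, refine_steps=8):
    """Two-stage generation sketch. The denoisers are placeholder callables."""
    # Stage 1: many denoising steps on a small latent video (N, C, T, H, W).
    coarse = torch.randn(1, 4, *coarse_shape)
    for t in reversed(range(coarse_steps)):
        coarse = coarse_denoiser(coarse, t, prompt_emb)

    # Stage 2: upsample in time and space, lightly re-noise, and refine with far
    # fewer steps, conditioned on the upsampled coarse video so motion is preserved.
    upsampled = F.interpolate(coarse, size=fine_shape, mode="trilinear", align_corners=False)
    fine = upsampled + 0.1 * torch.randn_like(upsampled)
    for t in reversed(range(refine_steps)):
        fine = fine_denoiser(fine, t, prompt_emb, cond=upsampled)
    return fine

# Toy placeholder denoisers so the sketch runs end to end.
toy_coarse = lambda x, t, p: 0.98 * x
toy_fine = lambda x, t, p, cond: 0.9 * x + 0.1 * cond
print(coarse_to_fine_generate(toy_coarse, toy_fine, prompt_emb=None).shape)
```

Most of the compute-heavy denoising happens on the small latent, which is where the bulk of the reported speedup in this kind of pipeline comes from.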
AI Video Generation Through the Lens of "World Understanding": How Do Veo3 and Sora2 Measure Up? A New Benchmark Arrives
量子位· 2025-10-27 08:26
Core Insights
- The article discusses the significant advancements in Text-to-Video (T2V) models, particularly highlighting the recent success of Sora2 and questioning whether T2V models have achieved true "world model" capabilities [1]
- A new evaluation framework called VideoVerse has been proposed to assess T2V models on their understanding of event causality, physical laws, and common sense, which are essential for a "world model" [1][3]

Evaluation Framework
- VideoVerse aims to evaluate T2V models along two main perspectives: dynamic aspects (event following, mechanics, interaction, material properties, camera control) and static aspects (natural constraints, common sense, attribution correctness, 2D layout, 3D depth) [3]
- Each prompt corresponds to several binary evaluation questions, with event following measured through sequence consistency using the Longest Common Subsequence (LCS) (a scoring sketch follows this summary) [4][16]

Prompt Construction
- The team employs a multi-stage process to ensure the authenticity, diversity, and evaluability of prompts, sourcing data from daily life, scientific experiments, and science fiction [8][9]
- Event and causal structures are extracted using advanced language models to convert natural language descriptions into event-level structures, laying the groundwork for evaluating "event following" [10][11]

Evaluation Methodology
- The evaluation combines QA and LCS scoring, focusing on event following, dimension-specific questions, and overall scoring that reflects both logical sequence and physical details [5][18]
- The introduction of hidden semantics aims to assess whether models can generate implicit consequences that are not explicitly stated in prompts [20][22]

Experimental Findings
- The team evaluated various open-source and closed-source models, finding that open-source models perform comparably in basic dimensions but lag significantly in world-model capabilities [28]
- Even the strongest closed-source model, Sora2, shows notable deficiencies in "hidden semantics following" and certain physical/material inferences [29]

Conclusion and Future Directions
- VideoVerse provides a comprehensive evaluation framework aimed at shifting the focus from merely generating realistic visuals to understanding and simulating the world [40]
- The team has open-sourced data, evaluation code, and a leaderboard, encouraging further research to enhance world-model capabilities [41]
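An LCS-based event-following score of the kind described above can be illustrated with a few lines of code. The normalization (dividing by the number of prompt events) and the example events are assumptions for illustration; the exact VideoVerse scoring may differ.

```python
def lcs_length(a, b):
    """Classic dynamic-programming longest common subsequence length."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]

def event_following_score(prompt_events, detected_events):
    """Fraction of the prompt's event sequence preserved, in order, in the video."""
    if not prompt_events:
        return 1.0
    return lcs_length(prompt_events, detected_events) / len(prompt_events)

# Hypothetical example: the prompt describes four events; an evaluator detects
# three of them in the generated video, in the right order.
prompt = ["pick up cup", "pour water", "place cup", "wipe table"]
detected = ["pick up cup", "place cup", "wipe table"]
print(event_following_score(prompt, detected))   # 0.75
```

Because LCS only rewards events that appear in the prompt's order, a video that shows all events but scrambles them scores lower than one that follows the described sequence.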
Tesla's World Simulator Debuts at ICCV; VP Personally Explains the End-to-End Autonomous Driving Technology Roadmap
36Kr· 2025-10-27 08:11
Core Insights
- Tesla has unveiled a world simulator for generating realistic driving scenarios, presented by Ashok Elluswamy at the ICCV conference, emphasizing that the future of intelligent driving lies in end-to-end AI [1][5][24]

Group 1: World Simulator Features
- The world simulator can create new challenging scenarios for autonomous driving tasks, such as vehicles suddenly changing lanes or AI navigating around pedestrians and obstacles [2]
- The generated scenario videos serve dual purposes: training autonomous driving models and providing a gaming experience for human users [2][4]

Group 2: End-to-End AI Approach
- Elluswamy highlighted that end-to-end AI is the future of autonomous driving, utilizing data from various sensors to generate control commands for vehicles (a minimal policy sketch follows this summary) [5][8]
- The end-to-end approach is contrasted with modular systems, which are easier to develop initially but lack the optimization and scalability of end-to-end systems [8][10]

Group 3: Challenges and Solutions
- One major challenge for end-to-end autonomous driving is evaluation, which the world simulator addresses by using a vast dataset to synthesize future states based on current conditions [11]
- The complexity of real-world data, such as high frame rates and multiple sensor inputs, leads to a "curse of dimensionality," which Tesla mitigates by collecting extensive driving data to enhance model generalization [13][15]

Group 4: Industry Perspectives
- The industry is divided between two main approaches to end-to-end autonomous driving: VLA (Vision-Language-Action) and world models, with various companies adopting different strategies [24]
- Tesla's choice of the end-to-end approach has garnered attention due to its historical success in the autonomous driving space, raising questions about the future direction of the technology [24]
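The end-to-end idea can be reduced to a single learned mapping from raw sensor frames to control commands, with no hand-built perception, prediction, or planning modules in between. The toy network below illustrates that interface only; the architecture, camera count, and output convention are assumptions, not Tesla's design.

```python
import torch
import torch.nn as nn

class EndToEndDrivingPolicy(nn.Module):
    """Toy end-to-end policy: multi-camera frames in, steering/throttle/brake out."""

    def __init__(self, n_cameras=8):
        super().__init__()
        self.encoder = nn.Sequential(          # shared per-camera image encoder
            nn.Conv2d(3, 32, 5, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Sequential(             # fuse camera features -> control commands
            nn.Linear(64 * n_cameras, 256), nn.ReLU(),
            nn.Linear(256, 3),                 # steering, throttle, brake
        )

    def forward(self, frames):
        # frames: (batch, n_cameras, 3, H, W)
        b = frames.shape[0]
        feats = self.encoder(frames.flatten(0, 1)).view(b, -1)
        return self.head(feats)

policy = EndToEndDrivingPolicy()
controls = policy(torch.randn(1, 8, 3, 128, 128))   # -> (1, 3)
print(controls.shape)
```

Because the whole pipeline is one differentiable function, a training signal on the final control output can adjust every stage jointly, which is the optimization advantage the summary contrasts with modular stacks.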
Meituan Officially Releases and Open-Sources LongCat-Video, Boosting Video Inference Speed 10.1x
Zheng Quan Ri Bao Wang· 2025-10-27 08:06
Core Insights
- The LongCat team of Meituan has released and open-sourced the LongCat-Video video generation model, achieving state-of-the-art (SOTA) performance in the foundational tasks of text-to-video and image-to-video generation, with significant advantages in long video generation [1][2]
- The model is seen as a crucial step towards building "world models," which are essential for the next generation of artificial intelligence, allowing AI to understand and simulate the real world [1]

Technical Features
- LongCat-Video is based on a Diffusion Transformer architecture and supports three core tasks: text-to-video without conditional frames, image-to-video with one reference frame, and video continuation using multiple preceding frames, creating a complete task loop (see the conditioning sketch after this summary) [2]
- The model can generate stable 5-minute long videos without quality loss, addressing industry pain points such as color drift and motion discontinuity, ensuring temporal consistency and physical motion realism [2]
- LongCat-Video employs a three-tier optimization strategy (C2F, BSA, and model distillation) to enhance video inference speed by 10.1 times, achieving an optimal balance between efficiency and quality [2]

Performance Evaluation
- The model evaluation includes both internal and public benchmark tests, covering text-to-video and image-to-video tasks, with a focus on multiple dimensions such as text alignment, image alignment, visual quality, motion quality, and overall quality [3]
- LongCat-Video, with 13.6 billion parameters, has achieved SOTA performance in the open-source domain for both text-to-video and image-to-video tasks, demonstrating significant advantages in key metrics like text alignment and motion coherence [3]
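The "0 / 1 / many conditioning frames" framing of the three tasks can be illustrated with a single helper that only varies how many frames are fed in as conditions. The helper, its signature, and the choice of k = 16 preceding frames are hypothetical, for illustration only.

```python
import torch

def build_condition(task, frames=None):
    """Pick the conditioning frames for one unified generation call.
    text-to-video      -> no frames
    image-to-video     -> the single reference image as one frame
    video continuation -> the last k frames of an existing clip
    """
    if task == "text_to_video":
        return None
    if task == "image_to_video":
        assert frames is not None and frames.shape[0] >= 1
        return frames[:1]            # one reference frame
    if task == "video_continuation":
        assert frames is not None and frames.shape[0] >= 1
        return frames[-16:]          # last k preceding frames (k=16 is arbitrary here)
    raise ValueError(f"unknown task: {task}")

clip = torch.randn(48, 3, 90, 160)                        # a short existing clip (T, C, H, W)
print(build_condition("text_to_video"))                   # None
print(build_condition("image_to_video", clip).shape)      # torch.Size([1, 3, 90, 160])
print(build_condition("video_continuation", clip).shape)  # torch.Size([16, 3, 90, 160])
```

Unifying the tasks this way means one set of model weights covers all three modes, and long videos can be produced by repeatedly feeding the model's own output back in as the continuation condition.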
Musk's "World Simulator" Revealed for the First Time: Distilling 500 Years of Human Driving Experience in a Day, With Optimus Evolving on the Same Brain
36Kr· 2025-10-27 07:34
Core Insights
- Tesla has unveiled its "World Simulator," a neural network that ingests 500 years of human driving experience daily to evolve in an infinite virtual environment; the same simulator can also serve its humanoid robot, Optimus [1][2][3]

Group 1: Technology Overview
- The "World Simulator" generates various driving scenarios, including rare situations like pedestrians crossing the road and vehicles cutting in, allowing AI to simulate and test responses in a controlled environment [2][3]
- Tesla employs an "end-to-end" neural network for autonomous driving, processing raw data from multiple cameras and other inputs to directly generate driving commands without separate modules for perception, prediction, and planning [6][7][9]
- This approach allows the AI to learn human-like decision-making and reduces information loss between modules, enhancing overall system performance [13][16]

Group 2: Data Utilization
- Tesla's fleet generates a vast amount of data, equivalent to 500 years of human driving experience daily, which is filtered to extract high-quality learning samples for the AI [25][27]
- The AI's ability to generalize to complex scenarios, such as predicting vehicle behavior in adverse weather, is attributed to exposure to diverse driving conditions [30]

Group 3: Simulation Capabilities
- The "World Simulator" can evaluate new AI models in a closed-loop environment, recreate real-world dangerous scenarios for testing, and generate extreme situations to challenge the AI's limits (a closed-loop rollout sketch follows this summary) [46]
- This simulator serves as a foundational AI engine that extends beyond automotive applications, also being applicable to Tesla's humanoid robot project, Optimus [47][48]
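Closed-loop evaluation, as described above, means the learned simulator predicts the next observation from the current one plus the policy's action, and the loop feeds that prediction straight back to the policy. The sketch below shows only that loop structure; both models are placeholders, not anything Tesla has published.

```python
def closed_loop_rollout(world_model, policy, first_obs, horizon=100):
    """Evaluate a driving policy entirely inside a learned simulator.
    `world_model(obs, action)` returns (next_obs, done); `policy(obs)` returns an action."""
    obs, trajectory = first_obs, []
    for _ in range(horizon):
        action = policy(obs)                      # policy reacts to the simulated scene
        obs, done = world_model(obs, action)      # simulator imagines the consequence
        trajectory.append((obs, action))
        if done:                                  # e.g. collision or route completed
            break
    return trajectory

# Hypothetical toy stand-ins: a 1-D "position" world and a constant-speed policy.
toy_world = lambda obs, act: (obs + act, obs + act > 50)
toy_policy = lambda obs: 1.0
print(len(closed_loop_rollout(toy_world, toy_policy, first_obs=0.0)))   # 51
```

The same loop can replay a recorded dangerous scene or deliberately perturb it, which is how a simulator turns rare real-world events into repeatable tests.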
Meituan Releases the LongCat-Video Generation Model: Capable of Outputting 5-Minute Videos
Feng Huang Wang· 2025-10-27 07:32
Core Insights
- Meituan officially announced the release of the LongCat-Video video generation model, which is based on the Diffusion Transformer architecture and supports three core tasks: text-to-video, image-to-video, and video continuation [1]

Model Features
- LongCat-Video can generate high-definition videos at 720p resolution and 30 frames per second, with the ability to create coherent video content lasting up to 5 minutes [1]
- The model addresses common issues in long video generation, such as frame breaks and quality degradation, by maintaining temporal consistency and motion rationality through video continuation pre-training and block sparse attention mechanisms [1]

Efficiency and Performance
- The model employs two-stage generation, block sparse attention, and model distillation techniques, reportedly achieving over a 10x improvement in inference speed (a distillation sketch follows this summary) [1]
- With a parameter count of 13.6 billion, LongCat-Video has demonstrated strong performance in text alignment and motion continuity in public tests like VBench [1]

Future Applications
- As part of the effort to build a "world model," LongCat-Video may find applications in scenarios requiring long-term sequence modeling, such as autonomous driving simulations and embodied intelligence [1]
- The release of this model marks a significant advancement for Meituan in the fields of video generation and physical world simulation [1]
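Of the three speedup levers listed above, model distillation is the least intuitive, so here is a minimal sketch of step distillation: the student is trained to land, in one call, where the teacher lands after several small denoising steps. The models, signatures, and loss are illustrative assumptions, not LongCat-Video's actual recipe.

```python
import torch
import torch.nn as nn

def distillation_step(teacher, student, optimizer, noisy_latent, t, prompt_emb,
                      teacher_substeps=4):
    """One training step of simplified step distillation."""
    with torch.no_grad():
        target = noisy_latent
        for i in range(teacher_substeps):         # teacher: several fine-grained steps
            target = teacher(target, t - i, prompt_emb)

    pred = student(noisy_latent, t, prompt_emb)   # student: one big step
    loss = nn.functional.mse_loss(pred, target)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy stand-ins so the sketch runs: tiny denoisers over a flat latent.
class ToyDenoiser(nn.Module):
    def __init__(self, dim=32):
        super().__init__()
        self.net = nn.Linear(dim, dim)
    def forward(self, x, t, prompt_emb):
        return self.net(x)            # t and prompt_emb ignored in this toy stand-in

teacher, student = ToyDenoiser(), ToyDenoiser()
opt = torch.optim.Adam(student.parameters(), lr=1e-4)
print(distillation_step(teacher, student, opt, torch.randn(4, 32), t=10, prompt_emb=None))
```

At inference time only the student runs, so the sampling step count, and with it the latency, drops roughly by the distillation factor.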
Meituan's Video Generation Model Arrives! Open-Source SOTA Right Out of the Gate
量子位· 2025-10-27 05:37
Core Viewpoint
- Meituan has launched an open-source video model named LongCat-Video, which supports text-to-video and image-to-video generation, showcasing significant advancements in video generation technology [1][39].

Group 1: Model Features
- LongCat-Video has 13.6 billion parameters and can generate videos lasting up to five minutes, demonstrating a strong understanding of real-world physics and semantics [1][12][39].
- The model excels at generating 720p, 30fps videos with high semantic understanding and visual presentation capabilities, ranking among the best open-source models [18][62].
- It can maintain consistency in generated videos, addressing challenges such as detail capture and complex lighting effects [19][24].

Group 2: Technical Innovations
- LongCat-Video integrates three main tasks: text-to-video, image-to-video, and video continuation, using a Diffusion Transformer framework [41].
- The model employs a training approach that directly pre-trains on video continuation tasks, mitigating cumulative errors in long video generation [46][48].
- It utilizes techniques like block sparse attention and a coarse-to-fine generation paradigm to enhance video generation efficiency (a block-sparse mask sketch follows this summary) [52][53].

Group 3: Performance Evaluation
- In internal benchmarks, LongCat-Video outperformed models like PixVerse-V5 and Wan2.2-T2V-A14B in overall quality, with strong performance in visual quality and motion quality [62][63].
- The model achieved a top score in common-sense dimensions, indicating its superior ability to model the physical world [64].

Group 4: Broader Context
- This is not Meituan's first venture into AI; the company has previously released various models, including LongCat-Flash-Chat and LongCat-Flash-Thinking, showcasing its commitment to AI innovation [65][68].
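Block sparse attention, mentioned above, restricts each block of query tokens to a small set of key blocks so that attention cost scales with the kept fraction rather than with the full quadratic map. The mask below keeps only each block's neighbourhood; real systems typically select blocks by content relevance, so treat this purely as an illustration.

```python
import torch

def block_sparse_mask(n_tokens, block=64, neighbours=2):
    """Boolean attention mask where each query block attends only to itself
    and its nearest neighbouring key blocks."""
    n_blocks = (n_tokens + block - 1) // block
    block_mask = torch.zeros(n_blocks, n_blocks, dtype=torch.bool)
    for q in range(n_blocks):
        for k in range(max(0, q - neighbours), min(n_blocks, q + neighbours + 1)):
            block_mask[q, k] = True
    # expand the block-level mask to token level
    mask = block_mask.repeat_interleave(block, 0).repeat_interleave(block, 1)
    return mask[:n_tokens, :n_tokens]

m = block_sparse_mask(512)
print(m.shape, m.float().mean().item())   # kept fraction well below 1.0
```

For long videos the token count grows with the number of frames, so pruning whole blocks of attention is one of the main levers behind the efficiency gains the summary describes.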
Tesla's World Simulator Debuts at ICCV! VP Personally Explains the End-to-End Autonomous Driving Technology Roadmap
量子位· 2025-10-27 05:37
Core Viewpoint
- Tesla has unveiled a world simulator for autonomous driving, showcasing its potential to generate realistic driving scenarios and enhance the training of AI models for self-driving technology [1][4][12].

Group 1: World Simulator Features
- The simulator can create new challenging scenarios for autonomous driving tasks, such as unexpected lane changes by other vehicles [4][5].
- It allows AI to perform driving tasks in existing scenarios, avoiding pedestrians and obstacles [7][9].
- The generated scenario videos can also serve as a gaming experience for human users [9].

Group 2: End-to-End AI Approach
- Tesla's VP Ashok Elluswamy emphasized that end-to-end AI is the future of autonomous driving, applicable not only to driving but also to other intelligent scenarios like the Tesla Optimus robot [12][13][14].
- The end-to-end neural network utilizes data from various sensors to generate control commands for the vehicle, contrasting with modular systems that are easier to develop initially but less effective in the long run [17].
- The end-to-end approach allows for better optimization and handling of complex driving situations, such as navigating around obstacles [18][21].

Group 3: Challenges and Solutions
- One major challenge for end-to-end autonomous driving is evaluation, which Tesla addresses with its world simulator trained on a vast dataset [22][24].
- The simulator can also facilitate large-scale reinforcement learning, potentially surpassing human performance [24].
- Other challenges include the "curse of dimensionality," interpretability, and safety guarantees, which require processing vast amounts of data [26][27][28].

Group 4: Data Utilization
- Tesla collects data equivalent to 500 years of driving every day, using a complex data engine to filter high-quality samples for training (see the back-of-envelope check after this summary) [29][30].
- This extensive data collection enhances the model's ability to generalize to extreme situations [30].

Group 5: Technical Approaches in the Industry
- The industry is divided between two main approaches: VLA (Vision-Language-Action) and world models, with companies like Huawei and NIO representing the latter [38][39].
- VLA proponents argue it leverages existing internet data for better understanding, while world-model advocates believe their approach addresses the core issues of autonomous driving [41][42].
- Tesla's approach is closely watched due to its historical success in selecting effective strategies in autonomous driving development [43][44].
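One way to read the "500 years of driving per day" figure is as wall-clock driving time accumulated across the fleet. The arithmetic below checks that reading under purely hypothetical fleet assumptions; neither number is a Tesla-reported statistic.

```python
# Back-of-envelope check of the "500 years of driving per day" figure,
# under hypothetical fleet assumptions (not Tesla-reported numbers).
fleet_vehicles = 5_000_000        # assumed vehicles reporting data
driving_hours_per_vehicle = 0.9   # assumed average hours driven per vehicle per day
hours_per_year_of_driving = 24 * 365

fleet_hours_per_day = fleet_vehicles * driving_hours_per_vehicle
years_equivalent = fleet_hours_per_day / hours_per_year_of_driving
print(f"~{years_equivalent:.0f} years of driving collected per day")  # ~514
```

Even under modest per-vehicle usage, a fleet of that scale plausibly reaches the order of magnitude quoted, which is why the summaries emphasize data filtering rather than data volume as the bottleneck.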