World Models
Meituan's LongCat-Video Officially Released and Open-Sourced, Video Inference Speed Boosted to 10.1x
Securities Daily Online· 2025-10-27 08:06
Core Insights
- Meituan's LongCat team has released and open-sourced the LongCat-Video video generation model, achieving state-of-the-art (SOTA) performance in the foundational tasks of text-to-video and image-to-video generation, with significant advantages in long video generation [1][2]
- The model is seen as a crucial step towards building "world models," which are essential for the next generation of artificial intelligence, allowing AI to understand and simulate the real world [1]

Technical Features
- LongCat-Video is based on a Diffusion Transformer architecture and supports three core tasks: text-to-video without conditional frames, image-to-video with one reference frame, and video continuation using multiple preceding frames, forming a complete task loop [2]
- The model can generate stable 5-minute long videos without quality loss, addressing industry pain points such as color drift and motion discontinuity and ensuring temporal consistency and physically realistic motion [2]
- LongCat-Video employs a three-tier optimization strategy (coarse-to-fine generation, block sparse attention, and model distillation) to boost video inference speed by a factor of 10.1, striking a balance between efficiency and quality [2]

Performance Evaluation
- The evaluation covers both internal and public benchmark tests across text-to-video and image-to-video tasks, measuring multiple dimensions such as text alignment, image alignment, visual quality, motion quality, and overall quality [3]
- With 13.6 billion parameters, LongCat-Video achieves SOTA performance among open-source models on both text-to-video and image-to-video tasks, with significant advantages in key metrics such as text alignment and motion coherence [3]
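The block sparse attention mentioned above can be illustrated with a minimal sketch: each block of queries attends only to the few key blocks it scores highest against, instead of the full sequence. The block size, selection rule, and top-k count below are illustrative assumptions, not LongCat-Video's actual configuration.

```python
import numpy as np

def block_sparse_attention(q, k, v, block=4, keep=2):
    """Toy block sparse attention: each query block attends only to the
    `keep` key blocks with the highest block-level similarity, rather
    than to every key. Shapes: q, k, v are (seq_len, dim), with seq_len
    divisible by `block`."""
    n, d = q.shape
    nb = n // block
    out = np.zeros_like(v)
    # Block-level similarity: mean query block vs. mean key block.
    qb = q.reshape(nb, block, d).mean(axis=1)   # (nb, d)
    kb = k.reshape(nb, block, d).mean(axis=1)   # (nb, d)
    scores = qb @ kb.T                          # (nb, nb)
    for i in range(nb):
        sel = np.argsort(scores[i])[-keep:]     # keep only the top blocks
        ks = np.concatenate([k[j * block:(j + 1) * block] for j in sel])
        vs = np.concatenate([v[j * block:(j + 1) * block] for j in sel])
        att = q[i * block:(i + 1) * block] @ ks.T / np.sqrt(d)
        att = np.exp(att - att.max(axis=1, keepdims=True))
        att /= att.sum(axis=1, keepdims=True)   # softmax over kept keys
        out[i * block:(i + 1) * block] = att @ vs
    return out

rng = np.random.default_rng(0)
q = rng.normal(size=(16, 8))
k = rng.normal(size=(16, 8))
v = rng.normal(size=(16, 8))
y = block_sparse_attention(q, k, v)
```

With `keep=2` of 4 blocks, each query touches only half the keys, which is the source of the speedup: attention cost scales with the number of kept blocks rather than the full sequence length.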
Musk's "World Simulator" Unveiled for the First Time: Distilling 500 Years of Human Driving Experience Per Day, with Optimus Evolving on the Same Brain
36Kr· 2025-10-27 07:34
Core Insights
- Tesla has unveiled its "World Simulator," a neural network that ingests the equivalent of 500 years of human driving experience daily to evolve in an infinite virtual environment, and which can also be utilized by its humanoid robot, Optimus [1][2][3]

Group 1: Technology Overview
- The "World Simulator" generates various driving scenarios, including rare situations like pedestrians crossing the road and vehicles cutting in, allowing the AI to simulate and test responses in a controlled environment [2][3]
- Tesla employs an "end-to-end" neural network for autonomous driving, processing raw data from multiple cameras and other inputs to directly generate driving commands without separate modules for perception, prediction, and planning [6][7][9]
- This approach allows the AI to learn human-like decision-making and reduces information loss between modules, enhancing overall system performance [13][16]

Group 2: Data Utilization
- Tesla's fleet generates a vast amount of data, equivalent to 500 years of human driving experience daily, which is filtered to extract high-quality learning samples for the AI [25][27]
- The AI's ability to generalize to complex scenarios, such as predicting vehicle behavior in adverse weather, is attributed to exposure to diverse driving conditions [30]

Group 3: Simulation Capabilities
- The "World Simulator" can evaluate new AI models in a closed-loop environment, recreate real-world dangerous scenarios for testing, and generate extreme situations to probe the AI's limits [46]
- The simulator serves as a foundational AI engine that extends beyond automotive applications, also applying to Tesla's humanoid robot project, Optimus [47][48]
Meituan Releases the LongCat-Video Video Generation Model: Capable of Outputting 5-Minute Long Videos
ifeng.com· 2025-10-27 07:32
Core Insights
- Meituan officially announced the release of the LongCat-Video video generation model, which is based on the Diffusion Transformer architecture and supports three core tasks: text-to-video, image-to-video, and video continuation [1]

Model Features
- LongCat-Video can generate high-definition videos at 720p resolution and 30 frames per second, producing coherent video content lasting up to 5 minutes [1]
- The model addresses common issues in long video generation, such as frame breaks and quality degradation, maintaining temporal consistency and motion plausibility through video-continuation pre-training and block sparse attention mechanisms [1]

Efficiency and Performance
- The model employs two-stage generation, block sparse attention, and model distillation, reportedly achieving over a 10x improvement in inference speed [1]
- With a parameter count of 13.6 billion, LongCat-Video has demonstrated strong performance in text alignment and motion continuity on public benchmarks such as VBench [1]

Future Applications
- As part of the effort to build a "world model," LongCat-Video may find applications in scenarios requiring long-horizon sequence modeling, such as autonomous driving simulation and embodied intelligence [1]
- The release marks a significant advance for Meituan in video generation and physical-world simulation [1]
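The scale implied by the numbers above is easy to check: at 30 frames per second, a 5-minute clip is 9,000 frames that must all stay temporally consistent, which is why frame breaks and drift are the pain points these techniques target.

```python
# Frame count a 5-minute, 30 fps generation must keep coherent.
minutes = 5
fps = 30
frames = minutes * 60 * fps
print(frames)  # 9000
```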
Meituan's Video Generation Model Is Here! Open-Source SOTA on Its First Attempt
QbitAI· 2025-10-27 05:37
Core Viewpoint
- Meituan has launched an open-source video model named LongCat-Video, which supports text-to-video and image-to-video generation, marking a significant advance in video generation technology [1][39]

Group 1: Model Features
- LongCat-Video has 13.6 billion parameters and can generate videos lasting up to five minutes, demonstrating a strong understanding of real-world physics and semantics [1][12][39]
- The model excels at generating 720p, 30fps videos with strong semantic understanding and visual presentation, ranking among the best open-source models [18][62]
- It maintains consistency in generated videos, handling challenges such as detail capture and complex lighting effects [19][24]

Group 2: Technical Innovations
- LongCat-Video integrates three main tasks: text-to-video, image-to-video, and video continuation, using a Diffusion Transformer framework [41]
- The model employs a distinctive training approach that pre-trains directly on video continuation tasks, mitigating cumulative errors in long video generation [46][48]
- It uses advanced techniques such as block sparse attention and a coarse-to-fine generation paradigm to improve video generation efficiency [52][53]

Group 3: Performance Evaluation
- In internal benchmarks, LongCat-Video outperformed models such as PixVerse-V5 and Wan2.2-T2V-A14B in overall quality, with strong performance in visual quality and motion quality [62][63]
- The model achieved a top score on common-sense dimensions, indicating a superior ability to model the physical world [64]

Group 4: Broader Context
- This is not Meituan's first venture into AI; the company has previously released various models, including LongCat-Flash-Chat and LongCat-Flash-Thinking, underscoring its commitment to AI innovation [65][68]
Tesla's World Simulator Debuts at ICCV! VP Personally Explains the End-to-End Autonomous Driving Technical Roadmap
QbitAI· 2025-10-27 05:37
Core Viewpoint
- Tesla has unveiled a world simulator for autonomous driving, showcasing its ability to generate realistic driving scenarios and enhance the training of AI models for self-driving technology [1][4][12]

Group 1: World Simulator Features
- The simulator can create new challenging scenarios for autonomous driving tasks, such as unexpected lane changes by other vehicles [4][5]
- It allows the AI to perform driving tasks in existing scenarios while avoiding pedestrians and obstacles [7][9]
- The generated scenario videos can also serve as a gaming experience for human users [9]

Group 2: End-to-End AI Approach
- Tesla VP Ashok Elluswamy emphasized that end-to-end AI is the future of autonomous driving, applicable not only to driving but also to other intelligent scenarios such as the Tesla Optimus robot [12][13][14]
- The end-to-end neural network uses data from various sensors to generate control commands for the vehicle, in contrast to modular systems, which are easier to develop initially but less effective in the long run [17]
- The end-to-end approach allows better global optimization and handling of complex driving situations, such as navigating around obstacles [18][21]

Group 3: Challenges and Solutions
- A major challenge for end-to-end autonomous driving is evaluation, which Tesla addresses with its world simulator, trained on a vast dataset [22][24]
- The simulator can also support large-scale reinforcement learning, potentially surpassing human performance [24]
- Other challenges include the "curse of dimensionality," interpretability, and safety guarantees, all of which require processing vast amounts of data [26][27][28]

Group 4: Data Utilization
- Tesla collects data equivalent to 500 years of driving every day, using a complex data engine to filter high-quality samples for training [29][30]
- This extensive data collection strengthens the model's ability to generalize to extreme situations [30]

Group 5: Technical Approaches in the Industry
- The industry is split between two main approaches: VLA (Vision-Language-Action) models and world models, with companies like Huawei and NIO representing the latter [38][39]
- VLA proponents argue that it leverages existing internet data for better understanding, while world-model advocates believe it addresses the core problems of autonomous driving [41][42]
- Tesla's choice is closely watched given its track record of selecting effective strategies in autonomous driving development [43][44]
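The "end-to-end" idea described above can be sketched in a few lines: raw sensor tensors map to control commands through a single trainable function, with no hand-written perception, prediction, or planning modules in between. The network shape (one linear layer) and the command format here are illustrative assumptions for the sketch, not Tesla's actual architecture.

```python
import numpy as np

def end_to_end_policy(camera_frames, weights):
    """Toy end-to-end driver: flatten raw pixels from all cameras and map
    them directly to (steering, throttle) through one linear layer.
    A real system would use a deep network; the point is the single
    differentiable path from sensors to controls, which lets training
    optimize the whole pipeline at once instead of module by module."""
    x = np.concatenate([f.ravel() for f in camera_frames])
    steering, throttle = np.tanh(weights @ x)  # tanh bounds both commands
    return steering, throttle

rng = np.random.default_rng(1)
frames = [rng.random((4, 4)) for _ in range(3)]  # 3 tiny "camera" images
w = rng.normal(scale=0.1, size=(2, 3 * 16))      # 2 outputs, 48 pixel inputs
steer, thr = end_to_end_policy(frames, w)
```

The contrast with a modular stack is that errors here backpropagate from the control output all the way to the pixels, so nothing is lost at hand-designed interfaces between modules.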
Meituan's LongCat Team Releases and Open-Sources the LongCat-Video Video Generation Model
Sina Finance· 2025-10-27 05:24
Core Insights
- The LongCat team from Meituan has released and open-sourced the LongCat-Video video generation model, achieving state-of-the-art (SOTA) performance in foundational tasks for text-to-video and image-to-video generation [1]
- The model enables coherent generation of minute-scale long videos, ensuring temporal consistency across frames and physically realistic motion, a significant advantage in the long video generation field [1]
- The release is seen as a first step towards exploring "world models," with future applications anticipated in autonomous driving and embodied intelligence, strengthening the connection between the "digital world" and the "physical world" [1]
Meituan Open-Sources LongCat-Video, Supporting Efficient Long Video Generation and Taking the First Step Towards Exploring "World Models"
Economic Observer Online· 2025-10-27 04:01
Core Insights
- Meituan has taken a significant step towards developing a "world model" by launching the LongCat-Video video generation model, aiming to better connect the "atomic world" and the "bit world" [1][2]

Group 1: LongCat-Video Model Features
- The LongCat-Video model is based on the Diffusion Transformer (DiT) architecture and supports three core tasks: text-to-video, image-to-video, and video continuation, forming a complete task loop without additional model adaptation [5]
- The model can generate coherent long videos of up to 5 minutes without quality loss, addressing industry pain points such as color drift and motion discontinuity and ensuring temporal consistency and physically plausible motion [5][6]
- LongCat-Video achieves a 10.1x video inference speedup through a three-tier optimization approach, balancing efficiency and quality [6]

Group 2: Performance and Evaluation
- LongCat-Video has reached state-of-the-art (SOTA) performance in open-source video generation tasks, with a comprehensive evaluation covering text alignment, image alignment, visual quality, motion quality, and overall quality [5][9]
- The model has 13.6 billion parameters and shows significant advantages in key metrics such as text-video alignment and motion continuity, performing exceptionally well on public benchmarks such as VBench [9]
Video Inference Speed Boosted to 10.1x! Meituan's LongCat-Video Officially Released and Open-Sourced
Sina Tech· 2025-10-27 02:36
Core Insights
- Meituan's LongCat team has released and open-sourced the LongCat-Video model, achieving state-of-the-art (SOTA) performance in text- and image-conditioned video generation tasks [1]
- The model enables coherent generation of minute-long videos, ensuring temporal consistency across frames and physically realistic motion, marking a significant advance in long video generation [1]
- The concept of the "world model" is highlighted as a key engine for next-generation AI, allowing systems to understand, predict, and reconstruct the real world [1]

Group 1
- The LongCat-Video model is seen as a crucial step towards exploring "world models," which can capture physical laws, spatiotemporal evolution, and scene logic [1]
- Video generation models are positioned as a key pathway for building world models, compressing multiple forms of knowledge such as geometry, semantics, and physics [1]
- The LongCat model is expected to feed into Meituan's ongoing investments in autonomous driving and embodied intelligence, strengthening the connection between the digital and physical worlds [1]
A Close Reading of the DeepSeek OCR Paper: From a Distance, I Can See the Outline of a "World Model"
TMTPost· 2025-10-27 02:34
Core Insights
- DeepSeek OCR is a notable OCR model but is considered overhyped relative to the field's leading models [1]
- Its performance on specific tasks, such as mathematical formula recognition and table structure identification, lags behind smaller models like PaddleOCR-VL [2][5]
- DeepSeek's approach to visual token compression is innovative, aiming to explore the boundaries of visual-text compression [14][15]

Model Performance Comparison
- DeepSeek OCR has 3 billion parameters and maintains roughly 90% accuracy (86.46% reported) at a compression ratio of 10-12x [10][14]
- By contrast, PaddleOCR-VL, with only 0.9 billion parameters, outperforms DeepSeek OCR on specific tasks [2][5]
- Other models such as MinerU2.5 and dots.ocr also post higher scores across various tasks [2]

Innovation and Research Direction
- DeepSeek emphasizes a biologically inspired forgetting mechanism for compression, keeping recent context at high resolution while progressively blurring older context [11][12]
- The research indicates that optical context compression is not only technically feasible but also biologically plausible, offering a new perspective on long-context modeling [14][15]
- The findings suggest a shift in focus from language-based to vision-based models, potentially leading to breakthroughs in AI research [20][22]

Industry Context
- DeepSeek represents a distinctive case in the Chinese tech landscape, combining a romantic idealism about technology with practical applications, diverging from typical profit-driven models [6]
- The company is seen as a rare entity that prioritizes exploration of advanced technologies over immediate commercial success [6]
- The insights from DeepSeek's research could redefine how AI systems process information, moving towards a more vision-centric approach [20][21]
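The compression figures above are easy to restate as a token budget: at the reported ~10x optical compression ratio, a page that would otherwise cost on the order of 1,000 text tokens is represented with about 100 vision tokens while retaining roughly 90% accuracy. The 1,000-token page size below is an illustrative assumption, not a figure from the paper.

```python
# Token budget under the reported ~10x optical compression ratio.
text_tokens = 1000   # illustrative page size, in text tokens
ratio = 10           # compression ratio reported at ~90% accuracy
vision_tokens = text_tokens // ratio
print(vision_tokens)  # 100
```

This is the trade the paper probes: how far the vision-token budget can shrink before text recovery accuracy collapses.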
LeCun Angrily Exposes the Biggest Scam in Robotics, Admits Llama Has Nothing to Do with Him
36Kr· 2025-10-26 09:22
Core Insights
- Yann LeCun's core argument is that the humanoid robotics industry lacks a clear path to general intelligence, and that breakthroughs in AI are needed to create truly intelligent robots capable of understanding and interacting with the physical world [1][21]

Group 1: Challenges in Humanoid Robotics
- LeCun asserts that current humanoid robots are limited to narrow tasks and cannot perform complex household activities, highlighting a significant gap between narrow intelligence and general intelligence [1]
- Developing a "world model" architecture is crucial for enabling robots to learn, understand, and predict physical systems, and remains a major open challenge in the industry [1][21]
- Many companies in the humanoid robotics space are reportedly unsure how to make their robots intelligent enough for practical applications, which could jeopardize their future valuations [21]

Group 2: Industry Reactions
- Tesla's Optimus AI lead, Julian Ibarz, publicly disagrees with LeCun's views, saying Tesla has a clear strategy for achieving general humanoid robotics [1]
- Brett Adcock, CEO of Figure AI, challenged LeCun to engage more practically in the field, expressing confidence that Figure's humanoid robot will be able to perform tasks in unfamiliar environments by next year [3][23]
- The industry is divided, with some leaders advocating aggressive timelines while others, like LeCun, emphasize the need for foundational advances in AI [22][23]

Group 3: The Concept of World Models
- LeCun defines a "world model" as a system that can predict the outcomes of actions given the current state of the environment, which is essential for planning and executing tasks [15][18]
- He argues that the current reliance on large language models (LLMs) is insufficient for achieving human-level intelligence, as they draw primarily on low-bandwidth data sources like text [15][16]
- World models could allow robots to learn from simulated or real-world data without extensive retraining for specific tasks, marking a shift towards self-supervised learning [18][19]

Group 4: Future Directions
- LeCun predicts that within the next 3-5 years, world models will become a mainstream component of AI architectures, fundamentally changing the approach to humanoid robotics [20]
- Companies like 1X Technologies are aligning their research with LeCun's vision of world models, indicating a potential industry shift towards more practical and effective AI solutions [33]
- Competition in humanoid robotics may ultimately favor those who can solve machine understanding of the physical world, rather than those who merely produce impressive demonstrations [37]
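LeCun's definition of a world model, predicting the next state given the current state and an action, can be written as a transition function, and planning then amounts to rolling that function forward over candidate action sequences. The 1-D position/velocity dynamics below are a toy chosen purely for illustration, not any learned model.

```python
def world_model(state, action):
    """Toy world model: next_state = f(state, action).
    Illustrative 1-D dynamics: position and velocity evolving under an
    acceleration command (not any real robot's learned transition model)."""
    pos, vel = state
    new_vel = vel + action        # action is an acceleration step
    new_pos = pos + new_vel
    return (new_pos, new_vel)

def plan(state, actions):
    """Planning in LeCun's sense: evaluate an action sequence by rolling
    the world model forward, without acting in the real world."""
    for a in actions:
        state = world_model(state, a)
    return state

# Accelerate, coast, brake: the model predicts where that sequence ends up.
final = plan((0.0, 0.0), [1.0, 0.0, -1.0])
print(final)  # (2.0, 0.0)
```

A planner would compare such rollouts for many candidate sequences and execute the one whose predicted outcome best matches the goal, which is why a good predictive model removes the need for task-specific retraining.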