Meituan's LongCat Team Releases and Open-Sources the LongCat-Video Video Generation Model
Xin Lang Cai Jing· 2025-10-27 05:24
Core Insights
- The LongCat team from Meituan has released and open-sourced the LongCat-Video video generation model, achieving state-of-the-art (SOTA) performance in foundational tasks for text-to-video and image-to-video generation [1]
- The model enables coherent generation of minute-long videos, ensuring temporal consistency across frames and physically realistic motion, a significant advantage in the long video generation field [1]
- The release is seen as a first step toward exploring "world models," with future applications anticipated in autonomous driving and embodied intelligence, strengthening the connection between the "digital world" and the "physical world" [1]
Meituan Open-Sources LongCat-Video for Efficient Long-Video Generation, Taking the First Step Toward "World Models"
Jing Ji Guan Cha Wang· 2025-10-27 04:01
Core Insights
- Meituan has taken a significant step toward developing a "world model" by launching the LongCat-Video video generation model, aiming to better connect the "atomic world" and the "bit world" [1][2]

Group 1: LongCat-Video Model Features
- The LongCat-Video model is based on the Diffusion Transformer (DiT) architecture and supports three core tasks: text-to-video, image-to-video, and video continuation, forming a complete task loop without additional model adaptation [5]
- The model can generate coherent long videos of up to 5 minutes without quality loss, addressing industry pain points such as color drift and motion discontinuity while ensuring temporal consistency and physically plausible motion [5][6]
- LongCat-Video achieves a 10.1x improvement in video inference speed through a three-tier optimization approach, balancing efficiency and quality [6]

Group 2: Performance and Evaluation
- LongCat-Video reaches state-of-the-art (SOTA) performance among open-source video generation models, with a comprehensive evaluation covering text alignment, image alignment, visual quality, motion quality, and overall quality [5][9]
- The model has 13.6 billion parameters and shows clear advantages on key metrics such as text-video alignment and motion continuity, performing strongly on public benchmarks like VBench [9]
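The video-continuation task described above is what makes minute-scale generation possible: a fixed-length generator is called repeatedly, each call conditioned on the tail frames of the previous segment. A toy sketch of that chaining loop, where `generate_segment` is a made-up stand-in for the real diffusion-model call (the overlap size and frame representation are illustrative assumptions, not LongCat-Video's actual design):

```python
# Toy sketch of segment-by-segment video continuation. `generate_segment`
# is a hypothetical stand-in for a real diffusion model call; frames are
# represented as plain integers so the chaining logic is easy to follow.

def generate_segment(condition_frames, length=16):
    """Stand-in generator: each new frame extends the last conditioning frame."""
    last = condition_frames[-1] if condition_frames else 0
    return [last + i + 1 for i in range(length)]

def generate_long_video(total_frames=64, segment_len=16, overlap=4):
    """Chain fixed-length segments, conditioning each on the previous tail."""
    video = generate_segment([], length=segment_len)
    while len(video) < total_frames:
        tail = video[-overlap:]  # conditioning context from the last segment
        video.extend(generate_segment(tail, length=segment_len))
    return video[:total_frames]

video = generate_long_video()
print(len(video))  # 64
```

In a real system the conditioning tail is what has to carry temporal consistency across segment boundaries; the color drift and motion discontinuity mentioned above are exactly the failure modes of this hand-off.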
Video Inference Speed Improved 10.1x! Meituan Officially Releases and Open-Sources LongCat-Video
Xin Lang Ke Ji· 2025-10-27 02:36
Core Insights
- Meituan's LongCat team has released and open-sourced the LongCat-Video model, achieving state-of-the-art (SOTA) performance in video generation from text and images [1]
- The model enables coherent generation of minute-long videos, ensuring temporal consistency across frames and physically realistic motion, marking a significant advance in long video generation [1]
- The "world model" concept is highlighted as a key engine for next-generation AI, allowing systems to understand, predict, and reconstruct the real world [1]

Group 1
- The LongCat-Video model is seen as a crucial step toward exploring "world models," which can model physical laws, spatiotemporal evolution, and scene logic [1]
- Video generation models are positioned as a key pathway toward building world models, compressing diverse forms of knowledge such as geometry, semantics, and physics [1]
- The LongCat model is expected to feed into Meituan's ongoing investments in autonomous driving and embodied intelligence, strengthening the connection between the digital and physical worlds [1]
A Close Reading of the DeepSeek OCR Paper: Glimpsing the Outline of a "World Model" from Afar
TMTPost APP· 2025-10-27 02:34
Core Insights
- DeepSeek OCR is a notable OCR model but is considered overhyped compared to leading models in the field [1]
- Its performance on specific tasks, such as mathematical formula recognition and table structure identification, is subpar compared to smaller models like PaddleOCR-VL [2][5]
- DeepSeek's approach to visual token compression is innovative, aiming to explore the boundaries of visual-text compression [14][15]

Model Performance Comparison
- DeepSeek OCR has 3 billion parameters and reports an accuracy of 86.46%; at a compression ratio of 10-12x it maintains around 90% accuracy [10][14]
- In contrast, PaddleOCR-VL, with only 0.9 billion parameters, outperforms DeepSeek on specific tasks [2][5]
- Other models such as MinerU2.5 and dots.ocr also post higher metrics on various tasks [2]

Innovation and Research Direction
- DeepSeek emphasizes a biologically inspired forgetting mechanism for compression, keeping recent context at high resolution while progressively blurring older context [12][11]
- The research indicates that optical context compression is not only technically feasible but also biologically plausible, offering a new perspective on long-context modeling [14][15]
- The findings suggest a shift in focus from language-based to visual-based models, potentially leading to breakthroughs in AI research [20][22]

Industry Context
- DeepSeek represents a unique case in the Chinese tech landscape, combining a romantic idealism about technology with practical applications, diverging from typical profit-driven models [6]
- The company is seen as a rare entity that prioritizes exploration of advanced technologies over immediate commercial success [6]
- Insights from DeepSeek's research could redefine how AI systems process information, moving toward a more visual-centric approach [20][21]
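The forgetting mechanism described above can be sketched as a token-budget schedule: recent context chunks keep a full visual-token budget, while older chunks are re-rendered at progressively lower resolution and therefore fewer tokens. All numbers, function names, and the halving schedule here are illustrative assumptions, not DeepSeek's actual design:

```python
# Hypothetical sketch of "optical context compression" with forgetting:
# the token budget of a context chunk is halved as it ages, mimicking
# high-resolution recent memory and blurred older memory.

def tokens_for_age(n_tokens, age, halve_every=2):
    """Older chunks get their token budget halved every `halve_every` steps."""
    return max(1, n_tokens // (2 ** (age // halve_every)))

def compress_context(chunks, base_tokens=256):
    """chunks[0] is oldest; return the per-chunk token budget."""
    newest = len(chunks) - 1
    return [tokens_for_age(base_tokens, newest - i) for i in range(len(chunks))]

budgets = compress_context(["c0", "c1", "c2", "c3", "c4", "c5"])
print(budgets)  # oldest chunks get the smallest budgets
```

The point of the sketch is the shape of the schedule, not the numbers: total token count grows much more slowly than context length, which is the claimed appeal for long-context modeling.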
LeCun Blasts the Biggest Scam in Robotics, Admits Llama Has Nothing to Do with Him
36Kr· 2025-10-26 09:22
Core Insights
- Yann LeCun's core argument is that the humanoid robotics industry lacks a clear path to general intelligence, and that breakthroughs in AI are needed to create truly intelligent robots capable of understanding and interacting with the physical world [1][21]

Group 1: Challenges in Humanoid Robotics
- LeCun asserts that current humanoid robots are limited to narrow tasks and cannot perform complex household activities, highlighting a significant gap between narrow and general intelligence [1]
- Developing a "world model" architecture is crucial for enabling robots to learn, understand, and predict physical systems, and remains a major open challenge for the industry [1][21]
- Many companies in the humanoid robotics space reportedly do not know how to make their robots intelligent enough for practical applications, which could jeopardize their future valuations [21]

Group 2: Industry Reactions
- Tesla's Optimus AI lead, Julian Ibarz, publicly disagrees with LeCun, indicating that Tesla has a clear strategy for achieving general humanoid robotics [1]
- Brett Adcock, CEO of Figure AI, challenges LeCun to engage more practically in the field, expressing confidence that Figure's humanoid robot will be able to perform tasks in unfamiliar environments by next year [3][23]
- The industry is divided: some leaders advocate aggressive timelines while others, like LeCun, stress the need for foundational advances in AI [22][23]

Group 3: The Concept of World Models
- LeCun defines a "world model" as a system that predicts the outcome of an action given the current state of the environment, which is essential for planning and executing tasks [15][18]
- He argues that current reliance on large language models (LLMs) is insufficient for human-level intelligence, since they are trained mainly on low-bandwidth data such as text [15][16]
- World models could allow robots to learn from simulated or real-world data without extensive task-specific retraining, marking a shift toward self-supervised learning [18][19]

Group 4: Future Directions
- LeCun predicts that within the next 3-5 years, world models will become a mainstream component of AI architectures, fundamentally changing the approach to humanoid robotics [20]
- Companies like 1X Technologies are aligning their research with LeCun's vision of world models, signaling a potential industry shift toward more practical and effective AI solutions [33]
- Competition in humanoid robotics may ultimately favor those who solve machine understanding of the physical world, rather than those who merely produce impressive demonstrations [37]
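LeCun's definition above, predicting the outcome of an action from the current state, can be sketched as a minimal interface plus a planner that searches over candidate actions. Everything in this sketch (the 1-D dynamics, the greedy planner, all names) is an illustrative assumption, not LeCun's proposed architecture:

```python
# Minimal hypothetical world-model interface: predict(state, action) -> state,
# then plan by choosing actions whose *predicted* outcomes approach the goal.

def predict(state, action):
    """Toy dynamics: the state is a 1-D position, the action a step."""
    return state + action

def plan(state, goal, actions=(-1, 0, 1), horizon=4):
    """Greedy planning with the world model: at each step, pick the action
    whose predicted outcome lands closest to the goal, then commit to it."""
    trajectory = []
    for _ in range(horizon):
        best = min(actions, key=lambda a: abs(goal - predict(state, a)))
        state = predict(state, best)
        trajectory.append(best)
    return trajectory, state

traj, final = plan(state=0, goal=3)
print(traj, final)  # [1, 1, 1, 0] 3
```

The key property, and the hard research problem the article points at, is that planning never needed a task-specific policy: only the predictive model and a goal.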
From World Models to VLA to Reinforcement Learning: How Embodied "Brain and Cerebellum" Algorithms Actually Work
具身智能之心· 2025-10-26 04:02
Core Insights
- The article discusses the evolution and current state of embodied intelligence, focusing on the "brain and cerebellum" division of labor in robotics: the brain handles perception and planning, while the cerebellum is responsible for execution [3][10]

Technical Evolution
- Embodied intelligence has progressed through several stages: from grasp pose detection, to behavior cloning, and now to diffusion policies and VLA models [7][10]
- The first stage focused on static object grasping with limited decision-making capability [7]
- The second stage introduced behavior cloning, allowing robots to learn from expert demonstrations, but faced challenges in generalization and error accumulation [8]
- The third stage, marked by the introduction of diffusion policies, improved stability and generalization by modeling whole action sequences [8]
- The fourth stage, emerging in 2025, explores integrating VLA models with reinforcement learning and world models to enhance robots' predictive and interactive capabilities [9][10]

Current Trends and Applications
- Combining VLA with reinforcement learning enhances robots' trial-and-error learning and self-improvement, while combining it with world models enables future prediction and better planning [10]
- Demand for embodied intelligence is growing across industrial, home, restaurant, and medical rehabilitation settings, driving job opportunities and research interest in the field [10]

Educational Initiatives
- The article outlines a structured learning program aimed at providing comprehensive knowledge of embodied intelligence algorithms, including practical applications and real-world projects [11][14]
- The course targets individuals with a foundational understanding of embodied intelligence and aims to bridge the gap between theoretical knowledge and practical deployment [18][24]
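The error-accumulation problem attributed to behavior cloning above can be shown with a toy rollout: a cloned policy that is slightly wrong at every step drifts steadily away from the expert trajectory, and the drift grows with rollout length. The numbers and policies here are purely illustrative:

```python
# Toy illustration of error accumulation in behavior cloning: a small,
# constant per-step imitation error adds up over a long rollout.

def expert_policy(state):
    return 1.0  # the expert steps forward by exactly 1 each time

def cloned_policy(state, bias=0.05):
    return 1.0 + bias  # the imitation is slightly off at every step

def rollout(policy, steps=20):
    state = 0.0
    for _ in range(steps):
        state += policy(state)
    return state

drift = abs(rollout(cloned_policy) - rollout(expert_policy))
print(round(drift, 2))  # 1.0 after 20 steps: per-step errors add up
```

Modeling whole action sequences, as diffusion policies do per the stage-three description above, is one way to reduce how often this per-step hand-off happens.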
Tesla Finally Shares Something: Its World Model and Closed-Loop Evaluation Are Frighteningly Strong......
自动驾驶之心· 2025-10-25 16:03
Core Insights
- Tesla has shared insights into its architecture, emphasizing a large model and extensive data, which allow fixed computation time and high-frequency actions in its Full Self-Driving (FSD) system [5][6]

Group 1: Reasons for the End-to-End Approach
- The complexity of human driving behavior makes it difficult to define a single evaluation function, creating challenges for rule-based optimization [8]
- Interface definitions between perception, prediction, and planning are problematic and cause information loss [8]
- An end-to-end approach scales better and is better suited to long-tail problems [8]
- Fixed computation time from neural networks reduces latency compared with traditional pipelines [8]
- Philosophically, reliance on computational power and data is preferred over human experience [8]

Group 2: Challenges of End-to-End Systems
- The three main challenges are evaluation, the curse of dimensionality, and ensuring interpretability and safety [19][20]
- The curse of dimensionality leads to insufficient supervisory signal when mapping from high-dimensional inputs to low-dimensional outputs [21]
- Interpretability and safety are crucial: the model must genuinely understand driving behavior rather than fit shortcuts [23]

Group 3: Evaluation Challenges
- High-quality datasets cannot describe performance through loss metrics alone, indicating a need for more comprehensive evaluation methods [39]
- Open-loop evaluation cannot replace closed-loop assessment, underscoring the need for real-world testing [39]
- Driving behavior is multimodal, requiring evaluation metrics that cover diverse driving actions [39]
- One proposed method predicts the consequences of actions, potentially using a critic to score model performance [39]
- Balancing the evaluation dataset is essential for accurate assessment [39]

Group 4: World Model Simulator
- Tesla introduced a world-model simulator that generates subsequent video from real scenarios, a capability with a high barrier to entry [41]
- The simulator allows replaying past failure cases to verify improvements, akin to two-stage simulation [44]
- The same technology can be applied to humanoid robots, enabling reinforcement training and simulation [46]
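The replay idea above, re-running a logged failure case under a new policy inside a learned simulator, can be sketched with toy stand-ins. Every function here is a made-up placeholder (none of this is Tesla's actual system): a one-number "scene" plays the role of the generated video, and a threshold check plays the role of a critic:

```python
# Hypothetical sketch of closed-loop replay evaluation: roll a logged
# scenario forward under the new policy inside a stand-in "world model",
# and pass only if no collision occurs.

def world_model_step(scene, action):
    """Toy dynamics: the gap to the lead vehicle shrinks by (speed - action)."""
    scene = dict(scene)
    scene["gap"] -= scene["speed"] - action
    return scene

def policy(scene):
    """New policy under test: match the lead's speed once the gap gets small."""
    return scene["speed"] if scene["gap"] < 5 else scene["speed"] - 1

def closed_loop_eval(scene, steps=10):
    """Closed loop: the policy's own actions feed the next simulator state."""
    for _ in range(steps):
        scene = world_model_step(scene, policy(scene))
        if scene["gap"] <= 0:
            return False  # collision: the replayed failure is not fixed
    return True

print(closed_loop_eval({"gap": 8, "speed": 3}))  # True
```

The contrast with open-loop evaluation is visible in the loop body: the policy's actions change the states it is subsequently evaluated on, which is exactly what per-frame loss metrics cannot capture.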
VLA / World Model / WA / End-to-End: A Divergence in Marketing, Not in Technical Route
理想TOP2· 2025-10-25 05:21
Core Viewpoints
- Many people are unaware that there is no universally accepted definition of VLA, world model, or end-to-end [1]
- Leading autonomous driving companies share more commonalities in their exploration than the differences portrayed online; the core divergence is promotional, not a divergence in technical route [1][2]
- Language plays a significant role in autonomous driving, particularly for long reasoning, user-interaction value alignment, and understanding the world [1]
- Those who believe that predicting the next token is more than a probability distribution are more likely to accept that language can understand the world [1]

Group 1: VLA / World Model / End-to-End
- VLA, world models, and end-to-end all require the ability to generate realistic-looking road video, take visual information as input, and ultimately control vehicle actions [2]
- The distinction lies in whether language is involved, how deeply, and in what architectural form; future language-related tokens may be an LLM's text tokens or photon tokens [2]
- The narrative that VLA and world models are different technical routes is misleading, since both need to build a world model and understand the physical world [4]

Group 2: End-to-End Definitions
- The definition of end-to-end is often debated, with some holding that it requires a core framework with clearly defined input and output [5]
- Tesla's approach of taking visual input and outputting a trajectory, rather than direct control signals, raises questions about the true nature of its end-to-end definition [5][6]
- Outputting precise trajectories is preferred over direct control signals, suggesting it is the more effective design [6]

Group 3: Tesla's Approach and Future Directions
- Given Tesla's history and style, its end-to-end definition may not have a universally accepted exclusivity [7]
- Long-term predictions indicate that AI model inputs and outputs may predominantly involve photons, which could significantly reduce computational load [10]
- The ideal VLA model is defined as having visual or multimodal input, language participation, and ultimately directing actions in a broad sense [11]

Group 4: Understanding Language and AI Potential
- There are fundamental differences in views on LLMs, particularly over what predicting the next token really entails [12]
- Those who see next-token prediction as more than mere statistics are more inclined to recognize the potential of LLMs and AI [12][19]
- Effectively predicting the next token implies an understanding of the underlying reality that generates the token, a deeper question than it appears [18]
CVPR 2026 Countdown, Day 21: Aiming at This Direction Is an Overwhelming Advantage!
自动驾驶之心· 2025-10-24 16:03
Core Viewpoint
- The article emphasizes the importance of targeted guidance and mentorship for students aiming to publish high-quality research papers at top conferences like CVPR and ICRA, highlighting the need for strategic focus in the final stages of the submission process [2][3]

Group 1: Submission Insights
- Submission volume for CVPR 2026 has already exceeded 2,000, indicating a competitive landscape similar to ICLR [1]
- Historically, successful submissions focus on specific breakthroughs and verifiable improvements rather than broad themes, aligning closely with the conference's main topics [1]
- The anticipated main theme for CVPR 2026 is likely to revolve around "world models," suggesting a strategic direction for potential submissions [1]

Group 2: Mentorship and Guidance
- The organization offers specialized mentorship programs to help students navigate the complexities of research-paper writing and submission, particularly in autonomous driving and AI [2][3]
- With over 300 dedicated instructors from top global universities, it provides a wealth of academic resources and expertise to help students produce high-quality research [3]
- The mentorship program includes personalized guidance through the entire research process, from topic selection to submission, preparing students for the demands of top-tier conferences [11]

Group 3: Student Support and Outcomes
- The organization addresses common challenges faced by students, such as lack of guidance, fragmented knowledge, and difficulty understanding the research process [5]
- Students are encouraged to build a systematic understanding of both classic and cutting-edge algorithms, strengthening their practical skills and research capability [5]
- Successful participants may receive recommendations from prestigious institutions and direct job placements at leading tech companies, underscoring the program's potential impact on academic and professional trajectories [16]
自动驾驶之心 Is Recruiting Partners!
自动驾驶之心· 2025-10-24 16:03
Group 1
- The article announces the recruitment of 10 outstanding partners for the autonomous driving sector, focusing on course development, paper guidance, and hardware research [2]
- Areas of expertise sought include large models, multimodal models, diffusion models, end-to-end systems, embodied interaction, joint prediction, SLAM, 3D object detection, world models, closed-loop simulation, and model deployment and quantization [3]
- Candidates from QS200 universities with a master's degree or higher are preferred, especially those with significant contributions to top conferences [4]

Group 2
- Compensation includes resource sharing for job seeking, doctoral study, and overseas study recommendations, along with substantial cash incentives and opportunities for entrepreneurial project collaboration [5]
- Interested parties are encouraged to add WeChat for consultation, specifying "organization/company + autonomous driving cooperation inquiry" [6]