World Models
World Model == VQA? Robots Don't Need to Imagine Images; Predicting Semantics Is Enough
机器之心· 2025-10-28 00:41
Core Insights
- The article discusses the necessity of precise future predictions in world models for AI, questioning whether detailed visual representations are essential for decision-making [1][6]
- It introduces the concept of the Semantic World Model (SWM), which focuses on predicting semantic information about future outcomes rather than generating visual frames [9][18]

Summary by Sections

World Models and Their Limitations
- World models enable AI to learn the dynamics of the world and predict future events based on current states [6]
- Traditional models often generate realistic images but may miss critical semantic details necessary for decision-making [7][8]

Semantic World Model (SWM)
- SWM reframes world modeling as a visual question-answering (VQA) problem, focusing on task-relevant interactions rather than raw visual data [8][9]
- SWM uses a vision-language model (VLM) to answer questions about future actions and their semantic effects [9][11]

Training and Data Generation
- SWM can be trained on low-quality sequence data, including both expert and non-expert data, making it versatile [15]
- A dataset called SAQA (State-Action-Question-Answer) is generated to train the model effectively [22]

Experimental Results
- SWM demonstrated high accuracy in answering questions about future outcomes and showed generalization in new scenarios [17]
- In multi-task simulations, SWM significantly outperformed baseline models, achieving success rates of 81.6% on LangTable and 76% on OGBench [30][34]

Generalization and Robustness
- SWM retains the generalization capabilities of the underlying VLM, improving performance even with new object combinations and background changes [39][41]
- The model's attention mechanisms focus on task-relevant information, indicating its ability to generalize across different scenarios [41]
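To make the VQA framing concrete, here is a minimal sketch of how a State-Action-Question-Answer (SAQA) tuple might be formatted as a prompt for a vision-language model. The field names and prompt layout are assumptions for illustration; the article does not specify SWM's actual schema.

```python
from dataclasses import dataclass

@dataclass
class SAQAExample:
    state: str     # textual description of the current scene
    action: str    # candidate robot action
    question: str  # binary question about the action's semantic outcome
    answer: str    # "yes" / "no" supervision label

def build_vqa_prompt(ex: SAQAExample) -> str:
    """Format a SAQA tuple as a VQA-style prompt for a vision-language model."""
    return (
        f"Scene: {ex.state}\n"
        f"Action: {ex.action}\n"
        f"Question: {ex.question}\n"
        "Answer (yes/no):"
    )

ex = SAQAExample(
    state="a red block sits on the table, left of a blue bowl",
    action="push the red block 10 cm to the right",
    question="will the red block end up inside the blue bowl?",
    answer="yes",
)
print(build_vqa_prompt(ex))
```

The key design point the article highlights is that the supervision target is a short semantic answer, not a rendered future frame, which is why non-expert trajectories can still yield valid training tuples.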
Zheng Zhihua apologizes for his "crawling" remark; Spring Airlines recruits married mothers as flight attendants; Zhu Lidan, Zong Fuli's confidante, departs; Anhui becomes the top auto-producing province; a Changan Automobile 4S store catches fire | Bang Morning Report
创业邦· 2025-10-28 00:10
Group 1
- Zhu Lidan, the legal representative of Hongsheng Group controlled by Zong Fuli, has left the company, with her office now assigned to Kou Jing [3][4]
- Zhu Lidan has been a core member of Hongsheng Group and has had a long-standing working relationship with Zong Fuli [3]
- There are reports that Zhu Lidan was summoned by relevant authorities twice since September, and her position was previously marked as "to be determined" [4]

Group 2
- Changan Automobile confirmed a fire incident at a 4S store in Anhui, but no information on the cause of the fire has been provided [6]
- Meituan announced a nationwide rollout of pension insurance subsidies for delivery riders starting in November, marking the first such scheme available to all riders [12][13]
- Spring Airlines has launched a recruitment campaign for "air sisters," targeting married women with children and raising the age limit to 40 [13]

Group 3
- JD.com has been granted an insurance brokerage license in Hong Kong, marking its entry into the city's financial market [13]
- Tesla's board chair warned that if Elon Musk's $1 trillion compensation plan is not approved, the company may face significant value loss [13]
- High-profile education company Gaotu is under investigation for allegedly organizing illegal offline subject training in Beijing [13]

Group 4
- Amazon plans to invest over €1.4 billion in the Netherlands over the next three years, its largest investment commitment since entering that market [14]
- Porsche responded to reports of multiple gasoline-vehicle discontinuations, clarifying that the fuel version of the Macan is not affected by the changes [15]
- AI startup Mercor raised $350 million at a $10 billion valuation, with participation from notable investors [15][16]

Group 5
- Global mobile game in-app purchase revenue is expected to grow 6% to $85.4 billion in 2025 [20]
- China is projected to generate over 400 million discarded mobile phones annually, with low recycling prices and privacy concerns hindering recovery efforts [20]
- Anhui has become the top province in automobile production, with 15 provinces expected to produce over one million vehicles this year [20]
What Directions Can Autonomous Driving Still Pursue at This Year's CVPR?
自动驾驶之心· 2025-10-28 00:03
Core Viewpoint
- The article emphasizes the importance of targeted guidance and mentorship for students aiming to publish high-quality papers at top conferences such as CVPR and ICLR, highlighting the need for strategic effort in the final stages of the submission process [1][2][4]

Group 1: Submission Guidance
- The majority of accepted papers at past conferences focus on localized breakthroughs and verifiable improvements, aligning closely with each year's main themes [1]
- The article suggests that the main theme for CVPR 2026 is likely to be "world models," indicating a strategic direction for potential submissions [1]
- Students are encouraged to leverage the experience of predecessors to improve submission quality, particularly in the final stages of preparation [2]

Group 2: Mentorship and Support
- The organization, "Automated Driving Heart," is described as the largest AI technology media platform in China, with extensive academic resources and a deep understanding of the challenges in interdisciplinary fields such as autonomous driving and robotics [3]
- The article highlights a 96% acceptance rate for students in its mentorship program over the past three years, indicating the effectiveness of its guidance [5]
- Personalized support includes assistance with research thinking, familiarization with research processes, and practical application of theoretical models [7][13]

Group 3: Program Structure and Offerings
- Structured support includes personalized paper guidance, real-time interaction with mentors, and unlimited access to recorded sessions for review [13]
- The program caters to various academic levels and goals, from foundational courses for beginners to advanced mentorship for experienced researchers [17][19]
- Outstanding students may receive recommendations to prestigious institutions and direct referrals to leading tech companies [19]
TeraSim World: Rebuilding a "Tesla-Style" World Model the Open-Source Way
自动驾驶之心· 2025-10-28 00:03
Core Viewpoint
- Tesla has showcased its internal World Model, a neural-network-driven virtual world generator that synthesizes high-resolution videos from eight camera perspectives based on vehicle states and control inputs, enabling real-time environmental prediction and closed-loop validation [2][6]

Group 1: Tesla's World Model
- Tesla's World Model allows historical problem scenarios to be replayed and new adversarial events to be injected in a virtual environment for testing and reinforcement learning [2]
- The model learns a general mapping of "perception-action-world change," making it applicable to other platforms such as robotics and forming a basis for general physical intelligence [2]

Group 2: TeraSim World Framework
- A research team from the University of Michigan, SaferDrive AI, the University of Hong Kong, and Tsinghua University has developed TeraSim World, an open-source framework that achieves generation and evaluation capabilities similar to Tesla's World Model without requiring real maps or sensor backgrounds [5][6]
- TeraSim World automatically generates city environments and traffic behaviors using AI, creating a fully data-driven, reproducible, and scalable world-model platform [5]

Group 3: System Features
- TeraSim World features a modular, fully automated data-synthesis pipeline for generating realistic and safety-critical data for end-to-end autonomous driving [7]
- The system retrieves real-world road maps and converts them into simulation-ready formats, allowing digital maps to be generated automatically from user input [10][11]
- It can simulate realistic traffic conditions by automatically obtaining real-time traffic data, thereby reflecting local traffic patterns [13]

Group 4: Agent and Sensor Simulation
- The agent-simulation component enables virtual vehicles, pedestrians, and cyclists to behave like their real-world counterparts, incorporating human driving characteristics [16]
- TeraSim World introduces safety-critical scenarios based on real-world accident probabilities, ensuring the generated events are both risky and realistic [17]
- The sensor-simulation component generates realistic camera inputs and can be extended to other sensor types, using NVIDIA's open-source Cosmos models for high-resolution, time-synchronized multi-view video generation [19][22][25]

Group 5: Automated Stress Testing
- TeraSim World supports automated full-stack stress testing, generating and validating diverse risk scenarios to assess the stability and safety boundaries of autonomous driving systems [30]
- The framework can inject dynamic and static risks, such as sudden stops or environmental changes, to evaluate system responses under diverse conditions [30]

Group 6: Conclusion and Future Plans
- TeraSim World combines agent and sensor simulation to provide a complete data-generation pipeline for training and testing autonomous driving systems without real-world data collection [31]
- The system aims to build a large-scale synthetic driving dataset and expand to multi-modal sensor simulation, establishing an open virtual testing ground for researchers and developers [32]
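The idea of injecting safety-critical scenarios "based on real-world accident probabilities" can be sketched as weighted sampling over adversarial event types. The event names and weights below are hypothetical placeholders; TeraSim World derives its weighting from real accident statistics, which the article does not enumerate.

```python
import random

# Hypothetical adversarial events with illustrative crash-frequency weights.
CRASH_WEIGHTS = {
    "lead_vehicle_hard_brake": 0.40,
    "pedestrian_jaywalk": 0.25,
    "cyclist_swerve": 0.20,
    "vehicle_run_red_light": 0.15,
}

def sample_adversarial_event(rng: random.Random) -> str:
    """Draw one safety-critical event, weighted by assumed crash frequency."""
    events = list(CRASH_WEIGHTS)
    weights = [CRASH_WEIGHTS[e] for e in events]
    return rng.choices(events, weights=weights, k=1)[0]

rng = random.Random(0)
counts = {e: 0 for e in CRASH_WEIGHTS}
for _ in range(10_000):
    counts[sample_adversarial_event(rng)] += 1
# Over many draws, event frequencies roughly track the weights.
print(counts)
```

Weighting by empirical frequency is what keeps the generated stress tests "both risky and realistic": rare but plausible events still appear, while physically implausible ones are never drawn.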
Efficiency Law: A New World-Model-Engine-Driven Learning Paradigm for Embodied Intelligence
具身智能之心· 2025-10-28 00:02
Core Insights
- The article emphasizes the importance of addressing data generation issues in embodied intelligence, arguing that previously overlooked data problems are fundamental to the technology's successful deployment [2][5]

Group 1: Efficiency Law and Scaling Law
- The article introduces the "Efficiency Law," derived from the limitations of the "Scaling Law" in embodied intelligence. The Efficiency Law posits that the performance of embodied models is significantly influenced by the rate of high-quality data generation (r_D) within a limited timeframe [5][6]
- A higher data generation rate (r_D) can enhance learning efficiency, while a lower rate leads to a "data scarcity zone" that hinders model performance [6][20]

Group 2: World Models and Physical Accuracy
- World models require absolute physical accuracy, since embodied intelligence relies on understanding real-world physics to execute actions effectively; models must adhere to physical laws to ensure reliable learning and decision-making [9][12]
- Current video-based world models are criticized for lacking physical correctness, as they primarily pursue visual realism rather than accurately simulating physical dynamics [8][12]

Group 3: GS-World and Its Applications
- The GS-World model is presented as a novel approach that integrates generative models with physical simulation engines, enabling the generation of physically accurate environments and interactions; it addresses the shortcomings of traditional video-based models [11][13]
- GS-World is positioned as a transformative engine for embodied intelligence, enabling autonomous generation of training data and high-fidelity strategy validation in simulated environments [15][20]

Group 4: Engine-Driven Learning Paradigm
- The article outlines a shift from data-driven to engine-driven learning paradigms in embodied intelligence, where the GS-World engine allows continuous interaction and feedback, fostering a self-evolving learning system [24][25]
- The new paradigm emphasizes generating and simulating physical worlds, enabling agents to learn and adapt through real-time interaction rather than relying solely on historical data [24][28]

Group 5: Robustness and Generalization
- Embodied intelligence systems must achieve product-level success rates and robustness against environmental disturbances; the engine-driven learning paradigm is deemed essential for building reliable and trustworthy intelligent products [27][29]
- GS-World is described as a critical platform for evolving robotic skills, allowing skills to emerge naturally through interaction within a physically accurate simulated environment [31][32]
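The Efficiency Law's qualitative claim can be illustrated with a toy model: within a fixed training budget T, the usable data scales with the generation rate r_D, and performance saturates with data volume. The exponential saturation form and the constants here are assumptions chosen only to visualize the "data scarcity zone"; the article does not give a closed-form formula.

```python
import math

def performance(r_D: float, T: float = 100.0, k: float = 0.01) -> float:
    """Toy model: performance after collecting r_D * T high-quality samples.

    Uses an assumed saturating curve 1 - exp(-k * data); low r_D leaves the
    model deep in the 'data scarcity zone' no matter how capable it is.
    """
    data = r_D * T
    return 1.0 - math.exp(-k * data)

for r in (0.1, 1.0, 10.0):
    print(f"r_D={r:5.1f} -> performance={performance(r):.3f}")
```

Under this sketch, a 10x increase in the data generation rate moves the model from near-zero capability toward the saturation plateau, which is the Efficiency Law's argument for investing in world-model engines that raise r_D.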
Releasing and Open-Sourcing a Video Generation Model, Meituan Quietly Advances on the AI Track
Bei Jing Shang Bao· 2025-10-27 12:33
Core Insights
- Meituan is advancing in the large-model sector while facing fierce competition in the food-delivery market, recently releasing and open-sourcing the LongCat-Video model, which can stably generate long videos of up to 5 minutes [2][4]
- The company has made significant progress in large models, having released three major models since September, including LongCat-Flash-Chat and LongCat-Flash-Thinking, both achieving state-of-the-art (SOTA) performance on various tasks [3][8]
- Meituan's strategic shift from "Food + Platform" to "Retail + Technology" makes AI, robotics, and autonomous driving core future directions, integrating these technologies into its business operations [7][8]

Model Developments
- The LongCat-Flash-Chat model features a mixture-of-experts architecture with 560 billion parameters, optimizing both computational efficiency and performance [3]
- LongCat-Flash-Thinking has achieved SOTA on reasoning tasks across multiple domains, showcasing the company's commitment to advancing AI capabilities [3]
- LongCat-Video is designed for coherent long-video generation, demonstrating significant advantages in video generation tasks compared to competitors [4][5]

Industry Perspective
- Industry peers have mixed reactions to Meituan's advances in video generation, with some expressing skepticism about the significance of achieving SOTA in a largely closed-source field [5][6]
- The LongCat models are seen as a response to Meituan's internal content needs and potential applications in embodied intelligence [5][6]

Strategic Vision
- Meituan's LongCat team views the video generation model as a step toward exploring "world models," aiming to bridge the digital and physical worlds through advanced AI technologies [7]
- The company's AI strategy includes enhancing employee efficiency, transforming existing products with AI, and developing proprietary large models, with internal API usage rising from 10% to 68% [8]
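The reason a 560-billion-parameter mixture-of-experts (MoE) model can stay computationally efficient is that a router activates only a few experts per token, so most parameters sit idle on any given forward pass. The expert count, k, and scoring below are illustrative; the article does not describe LongCat-Flash-Chat's actual routing scheme.

```python
from typing import List

def top_k_experts(router_logits: List[float], k: int = 2) -> List[int]:
    """Return the indices of the k highest-scoring experts for one token."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    return sorted(ranked[:k])

# One token's router scores over 8 hypothetical experts.
logits = [0.1, 2.3, -0.5, 1.7, 0.0, 0.9, -1.2, 0.4]
active = top_k_experts(logits, k=2)
print(active)  # → [1, 3]: only 2 of 8 experts run for this token
```

With k = 2 of 8 experts firing, only a quarter of the expert parameters contribute to each token's compute, which is how total parameter count and per-token cost decouple in MoE designs.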
Meituan Releases and Open-Sources a Video Generation Model: Some Metrics Rival Google's Most Advanced Model, Veo3
Guan Cha Zhe Wang· 2025-10-27 10:52
Core Insights
- Meituan's LongCat team has released and open-sourced the LongCat-Video model, achieving state-of-the-art (SOTA) performance in text- and image-based video generation tasks [1][3]

Group 1: Model Features
- LongCat-Video can generate coherent videos up to 5 minutes long, addressing common issues such as frame drift and color inconsistency found in other models [3][6]
- The model supports 720p resolution at 30 frames per second, using mechanisms such as video-continuation pre-training and block sparse attention to maintain temporal consistency and visual stability [6][9]
- LongCat-Video's inference speed has been improved 10.1x through a combination of two-stage coarse-to-fine generation, block sparse attention, and model distillation [6][8]

Group 2: Evaluation and Performance
- In internal evaluations, LongCat-Video was assessed on text alignment, visual quality, motion quality, and overall performance, with a correlation of 0.92 between human and automated evaluations [8][12]
- The model's visual-quality score is nearly on par with Google's Veo3, and it surpasses models such as PixVerse-V5 and Wan2.2 in overall quality [8][12]
- LongCat-Video scored 70.94% in commonsense understanding, ranking first among open-source models, with an overall score of 62.11%, trailing only proprietary models such as Veo3 and Vidu Q1 [12]

Group 3: Future Applications
- The release of LongCat-Video is a significant step for Meituan toward building "world models," which are essential for simulating physical laws and scene logic in AI [3][13]
- Future applications may include autonomous-driving simulation and embodied intelligence, where long-sequence modeling is crucial [13]
Meituan Open-Sources Its First Large Video Model, With a 900% Speed Boost
36Ke· 2025-10-27 09:13
Core Insights
- Meituan has launched its first video generation model, LongCat-Video, designed for multi-task video generation and supporting text-to-video, image-to-video, and video-continuation capabilities [1][2]
- LongCat-Video addresses the challenge of generating long videos, natively supporting outputs of up to 5 minutes while maintaining high temporal consistency and visual stability [1]
- The model significantly improves inference efficiency, achieving a speed increase of over 900% by employing a two-stage generation strategy and block sparse attention mechanisms [1][10][13]

Model Features
- LongCat-Video uses a unified task framework that handles three types of video generation tasks within a single model, reducing complexity and enhancing performance [9][10]
- The model architecture is based on a Diffusion Transformer structure, integrating diffusion-model capabilities with the advantages of long-sequence modeling [7]
- A three-stage training process progressively learns from low- to high-resolution video tasks and incorporates reinforcement learning to optimize performance across diverse tasks [9][10]

Performance Evaluation
- In the VBench public benchmark, LongCat-Video ranked second overall, taking first place in "commonsense understanding" at 70.94% and outperforming several closed-source models [2][20]
- The model performs strongly in visual quality and motion fluidity, though there is room for improvement in text alignment and image consistency [19][20]
- LongCat-Video's visual-quality score is nearly on par with Google's Veo3, indicating competitive capability in the video generation landscape [17][20]

Future Implications
- Meituan views LongCat-Video as a foundational step toward developing "world models," which could enhance its capabilities in robotics and autonomous driving [22]
- The model's ability to generate realistic video content may enable better modeling of physical knowledge and integration with large language models in future applications [22]
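Block sparse attention, one of the speedup levers mentioned above, restricts each query block to a subset of key blocks instead of the full sequence. A minimal sketch of one such mask is below, using an assumed "own block plus previous block" pattern; the article does not specify LongCat-Video's actual sparsity layout or block size.

```python
def block_sparse_mask(seq_len: int, block: int) -> list:
    """mask[i][j] is True when token i may attend to token j.

    Assumed pattern: each query block attends to its own block and the
    immediately preceding block only (illustrative, not LongCat's layout).
    """
    mask = [[False] * seq_len for _ in range(seq_len)]
    for i in range(seq_len):
        for j in range(seq_len):
            bi, bj = i // block, j // block
            mask[i][j] = bj in (bi, bi - 1)
    return mask

m = block_sparse_mask(seq_len=8, block=2)
density = sum(sum(row) for row in m) / 64
print(f"attended fraction: {density:.2f}")  # well under the dense 1.00
```

Because only O(seq_len * block) entries are computed instead of O(seq_len^2), the fraction of attended pairs shrinks as sequences grow, which is what makes minute-long video token sequences tractable.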
AI Video Generation Through the Lens of "World Understanding": How Do Veo3 and Sora2 Measure Up? A New Benchmark Arrives
量子位· 2025-10-27 08:26
Core Insights
- The article discusses the significant advances in text-to-video (T2V) models, highlighting the recent success of Sora2 and questioning whether T2V models have achieved true "world model" capabilities [1]
- A new evaluation framework called VideoVerse is proposed to assess T2V models on their understanding of event causality, physical laws, and common sense, which are essential for a "world model" [1][3]

Evaluation Framework
- VideoVerse evaluates T2V models from two main perspectives: dynamic aspects (event following, mechanics, interaction, material properties, camera control) and static aspects (natural constraints, common sense, attribution correctness, 2D layout, 3D depth) [3]
- Each prompt corresponds to several binary evaluation questions, with event following measured through sequence consistency using the Longest Common Subsequence (LCS) [4][16]

Prompt Construction
- The team employs a multi-stage process to ensure the authenticity, diversity, and evaluability of prompts, sourcing data from daily life, scientific experiments, and science fiction [8][9]
- Event and causal structures are extracted using advanced language models to convert natural-language descriptions into event-level structures, laying the groundwork for evaluating "event following" [10][11]

Evaluation Methodology
- The evaluation combines QA and LCS scoring, covering event following, dimension-specific questions, and an overall score that reflects both logical sequence and physical detail [5][18]
- Hidden semantics are introduced to assess whether models can generate implicit consequences that are not explicitly stated in prompts [20][22]

Experimental Findings
- The team evaluated various open-source and closed-source models, finding that open-source models perform comparably on basic dimensions but lag significantly in world-model capabilities [28]
- Even the strongest closed-source model, Sora2, shows notable deficiencies in "hidden semantics following" and certain physical/material inferences [29]

Conclusion and Future Directions
- VideoVerse provides a comprehensive evaluation framework aimed at shifting the focus from merely generating realistic visuals to understanding and simulating the world [40]
- The team has open-sourced the data, evaluation code, and a leaderboard, encouraging further research to enhance world-model capabilities [41]
Tesla's World Simulator Debuts at ICCV, With Its VP Personally Explaining the End-to-End Autonomous Driving Roadmap
36Ke· 2025-10-27 08:11
Core Insights
- Tesla has unveiled a world simulator for generating realistic driving scenarios, presented by Ashok Elluswamy at the ICCV conference, where he emphasized that the future of intelligent driving lies in end-to-end AI [1][5][24]

Group 1: World Simulator Features
- The world simulator can create new, challenging scenarios for autonomous driving tasks, such as vehicles suddenly changing lanes or AI navigating around pedestrians and obstacles [2]
- The generated scenario videos serve dual purposes: training autonomous driving models and providing a gaming experience for human users [2][4]

Group 2: End-to-End AI Approach
- Elluswamy highlighted that end-to-end AI is the future of autonomous driving, using data from various sensors to generate control commands for vehicles [5][8]
- The end-to-end approach is contrasted with modular systems, which are easier to develop initially but lack the optimization and scalability of end-to-end systems [8][10]

Group 3: Challenges and Solutions
- One major challenge for end-to-end autonomous driving is evaluation, which the world simulator addresses by using a vast dataset to synthesize future states based on current conditions [11]
- The complexity of real-world data, such as high frame rates and multiple sensor inputs, leads to a "curse of dimensionality," which Tesla mitigates by collecting extensive driving data to improve model generalization [13][15]

Group 4: Industry Perspectives
- The industry is divided between two main approaches to end-to-end autonomous driving, VLA (Vision-Language-Action) and world models, with companies adopting different strategies [24]
- Tesla's choice of the end-to-end approach has drawn attention because of its historical success in autonomous driving, raising questions about the technology's future direction [24]