Huawei and NIO Bet Big on the WA World Model! Is This the Future Direction of Assisted Driving?
电动车公社· 2025-10-03 15:58
Core Viewpoint - The article discusses the WA (World Action) model in the context of autonomous driving technology, contrasting it with the VLA (Vision-Language Action) model and highlighting their respective advantages and applications in the industry [4][62].

Summary by Sections

Introduction to WA Model
- The WA model is gaining traction in the autonomous driving sector, with companies like Huawei and NIO publicly endorsing this approach [6][30].
- The concept has historical roots dating back to the 1940s, originating from the idea of "mental models" proposed by psychologist Kenneth Craik [9][11].

Mechanism of WA Model
- The WA model allows machines to interpret the physical world by maintaining an internal "small world model" that supports decision-making based on sensory information [12][29].
- The approach has evolved with advances in AI, particularly after DeepMind introduced "dream training" techniques in 2018, which compress real-world scenarios into data for predictive modeling [17][26].

Comparison with VLA Model
- The WA model is characterized by strong analytical capabilities regarding the laws of motion in the physical world, enabling it to predict driving scenarios effectively [31][32].
- NIO claims the WA model can analyze the last 3 seconds of driving data and, in just 0.1 seconds, simulate conditions up to 120 seconds ahead, generating 216 possible scenarios [32][33].
- The WA model incorporates a "pre-judgment" phase, improving its response speed over traditional end-to-end models [34][35].

Advantages of WA Model
- The WA model offers higher interpretability and lower latency, making it more effective than the VLA model in specific hazardous scenarios [60].
- It can simulate extreme collision scenarios in a virtual environment, generating extensive training data that is crucial for improving the system's response to rare events [51][52].
- Its architecture is designed to use less computational power at the vehicle level, optimizing performance in critical situations [54][59].

Long-term Outlook
- While the WA and VLA models currently represent distinct paths in autonomous driving technology, there is potential for future integration or for new architectures that unify their strengths [71].
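The "pre-judgment" idea above (condition on the last few seconds of observations, then imagine many candidate futures before a hazard materializes) can be sketched in a few lines. This is a hypothetical toy with simple 1-D kinematics; the function name, the Gaussian disturbance, and all parameters are illustrative assumptions, not NIO's actual system:

```python
import random

def imagined_rollouts(recent_states, n_scenarios=216, horizon_s=120.0, dt=1.0):
    """Sample many imagined futures from a short history of observed positions.

    Toy sketch only: a real world model would predict rich scene dynamics,
    not 1-D positions with Gaussian noise.
    """
    # Estimate current velocity from the recent history (finite difference).
    v = recent_states[-1] - recent_states[-2]
    scenarios = []
    for i in range(n_scenarios):
        rng = random.Random(i)  # deterministic per-scenario seed
        x, speed = recent_states[-1], v
        trajectory = []
        t = 0.0
        while t < horizon_s:
            speed += rng.gauss(0.0, 0.1)  # imagined disturbance (e.g. other agents)
            x += speed * dt
            trajectory.append(x)
            t += dt
        scenarios.append(trajectory)
    return scenarios

# 3 seconds of history in, 216 imagined 120-second futures out.
futures = imagined_rollouts([0.0, 1.0, 2.0, 3.0])
```

A planner would then score each imagined trajectory (e.g. for collision risk) and act on the safest one; that scoring step is elided here.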
Sim2Real Can't Solve the Data Dilemma of Embodied Intelligence
自动驾驶之心· 2025-10-03 03:32
Core Viewpoint - The article discusses the ongoing debate in embodied intelligence between relying on simulation efficiency and real-world data, and the potential of world models to redefine how data is used in this domain [4][8].

Group 1: Understanding the Sim-to-Real Gap
- The "Sim-to-Real gap" refers to discrepancies between simulated environments and real-world scenarios, arising primarily from incomplete simulations that fail to replicate visual and physical details accurately [8].
- Research indicates the gap persists because simulation models do not fully capture the complexities of the real world, leading to limited generalization and over-fitting to specific scenarios [8][11].
- Bridging the gap involves optimizing around data, including designing virtual-to-real data ratios and leveraging AIGC to generate diverse datasets that balance volume and authenticity [11][12].

Group 2: Data Utilization in Embodied Intelligence
- Experts agree that while real data is ideal for training, the scarcity of high-quality real-world datasets in embodied intelligence currently necessitates reliance on simulation data [20][21].
- Simulation data plays a crucial role in foundational model iteration and testing, allowing algorithms to be tested safely and efficiently before deployment on real machines [21][24].
- Simulation also holds promise for scaling reinforcement learning: well-constructed simulators enable large-scale parallel training, letting models learn from scenarios that are difficult to capture in real life [24][26].

Group 3: World Models and Future Directions
- The article emphasizes the significance of world models in future research, particularly in autonomous driving and embodied intelligence, showcasing their potential for general visual understanding and long-term planning [30][32].
- Challenges remain in automating the generation of simulation data and ensuring the diversity and generalization of actions within simulations, both critical for advancing the field [28][29].
- Introducing new modalities, such as force and touch, into world models is suggested as a promising research direction, despite current limits on computational resources [30][31].

Group 4: Reaction to Boston Dynamics Technology
- Experts acknowledge the advanced capabilities of Boston Dynamics robots, particularly their smooth execution of complex tasks requiring sophisticated motion control [33][37].
- The discussion highlights the importance of hardware and data in embodied intelligence, with Boston Dynamics' approach serving as a benchmark for future development [37][39].
- The consensus is that the robots' seamless performance stems not only from hardware but also from superior motion-control techniques that could inform future research [39][41].
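The "virtual-to-real data ratio" design mentioned above can be sketched as a weighted batch sampler: scarce real samples are up-weighted so that every training batch contains a fixed fraction of real-world data alongside abundant simulation data. All names and the 20% ratio are illustrative assumptions, not a prescription from the article:

```python
import random

def mixed_batch(sim_data, real_data, real_ratio=0.2, batch_size=32, seed=0):
    """Draw one training batch with a fixed real/sim split.

    Toy sketch: real frameworks would stream from datasets, but the
    ratio-design idea is the same.
    """
    rng = random.Random(seed)
    n_real = int(batch_size * real_ratio)
    # Sample with replacement, since real data is typically much scarcer.
    batch = [rng.choice(real_data) for _ in range(n_real)]
    batch += [rng.choice(sim_data) for _ in range(batch_size - n_real)]
    rng.shuffle(batch)  # avoid ordering bias within the batch
    return batch
```

In practice the ratio would be tuned per model and task, and the sim pool could itself be AIGC-generated to widen scenario diversity.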
The Latest World Model! WorldSplat: Gaussian-Centric Feedforward 4D Scene Generation for Autonomous Driving (Xiaomi & Nankai)
自动驾驶之心· 2025-10-02 03:04
Core Insights - The article introduces WorldSplat, a novel feedforward framework that integrates generative methods with explicit 3D reconstruction for 4D driving-scene synthesis, addressing the challenge of generating controllable and realistic driving-scene videos [5][36].

Background Review
- Generating controllable and realistic driving-scene videos is a core challenge in autonomous driving and computer vision, crucial for scalable training and closed-loop evaluation [5].
- Existing generative models have made progress in high-fidelity, user-customized video generation, reducing reliance on expensive real data, while urban scene-reconstruction methods have optimized 3D representation and consistency for novel-view synthesis [5][6].
- Despite these advances, generative and reconstruction methods still struggle to create unseen environments and synthesize novel views, and existing video-generation models often lack 3D consistency and controllability [5][6].

WorldSplat Framework
- WorldSplat combines generative diffusion with explicit 3D reconstruction, constructing dynamic 4D Gaussian representations that can render novel views along any user-defined camera trajectory without per-scene optimization [6][10].
- The framework consists of three key modules: a 4D-aware latent diffusion model for multimodal latent generation, a latent Gaussian decoder for feedforward 4D Gaussian prediction and real-time trajectory rendering, and an enhancement diffusion model for video-quality optimization [10][12].

Algorithm Details
- The 4D-aware latent diffusion model generates multimodal latents containing RGB, depth, and dynamic-object information, conditioned on user-defined controls [14][15].
- The latent Gaussian decoder predicts pixel-aligned 3D Gaussians, separating static backgrounds from dynamic objects to form a unified 4D representation [20][21].
- The enhancement diffusion model refines the rendered RGB video using both the original input and the rendered video, enriching spatial detail and improving temporal coherence [24][27].

Experimental Results
- Extensive experiments demonstrate that WorldSplat achieves state-of-the-art performance in generating high-fidelity, temporally consistent free-view videos, significantly benefiting downstream driving tasks [12][36].
- Comparative results show WorldSplat outperforming existing generative and reconstruction techniques in realism and novel-view quality [31][32].

Conclusion
- The proposed framework effectively integrates generative and reconstruction methods, producing explicit 4D Gaussians optimized for high-fidelity, temporally and spatially consistent multi-trajectory driving videos [36].
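The three-module data flow described above (latent diffusion → feedforward Gaussian decoding → rendering → enhancement) can be sketched as a skeleton. Every class, field, and return value here is an illustrative stub showing only the wiring between stages, not the paper's actual API or models:

```python
from dataclasses import dataclass

@dataclass
class Latents:
    """Multimodal latents: RGB, depth, and dynamic-object channels (stub)."""
    rgb: list
    depth: list
    dynamics: list

class LatentDiffusion:
    """Module 1: 4D-aware latent diffusion conditioned on user controls (stub)."""
    def generate(self, controls):
        n = len(controls)
        return Latents(rgb=[0.0] * n, depth=[1.0] * n, dynamics=[None] * n)

class GaussianDecoder:
    """Module 2: feedforward latent-to-Gaussian decoder (stub)."""
    def decode(self, latents):
        # One pixel-aligned Gaussian per latent element; static/dynamic split elided.
        return [{"mean": m, "depth": d} for m, d in zip(latents.rgb, latents.depth)]

class EnhancerDiffusion:
    """Module 3: enhancement pass over the rendered video (stub)."""
    def enhance(self, frames):
        return frames  # placeholder: would sharpen detail, enforce temporal coherence

def render(gaussians, trajectory):
    # Placeholder rasterization: one frame per requested camera pose.
    return [gaussians for _ in trajectory]

def worldsplat_pipeline(controls, trajectory):
    latents = LatentDiffusion().generate(controls)
    gaussians = GaussianDecoder().decode(latents)
    frames = render(gaussians, trajectory)
    return EnhancerDiffusion().enhance(frames)
```

The key design point the skeleton illustrates: because the decoder is feedforward, the same decoded Gaussians can be rendered along any user-defined trajectory without per-scene optimization.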
Everything You Want in a Dream? Google's New World Model Trains Purely on "Imagination" and Learned to Mine Diamonds in Minecraft
机器之心· 2025-10-02 01:30
Core Insights
- Google DeepMind's Dreamer 4 supports the idea that agents can learn skills for interacting with the physical world through imagination, without direct interaction [2][4].
- Dreamer 4 is the first agent to obtain diamonds in the challenging game Minecraft purely from standard offline datasets, a significant advance in offline learning [7][21].

Group 1: World Model and Training
- World models enable agents to understand the world deeply and select successful actions by predicting future outcomes from their own perspective [4].
- Dreamer 4 uses a novel shortcut forcing objective and an efficient Transformer architecture to accurately learn complex object interactions while supporting real-time human interaction on a single GPU [11][19].
- The model can be trained on large amounts of unlabeled video, requiring only a small amount of action-paired video, opening the door to learning general world knowledge from diverse online videos [13].

Group 2: Experimental Results
- In the offline diamond challenge, Dreamer 4 significantly outperformed OpenAI's offline agent VPT while using 100 times less data [22].
- Dreamer 4 surpassed behavior-cloning methods both in which key items it acquired and in how quickly it obtained them, indicating that world-model representations are superior for decision-making [24].
- The agent succeeded in 14 of 16 interaction tasks in the Minecraft environment, showcasing robust capabilities [29].

Group 3: Action Generation
- With only 10 hours of action-labeled training data, Dreamer 4 achieved a PSNR of 53% and SSIM of 75%, indicating that the world model absorbs most of its knowledge from unlabeled videos with minimal action data [32].
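The imagination training described above can be sketched as a rollout loop in which the agent only ever sees states predicted by the learned world model, never the real environment. This is a toy illustration of that loop; the names, the stand-in world model, the reward rule, and the omitted policy-update step are all assumptions, not Dreamer 4's actual objective:

```python
def evaluate_in_imagination(world_model, policy, start_obs, horizon=15, episodes=100):
    """Roll the policy out entirely inside the (learned) world model.

    A Dreamer-style agent would compute policy-gradient updates from these
    imagined returns; that update step is elided here.
    """
    returns = []
    for _ in range(episodes):
        state, total = start_obs, 0.0
        for _ in range(horizon):
            action = policy(state)
            state, reward = world_model(state, action)  # imagined transition
            total += reward
        returns.append(total)
    return sum(returns) / len(returns)

# Trivial stand-in world model: moving right yields reward 1 per step.
def toy_world_model(state, action):
    return state + action, 1.0 if action > 0 else 0.0

avg = evaluate_in_imagination(toy_world_model, policy=lambda s: 1, start_obs=0.0)
```

The point of the sketch: because transitions come from the model rather than the environment, all of this "experience" is generated offline, which is what lets an agent like Dreamer 4 learn from fixed datasets alone.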
Sim, Real, or World Model? The "Dilemma" of Embodied Intelligence Data and Its Solutions
具身智能之心· 2025-10-01 12:48
Core Viewpoint - The article discusses the ongoing debate in embodied intelligence between relying on simulation efficiency and real-world data, and the potential of world models to bridge the gap between the two approaches [2].

Group 1: Understanding the Sim-to-Real Gap
- The "Sim-to-Real gap" refers to discrepancies between simulated environments and real-world scenarios, arising primarily from incomplete simulations that fail to replicate visual and physical details accurately [3].
- Key contributing factors include limited simulation data, which weakens model generalization and restricts adaptability beyond specific scenarios [3].
- Narrowing the gap requires optimization around data, including designing virtual-to-real data ratios based on model requirements and leveraging AIGC to generate diverse, realistic data [3].

Group 2: Data Utilization in Embodied Intelligence
- Experts agree that while real data is ideal for training, simulation data plays a crucial role in foundational model iteration and testing [15][18].
- Real data is scarce in embodied intelligence, making it difficult to meet the high expectations for diverse task performance [15].
- Simulation data is currently a necessary resource, especially for testing algorithms and avoiding potential damage in real-world experiments [15][18].

Group 3: Future Directions and Challenges
- World models are viewed as a promising direction for the future of embodied intelligence, with potential applications in autonomous driving and beyond [25].
- Key challenges include automating the generation of simulation data and enhancing the diversity of actions within simulation environments [21][23].
- Integrating new modalities, such as force and touch, into world models is suggested as a valuable research direction [23].

Group 4: Reaction to Boston Dynamics Technology
- Experts acknowledge the advanced capabilities of Boston Dynamics robots, particularly their smooth execution of complex full-body tasks [26][30].
- The discussion highlights the importance of hardware and data in achieving high performance in embodied intelligence systems, with Boston Dynamics setting a benchmark for the field [30].
- Further exploration of motion-control techniques is needed to improve the fluidity of robotic movements [32].
Some People Are Blindly Grinding Away in Autonomous Driving, While Others Are Building Real Moats...
自动驾驶之心· 2025-09-29 23:33
Core Viewpoint - The automotive industry is undergoing a significant transformation, with numerous executive changes and a focus on advanced technologies such as autonomous driving and artificial intelligence [1][3].

Group 1: Industry Changes
- In September, 48 executives in the automotive sector changed roles, signaling shifts in leadership and strategy [1].
- Companies like Li Auto and BYD are restructuring their teams to strengthen their autonomous-driving and cockpit capabilities [1].
- Algorithm development is evolving rapidly, moving from BEV to more complex approaches such as VLA and world models [1][3].

Group 2: Autonomous Driving Focus
- The forefront of autonomous-driving technology centers on VLA/VLM, end-to-end driving, world models, and reinforcement learning [3].
- A notable gap exists between students' and mid-sized companies' understanding and the industry's actual progress, highlighting the need for better communication between academia and industry [3].

Group 3: Community and Knowledge Sharing
- A community called "Autonomous Driving Heart Knowledge Planet" was established to bridge academic and industrial knowledge, aiming to grow to nearly 10,000 members in two years [5].
- The community offers a comprehensive learning platform, including video content, Q&A, and job exchange, catering to both beginners and advanced learners [6][10].
- Members can access over 40 technical routes and engage with industry leaders on trends and challenges in autonomous driving [6][8].

Group 4: Learning Resources
- The community provides resources for practical questions, such as entry points for end-to-end systems and data-annotation practices [6][11].
- A detailed curriculum covering essential topics in autonomous-driving technology is available for newcomers [20][21].
- The platform includes job-referral mechanisms connecting members with potential employers in the autonomous-driving sector [13][14].
Wallstreetcn Breakfast FM-Radio | September 30, 2025
Sohu Finance· 2025-09-29 23:27
Market Overview
- Technology stocks lifted the three major US stock indices, which rose for a second consecutive day to a one-week high, with Nvidia up over 2% and Micron up over 4% [1].
- US Treasuries rose, with the ten-year yield declining for the first time in four days [1].
- Bitcoin surged nearly $4,000 to top the $114,000 mark, while Ethereum rebounded over 4% [1].
- Crude oil fell over 3%, the largest drop in three months, with WTI down over 4% [1].
- Gold hit a historical high, with spot gold rising nearly 2% to break the $3,800 mark for the first time [1].

Key News
- The Central Committee of the Communist Party of China held a meeting to discuss documents to be submitted for review at the 20th Central Committee's Fourth Plenary Session [11].
- The National Development and Reform Commission announced a new policy-based financial instrument totaling 500 billion yuan, aimed at supporting private enterprises' deep participation in the "Artificial Intelligence +" initiative [11][12].

Company News
- Facing competition from the iPhone 17, analyst Ming-Chi Kuo lowered the shipment target for the Xiaomi 17 by 20% [17].
- Anthropic launched Claude Sonnet 4.5, claiming it to be the "best coding model globally" [17][23].
- OpenAI plans to launch Sora 2, a standalone app that defaults to using copyrighted content, which has sparked controversy [17].

Industry Insights
- The A-share market is in a bull phase characterized by high volume, moderate enthusiasm, and distinct structural features, with no clear bubble signals [18].
- The semiconductor industry is seeing significant developments, with a new Shenzhen semiconductor company attracting external investors [15].
- The education sector is being transformed by digital technology and AI, with a focus on enhancing digital education services [24].
Financial Times: The Next Gateway to Superintelligence; Google, Meta, Nvidia and Other Tech Giants Are Doubling Down on "World Models"
美股IPO· 2025-09-29 08:51
Core Viewpoint - Major AI companies like Google DeepMind, Meta, and Nvidia are shifting their R&D focus toward "world models" to gain an edge in the race toward machine "superintelligence" [1][3][7].

Group 1: Market Potential
- The potential market for "world models" is estimated at up to $100 trillion, spanning sectors such as autonomous driving, robotics, and manufacturing [1][3][4].

Group 2: Technological Developments
- Google DeepMind released Genie 3, which generates video frame by frame, allowing for scalable AI training without real-world consequences [5].
- Meta is training its V-JEPA model on raw video content to mimic how children learn passively through observation, with ongoing tests on robots [5].
- Nvidia's CEO has stated that the company's next major growth phase will come from "physical AI," leveraging its Omniverse platform for simulations to support expansion into robotics [5].

Group 3: Applications and Innovations
- "World models" are being applied in the entertainment industry: startups like World Labs are developing models that generate 3D environments from single images, and Runway is creating game scenes that better respect physical laws [6].

Group 4: Industry Challenges
- The shift toward "world models" is driven by the perception that large language models (LLMs) are approaching their performance ceiling, prompting significant investment from major companies [7][8].
- Building these models requires vast amounts of physical-world data and computational power, which remains a significant technical challenge [9].
- Experts believe that machines with human-level intelligence driven by next-generation AI systems may still be up to a decade away [9].
Led by Industry Veterans! Master End-to-End Autonomous Driving in Three Months
自动驾驶之心· 2025-09-29 08:45
Core Viewpoint - 2023 is identified as the year end-to-end reached production, with 2024 expected to be a significant year for this development in the automotive industry, particularly in autonomous-driving technology [1][3].

Group 1: End-to-End Production
- Leading new-force automakers and manufacturers have already achieved end-to-end mass production [1].
- Two main paradigms exist in the industry, one-stage and two-stage, with UniAD representing the one-stage approach [1].

Group 2: Development Trends
- Since last year, the one-stage end-to-end approach has evolved rapidly, producing variants based on perception, world models, diffusion models, and VLA [3].
- Major autonomous-driving companies are focusing on in-house development and mass production of end-to-end solutions [3].

Group 3: Course Offerings
- A course titled "End-to-End and VLA Autonomous Driving" has been launched, covering cutting-edge algorithms in both one-stage and two-stage approaches [5].
- The course covers the latest technologies in the field, including BEV perception, vision-language models, diffusion models, and reinforcement learning [5].

Group 4: Course Structure
- The course opens with an introduction to end-to-end algorithms, followed by the background knowledge essential for understanding the technology stack [9][10].
- The second chapter focuses on the technical keywords most frequently asked about in job interviews over the next two years [10].
- Subsequent chapters cover two-stage end-to-end methods, one-stage end-to-end methods, and practical assignments involving RLHF fine-tuning [12][13].

Group 5: Learning Outcomes
- Upon completion, participants are expected to reach a level equivalent to one year of experience as an end-to-end autonomous-driving algorithm engineer [19].
- The course aims to deepen understanding of key technologies such as BEV perception, multimodal large models, and reinforcement learning, enabling participants to apply learned concepts to real projects [19].
The Trigger Point for AI's Next Leap: "World Models"
财联社· 2025-09-29 08:44
Core Insights
- The article emphasizes the critical role of world models in advancing artificial intelligence toward Artificial General Intelligence (AGI) [3][4].
- It highlights growing interest and investment in world models, with more than 10 players in the field in China alone, indicating a significant trend in AI development [3].

Group 1: Importance of World Models
- World models are essential for enhancing AI's spatial-reasoning capabilities, allowing for better interaction with the physical world [4][5].
- Integrating multimodal data through world models is seen as foundational for physical reasoning and simulating future states, which is crucial for achieving human-like intelligence [5][6].

Group 2: Current Developments and Applications
- Companies like Meta and Google DeepMind are actively developing systems that use world models to improve AI performance in real-world simulations [3][9].
- Tesla and Waabi are embedding world models in their AI systems for autonomous driving, showcasing practical applications of this technology [10].

Group 3: Challenges and Limitations
- Current AI systems rely primarily on probability models and struggle with logical reasoning, a gap world models aim to address [6][7].
- The complexity of the real world demands advanced simulations for AI training, as demonstrated by DeepMind's Genie 3 project [9].