World Models
Cited by a Stanford embodied-AI leader, with Hugging Face officially pressing for its release: Beijing Humanoid open-sources the WoW embodied world model
Ji Qi Zhi Xin· 2025-10-17 11:53
Report by Jiqizhixin (Jiqizhixin editorial team). If the GPT series let AI understand language, and the Sora series let AI generate visual worlds, then WoW is attempting to let AI model the physical world. At a moment when "embodied intelligence" and "world models" have become the keywords of the new round of AI competition, a Chinese team from the Beijing Humanoid Robot Innovation Center, the National Key Laboratory of Multimedia Information Processing at Peking University, and the Hong Kong University of Science and Technology has open-sourced a new world-model architecture. The team proposes WoW (World-Omniscient World Model), a world model that lets machines truly "see, understand, and act in the world." The intent is for AI to learn by "doing": learning causality and physics through bodily interaction with the world, in service of building the most usable embodied-intelligence robots for the industry. Upon release it drew attention from academia and industry alike. Hugging Face commented "Excellent work" and pressed for the open-source release, while Stanford embodied-AI leader and Physical Intelligence (PI) co-founder Chelsea Finn, in a paper with Tsinghua collaborators, cited the WoW technical report. Not captioning pictures, but understanding the world hands-on: demystifying the WoW model. A world model with genuine physical understanding must be built on broad, causally rich interaction with, and feedback from, the real world. Through active interaction with the world, humans gradually develop an intuitive physics ...
Fei-Fei Li's world model gets a major update: real-time 3D world generation with just one GPU
36Kr· 2025-10-17 08:03
Core Insights
- The article discusses the launch of RTFM (Real-Time Frame Model) by World Labs, which allows for real-time generation of interactive 3D worlds using a single H100 GPU [1][8]
- RTFM distinguishes itself from other models by enabling complex visual effects and interactions from a single static image, utilizing end-to-end learning from vast video data [4][9]

Group 1: Technology and Capabilities
- RTFM can generate a 3D scene that users can explore in real time, simulating realistic visual effects such as reflections and shadows [4][6]
- The model operates on three core principles: efficiency, persistence, and the ability to learn from video data without explicit 3D modeling [6][11]
- RTFM employs a mechanism called "spatial memory" to maintain consistency in the generated world, allowing users to revisit the environment without increasing computational load (see the sketch after this list) [11][13]

Group 2: Market Context and Future Prospects
- The technology aims to overcome significant computational challenges faced by existing models, such as Sora, which require extensive processing power for real-time video generation [6][15]
- As hardware costs fall and algorithms improve, RTFM is positioned to make immersive virtual worlds considerably more accessible [15]
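The article describes "spatial memory" only at a high level. As a rough, hypothetical sketch of how an autoregressive frame model might keep per-step compute bounded, the Python pseudocode below retrieves a fixed number of pose-nearby frames as context instead of attending to the whole history. All names (SpatialMemory, explore, model.predict) are our own illustration, not World Labs' API.

```python
# Hypothetical sketch, not World Labs' RTFM code: each generated frame is
# stored with its camera pose, and revisiting a region retrieves nearby
# stored frames, so per-step compute stays constant as the session grows.

from dataclasses import dataclass, field

@dataclass
class Pose:
    x: float
    y: float
    yaw: float

@dataclass
class SpatialMemory:
    frames: list = field(default_factory=list)  # (pose, frame) pairs

    def store(self, pose, frame):
        self.frames.append((pose, frame))

    def nearby(self, pose, k=8):
        # Return the k stored frames closest to the query pose, keeping
        # the context size O(k) no matter how long the session runs.
        dist = lambda pf: (pf[0].x - pose.x) ** 2 + (pf[0].y - pose.y) ** 2
        return sorted(self.frames, key=dist)[:k]

def explore(model, start_frame, camera_path):
    # Autoregressive loop: predict each new frame from a bounded context.
    memory = SpatialMemory()
    frame, pose = start_frame, camera_path[0]
    memory.store(pose, frame)
    for next_pose in camera_path[1:]:
        context = memory.nearby(next_pose)          # bounded context window
        frame = model.predict(context, next_pose)   # one generation step
        memory.store(next_pose, frame)
    return frame
```

The design point this sketch is meant to surface: because nearby() caps the context at k frames, returning to a previously visited area reuses what was generated there rather than re-deriving it, which is one plausible reading of "persistence without growing computational load."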
"Godmother of AI" Fei-Fei Li's new world model is here: a single NVIDIA AI chip can generate unlimited 3D worlds
Tai Mei Ti APP· 2025-10-17 02:53
Core Insights
- World Labs, co-founded by Fei-Fei Li, has launched a new real-time generative world model called RTFM (Real-Time Frame Model), which utilizes large-scale video data for efficient end-to-end training [3][4]
- RTFM can generate new 2D images from one or more 2D inputs without relying on explicit 3D representations, marking a significant advance in AI rendering capabilities (illustrated in the sketch below) [3][4]
- The model can render persistent, 3D-consistent scenes in real time using a single NVIDIA H100 GPU, enabling interactive experiences in both real and virtual environments [4][10]

Company Overview
- World Labs was founded in March 2024 by Fei-Fei Li and three other scholars, focusing on developing efficient, scalable, and persistent world models [8][10]
- The company raised $230 million in September 2024, having reached a valuation of $1 billion within three months of its establishment [10]
- The team numbers roughly 24 members, many of them Chinese researchers [10]

Technology and Innovation
- RTFM addresses scalability issues that have long plagued world models, enhancing machines' spatial intelligence and enabling better navigation and decision-making in complex 3D environments [6][7]
- Its efficiency supports interactive-frame-rate inference on a single H100 GPU, while its scalability allows continuous optimization as data and computational power grow [8][10]
- Future plans include developing a Large World Model (LWM) that comprehensively understands three-dimensional, physical, and temporal concepts, with applications in AR and robotics [10][12]

Research and Development
- Fei-Fei Li is also spearheading the Behavior 1K challenge, aimed at standardizing tasks in embodied intelligence and robotics research and providing a platform for training and evaluation [11][12]
- The Behavior 1K challenge comprises 1,000 long-horizon tasks in everyday environments, promoting collaboration and comparison among researchers [12]
- The integration of various AI technologies is seen as a transformative moment for society, emphasizing a human-centered approach to AI development [12][13]
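To make the "2D in, 2D out, no explicit 3D" point concrete, here is a minimal interface sketch contrasting a classical renderer with an RTFM-style learned renderer. The signatures are hypothetical and for illustration only; they are not World Labs' API.

```python
# Hypothetical interface sketch, not World Labs' API. A classical renderer
# consumes an explicit 3D scene description; an RTFM-style learned renderer
# consumes posed 2D images plus a target camera, and a single network call
# produces the new view. Geometry, lighting, and reflections live
# implicitly in the network weights rather than in a mesh.

def classical_render(mesh, materials, lights, camera):
    """Explicit pipeline: rasterize or raytrace the 3D scene assets."""
    ...

def learned_render(net, context_images, context_cameras, target_camera):
    """Learned pipeline: posed 2D images in, one new 2D image out."""
    return net(context_images, context_cameras, target_camera)
```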
Fei-Fei Li's team releases its latest world-model work
Jing Ji Guan Cha Wang· 2025-10-17 01:59
Jing Ji Guan Cha Wang, citing Ke Chuang Ban Ri Bao (Star Market Daily) on the 17th: on October 16 local time, Fei-Fei Li announced the launch of the new model RTFM (A Real-Time Frame Model), which not only runs in real time with persistence and 3D consistency, but can do so on a single H100 GPU. ...
Fei-Fei Li releases a new world model that runs on a single GPU
36Kr· 2025-10-17 01:45
Core Insights
- The newly launched RTFM (A Real-Time Frame Model) from Fei-Fei Li's team is designed to run in real time with persistence and 3D consistency, requiring only a single H100 GPU for operation [1][10]
- RTFM is built on three core principles: efficiency, scalability, and persistence, allowing real-time inference at interactive frame rates, continuous expansion with data and computational power, and permanent retention of all scenes [1][6]

Group 1: Model Capabilities
- RTFM can generate and simulate a persistent, interactive, and physically accurate world, which has the potential to transform various industries from media to robotics [3][5]
- The model's efficiency allows real-time inference with just one H100 GPU, making it immediately deployable while ensuring that the virtual world remains intact during user interactions [1][6]

Group 2: Technical Innovations
- RTFM takes a novel approach, training a single neural network to generate 2D images from 2D inputs without requiring explicit 3D representations, thus simplifying the modeling process [7][8]
- The model employs an autoregressive diffusion-transformer architecture, trained end to end on vast video data, enabling it to predict subsequent frames from historical ones [7][8]

Group 3: Memory and Persistence
- RTFM addresses the challenge of persistence by annotating each frame with a spatial pose, allowing the model to maintain a memory of the world without explicit 3D geometry [9][10]
- A technique called context juggling enables the model to generate content in different spatial areas using varying sets of context frames, maintaining long-term memory of large worlds during extended interactions (see the sketch after this list) [10]
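The descriptions of pose-annotated frames and "context juggling" suggest simple bookkeeping: pools of posed context frames keyed by spatial region, with the active pool swapped in as the camera moves. The sketch below is our illustrative guess at that mechanism; the region grid size, pool cap, and all names are assumptions, not World Labs' implementation.

```python
# Illustrative sketch of "context juggling" (names and parameters are our
# assumptions, not World Labs' code). The world is divided into coarse
# spatial regions, each holding a bounded pool of pose-annotated frames;
# generation in a region conditions only on that region's pool, so memory
# of a large world persists without one ever-growing context window.

from collections import defaultdict, namedtuple

Pose = namedtuple("Pose", ["x", "y", "yaw"])
REGION_SIZE = 10.0  # hypothetical grid cell size, in metres

class ContextJuggler:
    def __init__(self, max_per_region=16):
        self.pools = defaultdict(list)   # region id -> [(pose, frame), ...]
        self.max_per_region = max_per_region

    @staticmethod
    def region_of(pose):
        # Coarse grid cell used to index the context pools.
        return (int(pose.x // REGION_SIZE), int(pose.y // REGION_SIZE))

    def add(self, pose, frame):
        pool = self.pools[self.region_of(pose)]
        pool.append((pose, frame))
        if len(pool) > self.max_per_region:
            pool.pop(0)                  # drop the oldest; pool stays bounded

    def context_for(self, pose):
        # "Juggle" in the pool belonging to the region being rendered.
        return self.pools[self.region_of(pose)]
```

Under this reading, persistence falls out of the data structure: a region's pool survives even when the camera is elsewhere, so returning to it restores a consistent view at no extra cost.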
With its infrastructure now complete, the autonomous driving industry is well worth exploring for graduating students!
Zi Dong Jia Shi Zhi Xin· 2025-10-17 00:03
Core Viewpoint
- The autonomous driving industry is maturing in terms of infrastructure and investment, making it a suitable field for students and professionals to explore and develop their skills [1][16]

Group 1: Industry Insights
- The technology landscape in autonomous driving is consolidating, but there are still many product forms to refine, indicating ongoing opportunities for innovation [1]
- The industry is currently debating the technical routes of world models and VLA, suggesting that while theoretical aspects may be solidifying, practical implementation remains a challenge [1]
- The focus on L2 functionality and the regulatory progress for L3 indicate a gradual evolution towards more advanced levels of automation, with L4 still facing unresolved issues [1]

Group 2: Community and Learning Resources
- A community called "Autonomous Driving Heart Knowledge Sphere" has been established, integrating resources such as videos, articles, learning paths, and job exchange, aimed at fostering collaboration and knowledge sharing [4][5]
- The community has grown to over 4,000 members, with a goal of reaching nearly 10,000 in the next two years, providing a platform for both beginners and advanced learners [5]
- The community offers practical guidance on various topics, including entry points for end-to-end learning, multimodal large models, and data annotation practices [7][8]

Group 3: Career Opportunities
- The community actively shares job openings and facilitates connections between members and companies in the autonomous driving sector, enhancing employment opportunities [12][21]
- There is a focus on developing comprehensive learning paths for newcomers, ensuring they have access to a well-rounded education in autonomous driving technologies [17][38]

Group 4: Technical Development
- The community has compiled over 40 technical routes and resource collections related to autonomous driving, covering areas such as perception, simulation, planning, and control [17][34]
- Regular discussions and live sessions with industry experts are held to explore trends, technical directions, and production challenges in autonomous driving [8][90]
How are industry and academia tackling end-to-end and VLA?
Zi Dong Jia Shi Zhi Xin· 2025-10-17 00:03
Core Insights
- The article discusses the evolution of end-to-end algorithms in autonomous driving, highlighting the transition from modular production algorithms to end-to-end models and now to Vision-Language-Action (VLA) models [1][3]
- It emphasizes the rich technology stack involved in end-to-end algorithms, including BEV perception, vision-language models (VLM), diffusion models, reinforcement learning, and world models [3]

Summary by Sections

End-to-End Algorithms
- End-to-end algorithms are categorized into two main paradigms, single-stage and two-stage, with UniAD representative of the single-stage approach [1]
- The single-stage paradigm further branches into various subfields, particularly those based on VLA, which have seen a surge in related publications and industrial applications in recent years [1]

Courses Offered
- The article promotes two courses, "End-to-End and VLA Autonomous Driving Small Class" and "Practical Course on Autonomous Driving VLA and Large Models," aimed at helping individuals enter the field quickly and efficiently [3]
- The practical course focuses on VLA, covering topics from VLM as an autonomous-driving interpreter to modular and integrated VLA, along with detailed theoretical foundations [3][12]

Instructor Team
- The instructor team includes experts from both academia and industry, with backgrounds in multimodal perception, autonomous-driving VLA, and large-model frameworks [8][11][14]
- Notable instructors have published numerous papers at top-tier conferences and have extensive research and applied experience in autonomous driving and large models [8][11][14]

Target Audience
- The courses are designed for individuals with a foundational understanding of autonomous driving who are familiar with the basic modules and have knowledge of transformer models, reinforcement learning, and BEV perception [15][17]
A "major reshuffle" in NIO, Xpeng, and Li Auto's autonomous driving departments: technical routes pivot to world models, with pressure mounting in the second-half breakout battle over intelligence
36Kr· 2025-10-16 07:33
Core Insights
- The competition logic in the Chinese automotive market is shifting: with electrification penetration expected to exceed 50% by 2025, electrification now sets the floor for automakers while intelligence sets the ceiling [1]
- The three leading new forces, NIO, Xpeng, and Li Auto, are undergoing significant personnel changes in their autonomous driving departments, indicating a fundamental shift in their technical strategies in response to traditional automakers' acceleration [1][2]

Group 1: Strategic Adjustments
- Xpeng has seen notable personnel changes, including the departure of key figures and the hiring of new leaders from Alibaba and Cruise, reflecting a strong emphasis on transformation [2][4]
- NIO faces a complex situation of structural reorganization and core talent loss, merging teams into a larger model team aimed at integrating general AI technology [4][11]
- Li Auto's adjustments are characterized by a reduction in team size and a shift from high-precision maps to a hybrid of VLA and world models, achieving over 90% success in specific scenarios [5][11]

Group 2: Industry Trends
- The collective adjustments point to a consensus that traditional modular autonomous-driving stacks have reached a bottleneck, with world models seen as essential for achieving L3/L4 capabilities [7]
- Traditional automakers and tech companies are intensifying competition, with several traditional brands rapidly advancing their autonomous-driving technologies and gaining market recognition [8][10]
- The financial burden of R&D in autonomous driving and AI is significant: NIO is projected to spend 13.04 billion yuan on R&D in 2024, while Xpeng faces delays in its self-developed chips [10][11]

Group 3: Competitive Landscape
- The competitive landscape is increasingly crowded, with traditional automakers leveraging their scale and resources to catch up with the new forces, while tech giants like Huawei establish technological barriers [8][10]
- NIO, Xpeng, and Li Auto are adopting differentiated strategies to defend their first-mover advantages, with Xpeng focusing on cloud-based models and NIO pursuing a dual track of self-development and partnerships [11]
- The intelligent-driving race is intensifying, and the ability to convert technological advances into user experience and profitability is becoming crucial for success in the market [11]
AI & Robotics Pre-Market Express | Musk's xAI is building a "world model"; New Yichang officially launches a robot!
Mei Ri Jing Ji Xin Wen· 2025-10-15 01:11
Market Review
- On October 14, the market opened lower and rebounded slightly, with the three major indices posting minor declines and a majority of stocks falling [1]
- The Huaxia AI ETF (589010) saw a significant drop, closing at 1.432 yuan, down 3.83%, on turnover of approximately 2.41 billion yuan and 1.67 billion shares, indicating concentrated short-term selling pressure [1]
- Among its 30 constituent stocks, only one rose; the rest showed a clear downtrend, with key names such as Chipone Technology and Rainbow Soft Technology leading the declines [1]
- The Robot ETF (562500) also experienced a substantial pullback, closing at 1.009 yuan, down 4.09%, on turnover of 18.25 billion yuan and over 17.7 billion shares, reflecting intense capital competition and concentrated selling pressure [1]
- Only one of its 73 constituent stocks rose, with major decliners such as Double Ring Transmission and Mingzhi Electric all falling more than 6% [1]
- Both ETFs broke through multiple moving-average supports, indicating a potential phase of adjustment [1]

Hot News
- On October 12, it was reported that Elon Musk's xAI is accelerating development of its "world model" to compete with Meta and Google on next-generation AI systems, focusing on autonomous navigation and design [2]
- xAI has recruited experts from Nvidia to aid this development, with gaming and robotics as the initial application areas for validating the world model [2]
- On the same day, New Yichang announced the launch of its humanoid robot HOSON-Robot, marking a strategic focus on humanoid robotics and establishing a regular R&D iteration mechanism [2]
- On October 10, Amazon Web Services launched the agentic AI application Amazon Quick Suite, aimed at enhancing employee efficiency and automating tasks across applications [2]

Institutional Views
- CITIC Construction Investment Securities maintains a positive outlook on the sector, highlighting Tesla's upcoming third-generation product launch after two years, which is expected to clarify the outlook for next year [3]
- The domestic supply chain is anticipated to see continuous catalysts from capital operations, order shipments, and scenario implementations in the second half of the year, suggesting investment opportunities in the sector [3]
Fudan's SeerDrive: an end-to-end framework with bidirectional modeling of trajectory planning and scene evolution
Zi Dong Jia Shi Zhi Xin· 2025-10-14 23:33
Core Insights
- The article discusses advances in end-to-end autonomous driving, focusing on the SeerDrive model, which aims to improve trajectory planning by incorporating bidirectional modeling of trajectory planning and scene evolution [1][3][4]

Group 1: SeerDrive Overview
- SeerDrive introduces a bidirectional modeling paradigm that captures scene dynamics while allowing planning results to refine scene predictions, creating a closed-loop iteration [3][4]
- The overall pipeline consists of four main modules: feature encoding, future BEV world modeling, future perception planning, and iterative optimization (see the sketch at the end of this entry) [4]

Group 2: Challenges in Current Systems
- Current one-shot paradigms in autonomous driving overlook dynamic scene evolution, leading to inaccurate planning in complex interactions [5]
- Existing systems fail to model the impact of the ego vehicle's behavior on the surrounding environment, which is crucial for accurate trajectory planning [5]

Group 3: Technical Components
- Feature encoding transforms multimodal sensor inputs and vehicle states into structured features, laying the groundwork for subsequent modeling [8][9]
- Future BEV world modeling predicts scene dynamics by generating future BEV features, balancing efficiency and structured representation [10][13]

Group 4: Planning and Optimization
- SeerDrive employs a decoupled strategy for planning, allowing current and future scenes to guide planning separately and thus avoiding representation entanglement [15]
- The iterative optimization process strengthens the bidirectional dependency between trajectory planning and scene evolution, leading to improved performance [17]

Group 5: Experimental Results
- SeerDrive achieved a PDMS score of 88.9 on the NAVSIM test set, outperforming several state-of-the-art methods [23]
- On the nuScenes validation set, SeerDrive demonstrated an average L2 displacement error of 0.43 m and a collision rate of 0.06%, significantly better than competing methods [24]

Group 6: Component Effectiveness
- Removing future perception planning or iterative optimization lowered PDMS scores, indicating the importance of both components for the performance gains [26]
- Design choices such as the decoupled strategy and the use of anchored endpoints to initialize future ego features proved critical to achieving the best results [30]

Group 7: Limitations and Future Directions
- The BEV world model does not leverage the generalization capabilities of foundation models, which could enhance performance in complex scenarios [41]
- Future research may explore integrating foundation models with planning to improve generalization while maintaining efficiency [41]
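The four-module pipeline and the closed-loop iteration read naturally as an alternating predict-plan loop. The sketch below is a minimal illustration under our own assumptions (the module interfaces, the plan=None convention, and the fixed iteration count are invented for clarity; the actual SeerDrive is a trained end-to-end network, not this explicit Python loop).

```python
# Illustrative control flow for a SeerDrive-style bidirectional loop.
# All module signatures are hypothetical; they only mirror the four
# modules the article names: feature encoding, future BEV world modeling,
# future perception planning, and iterative optimization.

def seerdrive_step(encode, world_model, planner, sensors, ego_state,
                   num_iters=3):
    # 1) Feature encoding: fuse multimodal sensor inputs and the vehicle
    #    state into structured BEV features.
    bev = encode(sensors, ego_state)

    # 2) Future BEV world modeling: first prediction of scene evolution,
    #    before any trajectory has been committed.
    future_bev = world_model(bev, plan=None)

    trajectory = None
    for _ in range(num_iters):
        # 3) Future perception planning: the decoupled strategy lets the
        #    current scene (bev) and the predicted scene (future_bev)
        #    guide planning separately.
        trajectory = planner(bev, future_bev)
        # 4) Iterative optimization: re-predict scene evolution under the
        #    chosen plan, closing the planning <-> evolution loop.
        future_bev = world_model(bev, plan=trajectory)
    return trajectory
```

The loop makes the bidirectional dependency explicit: the plan conditions the next scene prediction, and the refreshed prediction conditions the next plan, which is one way to picture why ablating either future perception planning or iterative optimization degrades PDMS.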