World Model
DeepMind Scientists Demystify Genie 3: How an Autoregressive Architecture Lets AI Construct Entire Worlds | Jinqiu Select
锦秋集· 2025-08-06 09:07
Core Viewpoint
- Google DeepMind has introduced Genie 3, a revolutionary general world model capable of generating highly interactive 3D environments from text prompts or images, supporting real-time interaction and dynamic modifications [1][2].

Group 1: Breakthrough Technology
- Genie 3 is described as a "paradigm-shifting" AI technology that could unlock a trillion-dollar commercial landscape and potentially become a "killer application" in the virtual reality (VR) sector [9].
- The technology integrates features of traditional game engines, physics simulators, and video generation models, creating a real-time interactive world model [9].

Group 2: Evolution of World Models
- The construction of virtual worlds has evolved from manual coding methods, exemplified by the 1996 Quake engine, to AI-generated models that learn from vast amounts of real-world video data [10].
- The ultimate goal is to generate any desired interactive world from a simple text prompt, providing diverse environments for AI training [10].

Group 3: Genie Iteration Journey
- The initial version of Genie was trained on 30,000 hours of 2D platform-game footage, demonstrating an early understanding of the physical world [11].
- Genie 2 achieved the leap to 3D with near real-time performance and improved visual fidelity, simulating real-world lighting effects [12].
- Genie 3 raises the resolution to 720p, enabling immersive experiences and real-time interaction [13].

Group 4: Key Features
- Genie 3 shifts the primary input from images to text prompts, allowing for greater creative flexibility [15].
- It supports diverse environments, long-term interactions, and prompt-controlled world events, which are crucial for simulating rare occurrences in scenarios like autonomous driving [15].

Group 5: Technical Insights
- Genie 3 maintains world consistency as an emergent property of its architecture, generating each frame while referencing previously generated events (a toy sketch of this loop follows this entry) [16].
- This causal generation method aligns with real-world time flow, enhancing the model's ability to simulate complex environments [16].

Group 6: Applications and Future Implications
- Genie 3 is positioned as a platform for training embodied agents, potentially leading to groundbreaking strategies in AI development [17].
- It allows for low-cost, safe simulations of various scenarios, addressing the scarcity of real-world training data [17].

Group 7: Creativity and Human Collaboration
- DeepMind scientists argue that Genie 3's reliance on high-quality prompts enhances human creativity, providing a powerful tool for creators [19].
- This technology may herald a new form of interactive entertainment, enabling users to collaboratively create and explore interconnected virtual worlds [19].

Group 8: Limitations and Challenges
- Genie 3 is still a research prototype with limitations, such as supporting only single-agent experiences and facing reliability issues [20].
- A cognitive gap remains in fully simulating human experience beyond the visual and auditory senses [20].

Group 9: Technical Specifications and Industry Impact
- Genie 3 runs on Google's TPU network, indicating significant computational demands, with training data likely sourced from extensive video content [21].
- The technology is expected to greatly impact the creative industry by simplifying the production of interactive graphics, though it will not simply replace traditional game engines [22].
Group 10: Closing Remarks
- Genie 3 represents a significant advancement in realistic world simulation, potentially bridging the long-standing "sim-to-real" gap in AI applications [23].
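The causal, trajectory-conditioned loop described in Group 5 can be made concrete with a short sketch. This is a minimal illustration under assumed interfaces; `WorldModel`, `Session`, and the event hook are hypothetical stand-ins, not DeepMind's actual API:

```python
from dataclasses import dataclass, field

# Hypothetical stand-in for the learned generator; not DeepMind's API.
@dataclass
class WorldModel:
    def next_frame(self, history: list, action: str, event: str | None) -> str:
        # A real model would attend over the whole trajectory here;
        # we just summarize the conditioning for illustration.
        return f"frame({len(history)}|act={action}|event={event})"

@dataclass
class Session:
    prompt: str
    model: WorldModel = field(default_factory=WorldModel)
    history: list = field(default_factory=list)  # full trajectory, not just the last frame

    def step(self, action: str, event: str | None = None) -> str:
        # Causal generation: each frame depends on everything generated so far,
        # which is what lets consistency emerge rather than being programmed in.
        frame = self.model.next_frame(self.history, action, event)
        self.history.append(frame)
        return frame

session = Session(prompt="a rainy street in Tokyo at night")
session.step("walk_forward")
session.step("turn_left", event="a delivery drone lands ahead")  # promptable world event
print(session.history)
```

The design point is that `history` carries the entire trajectory, so a revisited location can be re-rendered consistently without any explicit 3D scene representation.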
Late at Night, OpenAI, Google, and Others Update Multiple Models
第一财经· 2025-08-06 07:17
Core Insights
- The article discusses the recent product launches by major AI model companies, highlighting shifts in product strategies and advancements in AI capabilities [3][11].

Group 1: OpenAI Developments
- OpenAI has released two new open-source models, gpt-oss-120b with 117 billion parameters and gpt-oss-20b with 21 billion parameters, both utilizing the mixture-of-experts (MoE) architecture (a toy routing sketch follows this entry) [4][5].
- The gpt-oss-120b model can run on a single 80GB GPU, while gpt-oss-20b can operate on consumer devices with 16GB of memory, allowing local deployment on laptops and smartphones [5][6].
- OpenAI's new models have shown competitive performance in benchmark tests, with gpt-oss-120b scoring close to or exceeding the closed-source o4-mini model [5][6].

Group 2: Anthropic's Strategy
- Anthropic has shifted to a strategy of more frequent incremental updates, exemplified by the release of Claude Opus 4.1, which improves upon its predecessor in areas like coding and data analysis [6][7].
- In benchmark tests, Claude Opus 4.1 scored 74.5%, surpassing Opus 4's 72.5%, indicating enhanced coding capabilities [7].

Group 3: Google's Innovations
- Google introduced Genie 3, its first world model to support real-time interaction, building on the earlier Genie 1 and 2 [8][9].
- Genie 3 can simulate complex environments and interactions, generating consistent visuals for several minutes, a significant improvement over Genie 2 [9][11].
- Despite its advancements, Genie 3 still faces limitations, such as a restricted action space and challenges in simulating multiple agents in shared environments [11].
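A toy sketch of the top-k mixture-of-experts routing mentioned above may help explain the deployment numbers: total parameters stay large while only k experts' weights are exercised per token. Dimensions and weights below are illustrative assumptions, not OpenAI's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_experts, k = 16, 8, 2              # toy sizes; real models are far larger

gate_w = rng.standard_normal((d, n_experts))
experts = [rng.standard_normal((d, d)) for _ in range(n_experts)]  # toy expert weights

def moe_layer(x: np.ndarray) -> np.ndarray:
    """Route each token to its top-k experts; only those experts are evaluated."""
    logits = x @ gate_w                               # (tokens, n_experts)
    topk = np.argsort(logits, axis=-1)[:, -k:]        # indices of the k best experts
    sel = np.take_along_axis(logits, topk, axis=-1)
    w = np.exp(sel - sel.max(-1, keepdims=True))      # softmax over selected experts only
    w /= w.sum(-1, keepdims=True)
    out = np.zeros_like(x)
    for t in range(x.shape[0]):                       # per-token sparse dispatch
        for j, e in enumerate(topk[t]):
            out[t] += w[t, j] * (x[t] @ experts[e])
    return out

tokens = rng.standard_normal((4, d))
print(moe_layer(tokens).shape)  # (4, 16): full width out, but only k of 8 experts ran per token
```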
X @Demis Hassabis
Demis Hassabis· 2025-08-05 15:21
Technology & Innovation
- Google DeepMind introduces Genie 3, a groundbreaking world model for creating interactive environments from text prompts [1].
- Genie 3 enables the generation of playable environments from a single text prompt [1].
- The technology allows for the creation of diverse environments, ranging from photorealistic landscapes to fantasy realms [1].

Potential Applications
- The generated videos are not just for viewing but can be explored interactively [1].
- The possibilities for interactive and playable environments are described as endless [1].
Google Genie 3 - The Most Advanced World Simulator Ever...
Matthew Berman· 2025-08-05 14:02
Model Overview
- Google announced Genie 3, a general-purpose world model for generating diverse interactive environments [1][8].
- Genie 3 allows real-time interaction with improved consistency and realism compared to Genie 2 [12].
- The model generates 720p high-quality environments [3].

Technical Aspects
- For autoregressive generation, Genie 3 considers the entire previously generated trajectory, not just the previous frame (a toy attention sketch follows this entry) [15].
- Consistency in Genie 3 is an emergent capability resulting from training scale, not pre-programming [19].
- Genie 3 generates dynamic, rich worlds frame by frame based on the world description and user actions, unlike methods that rely on an explicit 3D representation [20].

Potential Applications
- World models like Genie 3 can be used for training robots and agents [9].
- The technology has potential applications in creating video games, movies, and television shows [9].
- Google positions world models as a key step towards AGI by providing AI agents with unlimited simulated environments for training [9][10].

Comparison with Previous Models
- Genie 3 demonstrates significant improvements in consistency, detail, and generation length compared to Genie 2 [22][23].
- Genie 3 allows for deeper world exploration than Genie 2 [23].

Interactive Features
- Users can prompt events in real time, adding elements to the scene [21].
- The model demonstrates realistic interactions, such as water moving out of the way of a jet ski and reflections in mirrors [6].
- The model can simulate actions like painting, with paint applied only when the brush touches the wall [29][30].
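A minimal sketch of how "conditioning on the entire trajectory" works mechanically: causal self-attention lets the newest frame attend to every earlier frame, not just the previous one. Shapes and weights are toy assumptions, unrelated to Genie 3's real architecture:

```python
import numpy as np

rng = np.random.default_rng(1)
T, d = 6, 8                      # 6 frame embeddings generated so far
frames = rng.standard_normal((T, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

def causal_attention(x: np.ndarray) -> np.ndarray:
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(d)
    mask = np.triu(np.ones((len(x), len(x)), dtype=bool), 1)
    scores[mask] = -np.inf                      # frame t sees frames 0..t, never the future
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    return w @ v

out = causal_attention(frames)
# The last row mixes information from ALL earlier frames, not just frame T-1,
# which is the mechanism by which a revisited location can stay consistent.
print(out[-1].shape)
```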
Jiang Shuqiang, Director of the CAAI Embodied Intelligence Technical Committee: World Models Are an Important Basis for Agents' Decision-Making
机器人圈· 2025-08-04 11:38
Core Viewpoint
- The core discussion revolves around the concept of embodied intelligence, emphasizing the intricate relationship between body, environment, and intelligence, and how these elements collectively contribute to the realization of intelligent systems [4].

Group 1: Embodied Intelligence
- Embodied intelligence is defined by three key elements: body, environment, and intelligence, which interact in complex ways to enable intelligent behavior [4].
- The structure and sensory capabilities of the body significantly influence how an intelligent agent perceives and interacts with the world, highlighting the importance of physical attributes such as height and limb structure [4].

Group 2: Large Models in Embodied Intelligence
- Training embodied large models requires the integration of visual, linguistic, and behavioral data, necessitating a unified approach to data, computing power, and algorithms [4].
- Data complexity is heightened for embodied large models because training must encompass multimodal information, including behavior, physical parameters, and tactile data [4].
- Challenges remain in the generalization of embodied large models in real physical spaces, particularly concerning data complexity and sensor differences [4].

Group 3: World Models
- World models serve as abstract representations of the real world, encompassing three-dimensional space, dynamic changes, object relationships, and memory, which are crucial for understanding and predicting environmental states (a minimal latent-dynamics sketch follows this entry) [5].
- The relationship between world models and large models, as well as their connection to three-dimensional space, presents areas for further exploration [5].
- Current research often relies on simulators to generate data, but aligning virtual environments with real-world physical parameters remains a significant challenge [5].
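The definition of a world model in Group 3 maps onto a standard latent-dynamics interface: encode an observation into an abstract state, predict how that state evolves under actions, and decode predictions back into observations. The linear toy dynamics below are illustrative assumptions only:

```python
import numpy as np

rng = np.random.default_rng(2)
obs_dim, lat_dim, act_dim = 12, 4, 2

Enc = rng.standard_normal((obs_dim, lat_dim)) * 0.1   # observation -> latent state
A = np.eye(lat_dim) * 0.9                             # latent transition dynamics
B = rng.standard_normal((act_dim, lat_dim)) * 0.1     # how actions influence the state
Dec = rng.standard_normal((lat_dim, obs_dim)) * 0.1   # latent -> predicted observation

def encode(obs: np.ndarray) -> np.ndarray:
    # Abstract the raw observation into a compact state (the "world model" memory).
    return obs @ Enc

def predict(z: np.ndarray, action: np.ndarray) -> np.ndarray:
    # Roll the world forward in latent space without touching reality.
    return z @ A + action @ B

def decode(z: np.ndarray) -> np.ndarray:
    # Reconstruct what the agent would expect to observe.
    return z @ Dec

obs = rng.standard_normal(obs_dim)
z = encode(obs)
for _ in range(3):                             # imagine 3 steps ahead, e.g. for planning
    z = predict(z, np.array([1.0, 0.0]))
print(decode(z).shape)                         # (12,): a predicted future observation
```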
Meta chief AI scientist Yann LeCun clarifies his role after the company hires another chief AI scientist
Business Insider· 2025-07-26 19:50
Core Insights
- Meta has appointed Shengjia Zhao, co-creator of ChatGPT and former lead scientist at OpenAI, as the chief scientist at its Superintelligence Labs, indicating a strategic move in the AI talent acquisition landscape [1][2].

Group 1: Leadership and Structure
- Shengjia Zhao will set the research agenda and scientific direction for Meta's Superintelligence Labs, working closely with CEO Mark Zuckerberg and Chief AI Officer Alexandr Wang [2].
- The formalization of Zhao's leadership role comes as Meta reports successful recruitment efforts and team assembly [2].
- Yann LeCun, who has been with Meta since 2013 and serves as the chief AI scientist for Meta's Fundamental AI Research (FAIR) lab, clarified that his role remains unchanged despite Zhao's appointment [3].

Group 2: Research Focus
- FAIR, established over a decade ago, focuses on advancing AI technology and led the release of the open-source large language model Llama in 2023 [8].
- The Superintelligence Labs will encompass FAIR and other teams, aiming to develop "personal superintelligence for everyone," as stated by Zuckerberg [9].
- LeCun is currently focused on developing a new model type, known as a world model, which he believes could eventually replace large language models [8].

Group 3: Collaboration and Future Directions
- Zhao's expertise in pioneering new scaling paradigms in AI research is expected to guide the scientific direction of Meta's AI initiatives [10].
- LeCun expressed enthusiasm about collaborating with Zhao to enhance the integration of new research into Meta's advanced models [10].
On One Side, Graduation Equals Unemployment; on the Other, Companies Can't Find People to Hire. It's So Hard...
自动驾驶之心· 2025-07-23 09:56
Core Insights
- The autonomous driving industry is experiencing a paradox: job openings are abundant, yet companies struggle to find suitable talent. This is attributed to a shift in market expectations and a focus on sustainable business models rather than rapid expansion [2][3].

Industry Overview
- Companies in the autonomous driving sector are now more cautious with their spending, prioritizing survival and the establishment of viable business models over aggressive hiring and expansion strategies. This shift is expected to lead to significant industry adjustments within the next 1-3 years [2][3].

Talent Demand
- There is unprecedented demand for "top talent" and "highly compatible talent" in autonomous driving. Companies are not necessarily unwilling to hire; they are looking for candidates with exceptional skills and closely relevant experience [4][3].

Community and Resources
- The "Autonomous Driving Heart Knowledge Planet" is the largest community in China focused on autonomous driving technology, established to provide resources and networking opportunities for professionals in the field. It has nearly 4,000 members and over 100 industry experts contributing to discussions and knowledge sharing [9][10].

Learning and Development
- The community offers comprehensive learning pathways covering the various subfields of autonomous driving, including perception, mapping, and AI model deployment, supporting both newcomers and experienced professionals in developing their skills [9][12][13].

Job Placement Support
- The community has established a direct referral mechanism with numerous autonomous driving companies, facilitating job placements for members, streamlining the hiring process, and connecting qualified candidates with potential employers [10][9].
Autonomous Driving Paper Express | World Models, End-to-End, VLM/VLA, Reinforcement Learning, and More
自动驾驶之心· 2025-07-21 04:14
Core Insights
- The article discusses advancements in autonomous driving research, particularly the Orbis model developed at the University of Freiburg, which significantly improves long-horizon prediction in driving world models [1][2].

Group 1: Orbis Model Contributions
- Orbis addresses the shortcomings of contemporary driving world models in long-horizon generation, particularly for complex maneuvers like turns, and introduces a trajectory-distribution-based evaluation metric to quantify these issues [2].
- It employs a hybrid discrete-continuous tokenizer that allows fair comparison between discrete and continuous prediction methods, demonstrating that continuous modeling (based on flow matching) outperforms discrete modeling (based on masked generation) in long-horizon prediction (a minimal flow matching loss sketch follows this entry) [2].
- The model achieves state-of-the-art (SOTA) performance with only 469 million parameters and 280 hours of monocular video data, excelling in complex driving scenarios such as turns and urban traffic [2].

Group 2: Experimental Results
- On the nuPlan dataset, Orbis achieved a Fréchet Video Distance (FVD, lower is better) of 132.25 for 6-second rollouts, significantly below Cosmos (291.80) and Vista (323.37) [6][7].
- In turn scenarios, Orbis again led with an FVD of 231.88, compared to 316.99 for Cosmos and 413.61 for Vista, showcasing its effectiveness in challenging driving conditions [6][7].

Group 3: LaViPlan Framework
- The LaViPlan framework, developed by ETRI, uses reinforcement learning with verifiable rewards to address the misalignment between visual, language, and action components in autonomous driving, achieving a 19.91% reduction in Average Displacement Error (ADE) on easy scenarios and 14.67% on hard scenarios of the ROADWork dataset [12][14].
- It emphasizes the transition from linguistic fidelity to functional accuracy in trajectory outputs, revealing a trade-off between semantic similarity and task-specific reasoning [14].

Group 4: World Model-Based Scene Generation
- The University of Macau introduced a world-model-driven scene generation framework that enhances dynamic graph convolution networks, achieving 83.2% Average Precision (AP) and a 3.99-second mean Time to Anticipate (mTTA) on the DAD dataset, both significant improvements [23][24].
- The framework combines scene generation with adaptive temporal reasoning to create high-resolution driving scenarios, addressing data scarcity and modeling limitations [24].

Group 5: ReAL-AD Framework
- The ReAL-AD framework, proposed by Shanghai University of Science and Technology and the Chinese University of Hong Kong, integrates a three-layer human cognitive decision-making model into end-to-end autonomous driving, improving planning accuracy by 33% and reducing collision rates by 32% [33][34].
- It features three core modules that enhance situational awareness and structured reasoning, leading to significant improvements in trajectory planning accuracy and safety [34].
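For readers unfamiliar with the flow matching objective underlying Orbis's continuous branch, here is a minimal sketch of the standard conditional flow matching loss (a generic textbook form, not the Orbis codebase; `toy_model` is a hypothetical velocity predictor):

```python
import numpy as np

rng = np.random.default_rng(3)

def flow_matching_loss(model, x1: np.ndarray) -> float:
    """x1: batch of target frames (B, D). Regress the velocity of a straight
    noise-to-data path; sampling later integrates this learned velocity field."""
    x0 = rng.standard_normal(x1.shape)            # noise endpoint of the path
    t = rng.uniform(size=(x1.shape[0], 1))        # random time along the path
    xt = (1 - t) * x0 + t * x1                    # linear interpolant between noise and data
    v_target = x1 - x0                            # constant true velocity of that path
    return float(np.mean((model(xt, t) - v_target) ** 2))

# Toy "model": a closure standing in for a neural velocity predictor.
W = rng.standard_normal((8, 8)) * 0.1
toy_model = lambda xt, t: xt @ W + t              # hypothetical; t broadcasts over features
print(flow_matching_loss(toy_model, rng.standard_normal((16, 8))))
```

Because the regression target is a dense velocity at every point along the path, the model receives a smooth, continuous training signal, one intuition for why this branch holds up better over long rollouts than masked discrete generation.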
L4 Industry Chain Tracking Series, Part 3: Recent Developments at Leading Robotaxi Companies (Technical Direction)
2025-07-16 06:13
Summary of Conference Call

Company and Industry
- The conference call primarily discusses advancements in the autonomous driving industry, focusing on a company involved in Level 4 (L4) autonomous driving technology.

Key Points and Arguments
1. Technological framework: The company's autonomous driving system uses a modular architecture spanning perception, prediction, planning, and control. The framework has evolved to incorporate advanced techniques like reinforcement learning and world models, although the core structure remains intact (an illustrative pipeline sketch follows this summary) [1][2][3].
2. Transition to large models: The industry is shifting from CNN architectures to transformer-based models. The company is gradually replacing its existing models with these new frameworks, which may take longer because its current systems already set a high performance baseline [3][4].
3. Data utilization: The company emphasizes the importance of both real and simulated data for model training. Real data dominates today, with plans to increasingly incorporate simulated data to address data shortages, especially for control models [8][9][10].
4. Learning techniques: Imitation learning has been used for scenarios where rule-based approaches fail, while reinforcement learning is applied in end-to-end (E2E) models; its share remains small, indicating a cautious approach to its implementation [11][12].
5. Operational deployment: The company has deployed autonomous vehicles in major cities like Beijing and Guangzhou, with plans to expand in Shenzhen and Shanghai. The current fleet consists of a few hundred vehicles [14][21].
6. Cost structure: Vehicle costs include hardware components such as multiple radars and cameras, with estimates suggesting that the total cost could be reduced to around 200,000 yuan [15][19].
7. Computational resources: The company faces challenges with computational capacity, particularly in integrating various models across different chips. The focus is on optimizing existing resources while planning for future upgrades [19][20].
8. Profitability goals: The company aims to break even with a fleet of over 10,000 vehicles by 2027 or 2028; current estimates suggest that achieving profitability may require a fleet size closer to 100,000 vehicles [26].
9. Market positioning: The company acknowledges competition from other players in the autonomous driving space, particularly in regulatory approvals and operational capabilities, and aims to maintain a competitive edge by leveraging its faster acquisition of commercial licenses [27][28].

Other Important Content
- The discussion highlights the ongoing evolution of the autonomous driving technology landscape, with a focus on balancing technological advancement against operational scalability. The company is committed to addressing challenges in data acquisition, model training, and fleet management to enhance its market position [22][23][30].
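Point 1's modular perception-prediction-planning-control architecture can be summarized as a typed pipeline in which each stage is independently replaceable. All interfaces and numbers below are illustrative assumptions, not the company's code:

```python
from dataclasses import dataclass

@dataclass
class Track:            # perception output: one tracked object
    pos: tuple
    vel: tuple

def perceive(sensor_frame) -> list[Track]:
    # Cameras plus radar/lidar fused into object tracks (stubbed here).
    return [Track(pos=(10.0, 2.0), vel=(-1.0, 0.0))]

def predict(tracks: list[Track], horizon_s: float = 3.0) -> list[tuple]:
    # Constant-velocity rollout as a stand-in for a learned prediction model.
    return [(t.pos[0] + t.vel[0] * horizon_s, t.pos[1] + t.vel[1] * horizon_s)
            for t in tracks]

def plan(ego_pos: tuple, futures: list[tuple]) -> str:
    # Rule-style fallback: brake if any predicted position crosses our lane.
    return "brake" if any(abs(y - ego_pos[1]) < 1.5 for _, y in futures) else "cruise"

def control(decision: str) -> dict:
    return {"throttle": 0.0, "brake": 0.6} if decision == "brake" else {"throttle": 0.3, "brake": 0.0}

# Each stage can be swapped (e.g., a transformer replacing a CNN perception model)
# without rewriting the rest -- the gradual upgrade path described in the call.
futures = predict(perceive(sensor_frame=None))
print(control(plan(ego_pos=(0.0, 2.0), futures=futures)))
```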
A Graduate Student from a Non-985/211 University, Feeling a Bit Lost in This Year's Job Search...
自动驾驶之心· 2025-07-14 14:04
Core Viewpoint
- The article emphasizes the importance of staying current with cutting-edge technologies in autonomous driving and embodied intelligence, highlighting the need for strong technical skills and knowledge in advanced areas such as large models, reinforcement learning, and 3D graphics [4][5].

Group 1: Industry Trends
- There is a growing demand for talent in robotics and embodied intelligence, with many startups receiving significant funding and showing rapid growth potential [4][5].
- Major companies are shifting their focus toward more advanced technologies, moving from traditional methods to end-to-end solutions and large models, indicating a technological evolution in the industry [4][5].
- The community aims to build a comprehensive ecosystem that connects academia, products, and recruitment, fostering a collaborative environment for knowledge sharing and job opportunities [6].

Group 2: Technical Directions
- The article outlines four key technical directions in the industry: visual large language models, world models, diffusion models, and end-to-end autonomous driving [9].
- It provides resources and summaries of various research papers and datasets related to these technologies, indicating a strong emphasis on research and development [10][17]-[32][35][36][38].

Group 3: Community and Learning Resources
- The community offers a variety of learning materials, including video courses, hardware, and coding resources, aimed at equipping individuals with the skills the evolving job market demands [6].
- There is a focus on creating a supportive environment for discussing the latest industry trends, technical challenges, and job opportunities, which is crucial for professionals looking to advance their careers [6].