World Models
At the Crossroads of Embodied Intelligence, This Forum Talked Through Data, Models, and Infra
机器之心· 2025-09-29 02:52
Core Viewpoint
- The field of embodied intelligence is drawing unprecedented attention, yet key issues remain unresolved, including data scarcity and divergent technical approaches [1][2][3]

Group 1: Data and Technical Approaches
- The industry is split into two camps: a "real machine" camp that relies on real-world data collection, and a "synthetic" camp that believes synthetic data is viable for model training [5][12]
- Galaxy General, representing the synthetic camp, argues that achieving generalization in embodied intelligence models requires trillions of data points, which cannot be sustained through real-world collection alone [8][9]
- The "real machine" camp challenges the claim that real-world data is prohibitively expensive, arguing that with sufficient investment data collection can scale effectively [12][14]

Group 2: Model Architecture
- Discussion of model architecture splits between end-to-end and layered approaches, with some experts advocating a unified model while others favor a hierarchical structure [15][19]
- The layered architecture is seen as more aligned with biological evolution, while the end-to-end approach is criticized for potential error amplification [19][20]
- The debate extends to VLA (Vision-Language-Action) models versus world models, with some experts arguing that VLA is currently more promising because of its data efficiency [21][22]

Group 3: Industry Trends and Infrastructure
- A scaling law in embodied intelligence is beginning to emerge, suggesting that expanding model and data scale could be effective [24]
- Deployment of embodied intelligence technologies is accelerating, with companies sharing experience in human-robot interaction and industrial applications [24][29]
- Cloud service providers, particularly Alibaba Cloud, are seen as crucial in supporting the infrastructure needs of embodied intelligence companies, especially as they transition to mass production [29][31]

Group 4: Alibaba Cloud's Role
- Alibaba Cloud has been preparing for the exponential growth in data and compute associated with embodied intelligence, building capabilities for large-scale data processing and model training [33][35]
- The company offers a comprehensive suite of cloud-based solutions to support both real and synthetic data production, improving efficiency and reducing cost [35][36]
- Alibaba Cloud's position as a model provider and its engineering capabilities are seen as significant advantages in the rapidly evolving embodied intelligence landscape [37][41]
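The "scaling law" observation above can be made concrete with a toy curve fit. The numbers below are invented for illustration (they are not data from the forum); the sketch only shows the standard trick of fitting loss ≈ a · N^(−b) by least squares in log-log space:

```python
import math

# Hypothetical (model_size, validation_loss) pairs -- illustrative only.
# A scaling law posits loss ≈ a * N**(-b), which is linear in log-log
# space: log(loss) = log(a) - b * log(N).
points = [(1e6, 3.2), (1e7, 2.1), (1e8, 1.4), (1e9, 0.95)]

xs = [math.log(n) for n, _ in points]
ys = [math.log(loss) for _, loss in points]

# Ordinary least squares for slope and intercept in log-log space.
n = len(points)
mx, my = sum(xs) / n, sum(ys) / n
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
intercept = my - slope * mx

a, b = math.exp(intercept), -slope
print(f"fitted: loss ~ {a:.2f} * N^(-{b:.3f})")
```

With these made-up points the fitted exponent comes out small and positive, which is the qualitative signature such a law would show on real embodied-intelligence training runs.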
A Hacker's Month-Long Grind Recreates DeepMind's World Model: 3 Million Parameters Run a Real-Time Interactive Pixel Game
36Kr · 2025-09-28 10:51
Core Insights
- The article discusses TinyWorlds, a world model created by X blogger anandmaj that replicates the core ideas of DeepMind's Genie 3 with only 3 million parameters, generating playable pixel-style environments in real time [1][6]

Group 1: Understanding World Models
- World models are neural networks that simulate the physical world by generating video, showing emergent capabilities similar to those of large language models (LLMs) [2][6]
- DeepMind's Genie 3 demonstrated that training on large-scale video data allows advanced behaviors to emerge without action-labeled data [2][6]

Group 2: Dataset Construction
- TinyWorlds' dataset consists of processed YouTube gameplay videos of titles such as Pong, Sonic, Zelda, Pole Position, and Doom, which define the environments the model can generate [7]

Group 3: Model Architecture
- The core of TinyWorlds is a space-time transformer that captures video information through spatial attention, temporal attention, and a feedforward network [10]
- The model employs an action tokenizer that automatically generates frame-to-frame action labels, enabling training on unlabeled data [18]

Group 4: Training Dynamics
- The dynamics model serves as the system's "brain," combining video and action inputs to predict future frames; initial performance limits were addressed by scaling the model up [21]
- Introducing masked frames and a variance loss during training helps the model make better use of the action signal [20]

Group 5: Performance and Future Prospects
- Despite having only 3 million parameters, TinyWorlds can generate interactive pixel-style worlds, although the output remains somewhat blurry and incoherent [23][24]
- The author suggests that scaling the model to hundreds of billions of parameters and incorporating diffusion methods could significantly enhance the quality of generated content [24]
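The space-time transformer described above factorizes attention over a video's patch grid: spatial attention mixes patches within a frame, and temporal attention mixes the same patch position across frames. The toy sketch below illustrates only that factorization; it uses identity projections and made-up values rather than TinyWorlds' learned weights:

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attend(seq):
    """Single-head self-attention over a list of d-dim vectors.
    Identity Q/K/V projections -- a toy stand-in, not TinyWorlds' code."""
    d = len(seq[0])
    out = []
    for q in seq:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in seq]
        w = softmax(scores)
        out.append([sum(wj * v[i] for wj, v in zip(w, seq)) for i in range(d)])
    return out

# A tiny "video": T frames x P patch tokens x d dims (fixed synthetic values).
T, P, d = 2, 3, 4
video = [[[float((t * 7 + p * 3 + i) % 5) for i in range(d)]
          for p in range(P)] for t in range(T)]

# Spatial attention: each frame's patches attend to one another.
spatial = [attend(frame) for frame in video]

# Temporal attention: the same patch position attends across frames.
temporal = [[None] * P for _ in range(T)]
for p in range(P):
    track = [spatial[t][p] for t in range(T)]  # one patch through time
    mixed = attend(track)
    for t in range(T):
        temporal[t][p] = mixed[t]
```

Factorizing full 3D attention into these two cheaper passes is what keeps a space-time transformer tractable on video; a feedforward layer (omitted here) would follow each attention pass.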
A Hacker's Month-Long Grind Recreates DeepMind's World Model: 3 Million Parameters Run a Real-Time Interactive Pixel Game
机器之心· 2025-09-28 10:29
Core Insights
- The article discusses TinyWorlds, a minimal world model inspired by DeepMind's Genie 3, capable of generating playable pixel-style environments with only 3 million parameters [1][9][32]

Group 1: Understanding World Models
- World models are neural networks that simulate the physical world by generating video, showing emergent capabilities when trained on large-scale video data [5][7]
- The challenge lies in the need for frame-by-frame action labels during training, which limits the use of unannotated video from the internet [5][6]
- Genie 1's solution was to train an action tokenizer that infers action labels, enabling training on vast amounts of unannotated video [5][6]

Group 2: Dataset Construction
- TinyWorlds' dataset consists of processed YouTube gameplay videos, which determine the range of environments the model can generate [11][12]

Group 3: Architecture and Tokenization Strategy
- TinyWorlds employs a space-time transformer to handle three-dimensional video data, capturing video information through a three-layer mechanism [15][17]
- The architecture combines spatial attention, temporal attention, and a feedforward network to extract higher-level features [21][22]
- The video tokenizer compresses videos into tokens, while the action tokenizer predicts actions between frames, allowing training on unannotated data [24][26]

Group 4: Training the World Generator
- The dynamics model serves as the system's "brain," predicting future frames from video and actions, with performance improving significantly as model size increases [30][32]
- Despite its 3 million parameters, TinyWorlds can generate interactive pixel-style worlds, though the output remains somewhat blurry and incoherent [32]
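The action tokenizer described above assigns a discrete latent action to each consecutive frame pair, which is what makes unannotated video trainable. A heavily simplified stand-in for that idea is nearest-neighbour quantization of the frame difference against a small action codebook; the codebook and frames below are hypothetical, and the real tokenizer is learned end-to-end rather than hand-written:

```python
# Hypothetical sketch of the action-tokenizer idea: map each consecutive
# frame pair to a discrete action index via nearest-neighbour lookup in a
# codebook of frame-difference prototypes. Illustrative only.
def quantize_action(prev_frame, next_frame, codebook):
    diff = [b - a for a, b in zip(prev_frame, next_frame)]
    def dist(code):
        return sum((d - c) ** 2 for d, c in zip(diff, code))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))

codebook = [
    [0.0, 0.0],   # action 0: "no-op"
    [1.0, 0.0],   # action 1: "move right"
    [-1.0, 0.0],  # action 2: "move left"
]

# Toy 2-dimensional "frames" standing in for encoded video frames.
frames = [[0.0, 0.0], [1.1, 0.1], [1.1, 0.1], [0.0, 0.0]]
actions = [quantize_action(frames[t], frames[t + 1], codebook)
           for t in range(len(frames) - 1)]
print(actions)  # prints [1, 0, 2]
```

The inferred action sequence can then be fed to the dynamics model alongside video tokens, exactly the role the learned tokenizer plays in the pipeline the article describes.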
Meta Bets on an "Android-Style" Robot Platform: Tens of Billions of Dollars for Universal Software
Huan Qiu Wang Zi Xun· 2025-09-28 04:24
Group 1
- Meta's CTO Andrew Bosworth announced that humanoid robots have been elevated to a strategic priority on par with augmented reality (AR) [1]
- The company plans to invest "tens of billions" of dollars in a universal software platform for humanoid robots, aiming to become the "Android" of the robotics industry [1][2]
- Meta does not intend to mass-produce hardware; instead, it will follow Google's open approach in the smartphone sector, allowing any compliant robot body to run Meta's operating system [2]

Group 2
- Bosworth stressed that the main challenge lies in software rather than hardware: current humanoid robots can run and perform flips but struggle with dexterous manipulation [2]
- To address fine motor skills, Meta established a "Super Intelligent AI Lab" earlier this year to build a "world model" that simulates real physical laws [2]
- The model aims to give robots spatial awareness, force-control prediction, and real-time decision-making, compensating for the limits of traditional sensor-feedback systems [2]
Meta CTO: Humanoid Robots Are the Next "AR-Level Bet," and the Bottleneck Is Software
Xin Lang Cai Jing· 2025-09-27 06:46
Core Insights
- Meta's Chief Technology Officer Andrew Bosworth announced that a robotics research program was initiated earlier this year under Mark Zuckerberg's direction, emphasizing that "hardware is not the bottleneck, the bottleneck is software" [1]

Group 1
- The program's goal is to develop a "world model" that helps robots achieve dexterous arm movements through software simulation [1]
- The program may later expand to more complex movements and tasks [1]
Ten Key Terms for the AI Industry in 2025
机器人圈· 2025-09-26 09:29
Core Insights
- The 2025 Artificial Intelligence Industry Conference highlighted ten key trends in AI, emphasizing the convergence of technology, applications, and ecosystems toward a clearer vision of a smart-native world [1]

Group 1: Foundation Super Models
- In 2025, foundational models and reasoning models are advancing in tandem, with overall capability up more than 30% between late 2024 and August 2025 [3][4]
- Key features of leading large models include integrated thinking and non-thinking modes, stronger understanding and reasoning, and built-in agent capabilities for real-world applications [4][6]
- The emergence of foundational super models simplifies user interaction, improves workflow precision, and raises new requirements for data supply [6]

Group 2: Autonomous Intelligent Agents
- Highly encapsulated agent products are unlocking the potential of large models, outperforming single models on complex tasks [9][10]
- Current agents still have substantial room for improvement, particularly in long-duration task execution and interconnectivity [12]

Group 3: Embodied Intelligence
- Embodied intelligence is moving from the laboratory into real-world applications, with models deployed in practical scenarios [15][16]
- Challenges remain in data quality, model generalization, and hardware-software coordination for effective task execution [18]

Group 4: World Models
- World models are emerging as a core pathway to artificial general intelligence (AGI), with capabilities spanning data generation, action interpretation, environment interaction, and scene reconstruction [21][22]
- Their development faces challenges including unclear definitions, diverse technical routes, and limited application scope [22]

Group 5: AI Reshaping Software
- AI is transforming the software development lifecycle, with sharp increases in token usage for programming tasks and the arrival of advanced AI tools [25][28]
- The software developer's role is evolving toward more complex responsibilities, giving rise to "super individuals" [28]

Group 6: Open Intelligent Computing Ecosystem
- The intelligent computing landscape is shifting toward open source, fostering collaboration and innovation across sectors [30][32]
- Software-hardware synergy is improving, with domestic hardware reaching performance parity with leading systems [30]

Group 7: High-Quality Industry Data Sets
- The focus of AI data set construction is shifting from general-purpose to high-quality industry-specific data sets, addressing critical quality issues [35][38]
- New data supply chains are needed to support advanced techniques such as reinforcement learning and world models [38]

Group 8: Open Source as Standard
- Open-source initiatives are reshaping the AI landscape, with broad adoption of domestic open-source models and a growing base of active developers [40][42]
- The business model is evolving toward "open-source free + paid high-level services," driving demand for cloud services and chips [42]

Group 9: Mitigating Model Hallucinations
- Hallucination in large models is becoming a significant barrier to application, with ongoing research into mitigation strategies [44][46]
- Approaches under exploration include improving data quality, model training, and user-side testing to reduce hallucination rates [46]

Group 10: AI as an International Public Good
- Global AI development is uneven, requiring international cooperation to promote equitable access to AI technologies [49][51]
- Strategies are being implemented to address cross-border compliance and data-flow challenges, aiming to make AI a genuinely shared international public good [51]
Putting a "Running Code World" Into AI: Meta Open-Sources the First Code World Model, Teaching AI to Think Like a Programmer
36Kr · 2025-09-25 13:02
Core Insights
- Meta's FAIR team has launched the Code World Model (CWM), a large language model (LLM) with 32 billion parameters and a context length of up to 131k tokens, aimed at integrating "world model" concepts into code generation and reasoning [1][2][3]
- CWM is designed not only to write code but also to simulate code execution, reason about program state, and detect and fix bugs on its own, deepening the model's understanding of code execution [2][3]

Training Phases
- CWM's training is divided into three main phases:
  - Pre-training on 8 trillion tokens, of which roughly 30% are code-related [3][4]
  - Mid-training, which adds 5 trillion tokens of world-modeling data and extends the context length to 131k tokens [4][6]
  - Post-training (SFT + RL): 100 billion tokens for instruction-following and reasoning, followed by large-scale multi-task reinforcement learning on 172 billion tokens [4][10]

Data Utilization
- CWM's world-model capabilities are driven by two main types of mid-training data:
  - Execution traces from Python, which teach the model how running code alters local state [6][8]
  - Interaction trajectories from an automated agent executing tasks in repositories, totaling around 3 million trajectories collected from 10.2k images and 3.15k repositories [9]

Performance Metrics
- In benchmark tests, CWM performed strongly: 65.8% pass@1 on SWE-bench Verified with test-time scaling enabled, plus notable results on LiveCodeBench (68.6%), Math-500 (96.6%), and AIME 2024 (76.0%) [10][12]
- CWM is competitive with larger or closed-source LLMs, approaching GPT-4 levels, though it has limitations in certain editing formats and multi-language scenarios [12]

Industry Reception
- The release has drawn significant attention, with Meta's AI researchers actively promoting it and highlighting its potential impact on software development [13][15]
- While the open-sourcing of CWM's training checkpoints is praised as useful for academic and engineering replication, there are concerns about the model's computational demands and the need for practical testing in real development environments [15]
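The Python execution traces mentioned under Data Utilization record how each executed line changes local state. Meta's actual data pipeline is not described in the article, but the same kind of signal can be collected with Python's standard `sys.settrace` hook, as this illustrative sketch shows:

```python
import sys

# Illustrative only: collect a per-line execution trace of one function,
# capturing (line number, snapshot of local variables) before each line runs.
def collect_trace(fn, *args):
    trace = []
    def tracer(frame, event, arg):
        if event == "line" and frame.f_code is fn.__code__:
            trace.append((frame.f_lineno, dict(frame.f_locals)))
        return tracer
    sys.settrace(tracer)
    try:
        fn(*args)
    finally:
        sys.settrace(None)
    return trace

def running_sum(xs):
    total = 0
    for x in xs:
        total += x
    return total

trace = collect_trace(running_sum, [1, 2, 3])
for lineno, local_vars in trace:
    print(lineno, local_vars)
```

Serializing such (line, state) pairs alongside the source code yields exactly the "how execution alters local state" supervision the summary attributes to CWM's mid-training data.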
Is Code Generation About to Change? After Rumors of Being Sidelined, Yann LeCun Returns With a 32-Billion-Parameter Open-Source World Model
AI前线· 2025-09-25 08:04
Core Viewpoint
- The article discusses Meta's release of the Code World Model (CWM), which aims to enhance code generation by building in a deeper understanding of code execution, addressing the limitation of earlier models that could generate syntactically correct code yet fail at execution [4][10]

Group 1: Model Overview
- CWM is the first open-source code world model, with 32 billion parameters, designed to advance code-generation research based on world models [4][5]
- Unlike traditional models trained on static code, CWM incorporates dynamic interaction data from Python interpreters and Docker environments to improve its understanding of and reasoning about code [7][14]
- The model can simulate step-by-step code execution, understanding how variables change and what feedback the program receives [7][10]

Group 2: Performance Metrics
- CWM scored 65.8% on SWE-bench Verified, outperforming all other open-source models of similar size and approaching GPT-4 levels [8]
- It scored 68.6% on LiveCodeBench, 96.6% on Math-500, and 76.0% on AIME 2024, showing strong performance across benchmarks [8]

Group 3: Training Methodology
- CWM's training involved three key phases (pre-training, mid-training, and post-training) using supervised fine-tuning (SFT) and reinforcement learning (RL) [15][16]
- The model was pre-trained on 8 trillion tokens, then mid-trained on an additional 5 trillion tokens of code world-modeling data, enhancing its contextual understanding [15][16]

Group 4: Industry Context and Implications
- The release of CWM marks a significant step in Meta's AI strategy, especially following the restructuring of its AI business [5][23]
- The model's development reflects Meta's effort to balance open-source initiatives with commercial interests amid organizational change [26]
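The pass-rate figures cited above belong to the pass@k family of metrics. The function below implements the widely used unbiased estimator from the HumanEval/Codex line of work: given n sampled solutions of which c pass, it estimates the probability that at least one of k random draws passes. Whether these particular benchmarks compute their scores exactly this way is not stated in the article:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k).
    n = samples generated, c = samples that pass, k = draws."""
    if n - c < k:
        # Fewer failures than draws: at least one success is guaranteed.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 3 of 10 generated solutions pass -> pass@1 estimate is 0.3.
print(round(pass_at_k(10, 3, 1), 6))  # prints 0.3
```

For k = 1 the estimator reduces to the plain pass fraction c/n, which is why single-sample leaderboard numbers like CWM's 65.8% read directly as "share of tasks solved on the first try."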
AI Surges Through the Auto Industry: "Intelligent Wheeled Life Forms" Are on the Way
Hua Xia Shi Bao· 2025-09-25 07:58
Core Insights
- The automotive industry is on the brink of a transformation driven by artificial intelligence, moving from traditional vehicles to "intelligent wheeled life forms" that interact with users and adapt to their needs [1][2][4]

Industry Trends
- The Global AI Technology Conference coincided with the release of the State Council's document on deepening the integration of AI with the real economy, which sets targets for significant advances by 2027 [2]
- Industry leaders emphasize shifting the focus from hardware specifications and price wars to vehicles that can think, learn, and collaborate within smart-city traffic networks [2][4]

Technological Developments
- Conference discussions highlighted AI's role in transforming the automotive landscape, with leaders proposing that future vehicles will communicate with traffic systems to optimize travel efficiency [4][6]
- A report from the Automotive Home Research Institute identified five core trends shaping China's electric vehicle market, including widespread adoption of advanced driver-assistance systems and the emergence of RoboTaxi services [6][7]

Consumer Behavior Changes
- Consumer perception has shifted markedly: the share of users who see "intelligence" as the core advantage of electric vehicles rose from 30% to 73% over three years [7]
- The consumer-vehicle relationship is evolving into a "two-way selection," with consumers demanding rigorous testing of intelligent features before purchase [8]

Safety and Ethical Considerations
- Despite the advances, one report found that 85% of tested vehicles required human intervention during assisted driving, underscoring the critical need for safety and reliability in AI-driven vehicles [8]
- Industry leaders have called for core technological breakthroughs that expand from "single-vehicle intelligence" to "industry-wide intelligence," while maintaining safety and ethical standards [8][9]

Company Strategies
- Automotive Home is leveraging its data assets and self-developed models to enhance both consumer and business services, aiming for a dual upgrade of user experience and ecosystem services [9]
- Integrating AI into vehicles is seen as essential for modern automotive products and a critical competitive advantage for all market participants [9]
Zhou Hongyi: Language Matters Most; Master It and Everything Else Follows
Xin Lang Ke Ji· 2025-09-24 05:09
Core Insights
- A discussion between Luo Yonghao and Zhou Hongyi emphasizes the importance of language in understanding and developing world models in artificial intelligence [1]
- Zhou Hongyi critiques the focus on world models by figures such as Yann LeCun of Meta and Fei-Fei Li, arguing that the key to progress in AI lies in mastering language [1]
- The recent launch of Google's "nano banana" product showcases graphics understanding that goes beyond visual perception by integrating extensive knowledge [1]

Summary by Categories

Language and AI Development
- Zhou Hongyi asserts that language is crucial for communication, knowledge transfer, logical reasoning, and describing the world, all of which are essential for building effective world models [1]
- He attributes slow progress in AI to a failure to grasp the significance of language, which serves as the key to human knowledge and reasoning [1]

Technological Advancements
- Google's "nano banana" is highlighted as a significant breakthrough, demonstrating graphic understanding that integrates knowledge beyond visual capability [1]
- Advances across music, video, and visual models are linked to breakthroughs in language comprehension [1]