机器之心
With 2026 approaching, have world models actually become more "worldly"?
机器之心· 2025-12-13 02:30
Core Viewpoint
- The recent launch of GWM Worlds and GWM Robotics by Runway pushes video generation towards an interactive "world simulation" paradigm, reigniting discussion of the definition and scope of "world models": interfaces for creation and interaction, simulators for training and evaluation, or cognitive frameworks for reasoning and decision-making [1].

Group 1: Evolution of World Models
- Over the past two years, world models have come to be considered on par with LLMs in the AGI landscape, moving from a narrow definition rooted in reinforcement learning to a broader understanding that includes generative modeling [4].
- Initially, world models were understood as an agent's internal model of its environment, predicting future states from current conditions and actions and thereby enabling internal simulation and decision-making [5].
- The engineering perspective defines a world model as a combination of three capabilities: compressing high-dimensional perception into usable representations, predicting future states over time, and using those predictions for planning and decision-making (a toy sketch of this loop follows this summary) [6].
- By 2024, the understanding of world models had expanded to general modeling of how the world evolves, with a trend from language generation to image generation and ultimately to 3D and world generation [6].
- The boundaries of the concept have become more ambiguous, with ongoing debate about the nature of representations, the incorporation of physical laws, and how input relationships are organized [6].

Group 2: Industry Layout and Trends
- Major companies are investing in world models, raising the question of whether they are enhancing their "data engines" or building new frameworks for "spatiotemporal cognition" [3].
- In February 2024, OpenAI referred to the video generation model Sora as a "world simulator," emphasizing its ability to learn the three-dimensional structure and physical laws of the real world [6].
- Concurrently, LeCun introduced V-JEPA, which predicts masked video segments in an abstract representation space, achieving higher training efficiency by discarding unpredictable information [6].
- The discourse has shifted from whether to build world models to how to model them, with debate over whether to abstract upward from the pixel level or to operate directly in abstract spaces [7].
- There is a recognition that existing approaches may capture only partial physical laws, suggesting that representations of individual objects and a priori laws of change across space and time are needed for a coherent world model [7].

Group 3: Definition and Ambiguity of World Models
- By 2025, world models are positioned alongside LLMs, with companies such as Google DeepMind, Meta, and Nvidia shifting focus from pure LLMs to world models, aiming at "Physical AI + superintelligence" amid stagnation in LLM progress [8].
- The distinction between world models and existing generative AI lies in the former's goal of constructing internal representations of environments, including physical, temporal, and spatial dimensions, that support planning and decision-making [9].
- The term "world model" has become ambiguous, referring variously to latent states within a system, game-like simulators for training agents, or any content pipeline capable of generating navigable 3D scenes [9].
- An analysis from Entropy Town in November 2025 categorized world models into three technical routes (interface, simulator, and cognitive framework), highlighting the ongoing ambiguity in the field [9].
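To ground the three-capability engineering definition mentioned above (compress, predict, plan), here is a toy sketch of that loop under stated assumptions. The WorldModel class, its latent dimensions, the reward function, and the random-shooting planner are illustrative placeholders, not any particular company's system.

```python
# Toy sketch of the "compress / predict / plan" view of a world model.
import random
from typing import List

class WorldModel:
    def encode(self, observation: List[float]) -> List[float]:
        # 1) Compression: map high-dimensional perception to a compact latent state.
        return observation[:4]  # placeholder "representation"

    def predict(self, latent: List[float], action: float) -> List[float]:
        # 2) Prediction: advance the latent state one step under an action.
        return [x + 0.1 * action for x in latent]

    def reward(self, latent: List[float]) -> float:
        # Task-specific score of a latent state (assumed, used only for planning).
        return -sum(abs(x) for x in latent)

def plan(model: WorldModel, observation: List[float],
         horizon: int = 5, candidates: int = 16) -> float:
    # 3) Planning: imagine rollouts internally and pick the best first action.
    latent0 = model.encode(observation)
    best_action, best_return = 0.0, float("-inf")
    for _ in range(candidates):
        actions = [random.uniform(-1, 1) for _ in range(horizon)]
        latent, total = latent0, 0.0
        for a in actions:
            latent = model.predict(latent, a)
            total += model.reward(latent)
        if total > best_return:
            best_action, best_return = actions[0], total
    return best_action

print(plan(WorldModel(), observation=[0.5, -0.2, 0.1, 0.9, 0.3]))
```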
Farewell to "blind confidence": CCD, a new SOTA for diffusion language model inference
机器之心· 2025-12-13 01:13
To address these issues, researchers from Huawei's Xiaoyi Hong Kong team, City University of Hong Kong, and the University of Hong Kong have jointly proposed a new decoding algorithm, Coherent Contextual Decoding (CCD). By fully exploiting the context augmentation available during the diffusion process, CCD theoretically corrects the "short-sightedness" of conventional DLM inference strategies, and its adaptive decoding scheme simultaneously delivers a 3.48x speedup and a 3.9% performance improvement across multiple open-source DLMs. The method not only works with any-order generation but also improves the semi-autoregressive block-wise decoding setting; the era of efficient inference for diffusion language models may have arrived (a toy sketch of this decoding pattern appears after this summary).

Research background

Since the beginning of this year, open-source diffusion language models, led by Dream and LLaDA, have shone, achieving general capabilities on par with same-size autoregressive LLMs and demonstrating the advantages of DLMs on global-planning and bidirectional-context-understanding tasks.

Diffusion Language Models (DLMs) are widely known for their distinctive "global planning" and parallel decoding capabilities, and have become one of the new paradigms in the LLM field. However, under any-order decoding they typically suffer from slow inference and logically incoherent generation.

Paper title: Beyond Confidence: Adaptive an ...
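To make the decoding setting concrete, below is a toy sketch of the iterative any-order unmasking loop that DLMs commonly use, where a few masked positions are committed per step. The confidence-times-neighbour-support score is only a hypothetical illustration of blending per-token confidence with contextual support; it is not the CCD rule from the paper.

```python
# Toy sketch of one step of any-order decoding in a diffusion LM.
import torch
import torch.nn.functional as F

def decode_step(logits: torch.Tensor, tokens: torch.Tensor, mask_id: int, k: int = 2):
    """logits: [seq, vocab]; tokens: [seq] with mask_id at undecided positions."""
    probs = logits.softmax(dim=-1)
    conf, cand = probs.max(dim=-1)                      # per-position confidence + candidate token
    is_masked = tokens == mask_id
    # Hypothetical context term: positions whose neighbours are already decoded
    # are treated as better supported by the surrounding context.
    decoded = (~is_masked).float()
    neighbour_support = F.avg_pool1d(
        decoded.view(1, 1, -1), kernel_size=3, stride=1, padding=1).view(-1)
    score = conf * (0.5 + 0.5 * neighbour_support)      # blend confidence with context support
    score[~is_masked] = float("-inf")                   # only choose masked slots
    chosen = score.topk(min(k, int(is_masked.sum()))).indices
    tokens = tokens.clone()
    tokens[chosen] = cand[chosen]                       # commit k tokens this step
    return tokens
```

Calling `decode_step` repeatedly until no `mask_id` positions remain reproduces the baseline parallel-unmasking loop that confidence-only decoders use; CCD's contribution, per the summary, is a principled replacement for the naive scoring term sketched here.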
Apple withdraws the RLAX paper at lightning speed: it used Google TPUs and Alibaba's Qwen, with Ruoming Pang (庞若鸣) among the authors
机器之心· 2025-12-13 01:13
Core Viewpoint
- The article discusses Apple's recently withdrawn paper on a scalable reinforcement learning framework called RLAX, which uses Google's TPUs and other cloud services, highlighting the company's engineering capabilities in AI infrastructure despite recent personnel changes [1][35].

Group 1: Paper Overview
- The paper, titled "RLAX: Large-Scale, Distributed Reinforcement Learning for Large Language Models on TPUs," was submitted on December 6 and quickly withdrawn after being made public [1][7].
- RLAX is designed for efficient execution of advanced reinforcement learning algorithms on large-scale distributed TPU clusters [12].

Group 2: Technical Contributions
- RLAX employs a parameter-server architecture that logically separates the training, inference, and validation components, giving more flexibility in resource allocation (a schematic sketch of this pattern follows the summary) [14].
- The framework supports preemptive scheduling, so resources can be reclaimed immediately for higher-priority tasks without crashing the training process [15].
- RLAX addresses key challenges in post-training reinforcement learning, offering programmable configuration options for managing on-policy and off-policy RL [16].

Group 3: Experimental Results
- In experiments, RLAX improved the pass@8 accuracy of the QwQ-32B model by 12.8% in 12 hours and 48 minutes using 1024 TPU v5p chips [24].
- The framework's development involved Google's TPUs, Amazon's AWS Lambda for testing, and a Chinese open-source model, showcasing a collaborative approach across different technologies [26].

Group 4: Author Background
- The paper lists several authors, including Kelvin Zou, who has transitioned to Meta, and Cheng Leong, a long-time Apple employee, indicating a shift in talent within the AI sector [8][9].
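As an illustration of the parameter-server pattern the summary describes (training, sampling/inference, and verification as logically separate roles), here is a minimal threaded sketch. The class names, the queue-based hand-off, and the stop-event "preemption" hook are assumptions for illustration only; they do not reflect RLAX's actual code or API.

```python
# Schematic parameter-server layout: a trainer updates weights, rollout workers
# ("samplers") pull the latest weights and generate trajectories, and the trainer
# consumes them. All names and protocols here are illustrative stand-ins.
import queue, threading

class ParameterServer:
    def __init__(self, weights):
        self._weights, self._version, self._lock = weights, 0, threading.Lock()
    def push(self, weights):
        with self._lock:
            self._weights, self._version = weights, self._version + 1
    def pull(self):
        with self._lock:
            return self._weights, self._version

def sampler(ps, rollouts, stop):
    while not stop.is_set():                       # "preemption": exit cleanly when asked
        weights, version = ps.pull()               # sampling pool stays near on-policy
        rollouts.put({"weights_version": version, "trajectory": [0, 1, 2]})

def trainer(ps, rollouts, steps=10):
    for step in range(steps):
        batch = rollouts.get()                     # consume generated trajectories
        new_weights = f"weights@{step}"            # stand-in for a gradient update
        ps.push(new_weights)                       # publish for the samplers

ps, rollouts, stop = ParameterServer("weights@init"), queue.Queue(maxsize=64), threading.Event()
threading.Thread(target=sampler, args=(ps, rollouts, stop), daemon=True).start()
trainer(ps, rollouts)
stop.set()
```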
HKU's open-source ViMax takes off: AI that writes, directs, and performs on its own
机器之心· 2025-12-12 10:06
Imagine that a single sentence of description is all it takes for AI to shoot a complete short drama for you. Soon, anyone may genuinely be able to be a director: no need to learn complex filming techniques, buy expensive equipment, or even cast actors. With a good idea, AI can bring it to life.

To turn this idea into reality, Professor Chao Huang's team at the University of Hong Kong has open-sourced the ViMax framework, which has earned 1.4k+ stars on GitHub and focuses on frontier exploration of Agentic Video Generation. Through multi-agent collaboration, ViMax achieves true "self-writing, self-directing, self-performing": full automation from creative ideation to finished output, bringing every step of traditional film and TV production into the AI world.

How capable is ViMax's "one-person film crew"? It works like a fully digital all-round team: an AI screenwriter writes the script, an AI director controls pacing and shot language, an AI cinematographer handles composition and visual presentation, and an AI editor polishes every detail. These AI collaborators discuss ideas, assign tasks, and coordinate with each other on their own (see the sketch after this passage). You only need to input an idea, and the AI can independently complete the entire production pipeline, outputting video content capable of earning thousands of likes.

In AI video production, we are witnessing an important shift from "clip generation" to "systematic production." This is not just a technical upgrade but a fundamental change in how creation happens.

Lab site: https://sites.goog ...
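As an illustration of the hand-off pattern described above (screenwriter to director to cinematographer to editor), here is a toy sketch. The agent classes, the Project/Shot data structures, and their outputs are hypothetical placeholders rather than ViMax's actual code, which in practice would call LLMs and video-generation models at each stage.

```python
# Toy sketch of a multi-agent production pipeline with sequential hand-offs.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Shot:
    description: str
    camera: str = "medium shot"

@dataclass
class Project:
    idea: str
    script: str = ""
    shots: List[Shot] = field(default_factory=list)

class Screenwriter:
    def run(self, p: Project) -> Project:
        p.script = f"Scene 1: {p.idea}. Scene 2: resolution."   # stand-in for an LLM call
        return p

class Director:
    def run(self, p: Project) -> Project:
        p.shots = [Shot(s.strip()) for s in p.script.split(".") if s.strip()]
        return p

class Cinematographer:
    def run(self, p: Project) -> Project:
        for i, shot in enumerate(p.shots):
            shot.camera = "wide shot" if i == 0 else "close-up"  # framing decisions
        return p

class Editor:
    def run(self, p: Project) -> Project:
        p.shots = [s for s in p.shots if s.description]          # trim empty shots
        return p

pipeline = [Screenwriter(), Director(), Cinematographer(), Editor()]
project = Project(idea="A robot learns to paint")
for agent in pipeline:
    project = agent.run(project)
print(project.shots)
```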
One prompt and the schlock rolls in: OpenAI negotiates appearance fees for 200+ top Disney IPs
机器之心· 2025-12-12 10:06
Reported by the 机器之心 editorial team

The AI copyright war is no longer about how to lock IP away so AI can never touch it, but about negotiating an appropriate appearance fee. How long, do you think, before Judy and Nick start swearing on camera?

As soon as the news broke, netizens erupted. For the next three years, any Sora user can freely play with these top Disney IP characters. Rather than making trouble, Disney has chosen to open the floodgates itself.

Here is what happened. Disney has just officially announced a $1 billion investment in OpenAI and signed a three-year partnership agreement authorizing Sora to use its IP for generating short-video content. Overnight, OpenAI gained the legal right to use more than 200 internationally recognized top-tier IPs, including Disney classics such as Mickey, Minnie, Cinderella, and The Little Mermaid, as well as beloved Pixar titles such as Toy Story, Inside Out, and Big Hero 6. Of course, the license covers only animated or illustrated versions and does not involve any live-action actors' likenesses or voices (that, after all, would be too hard and too thorny).

OpenAI gets the license and pockets $1 billion in new investment along the way. So what does Disney get? First, $1 billion is a modest sum for Disney, whose annual revenue exceeds $90 billion. Second, with a stake in OpenAI, Disney can bring these characters to the platforms where Gen Z and Gen Alpha gather. And then there are productivity tools. ...
A milestone moment: the first 100B diffusion language model arrives, with a technical report revealing the details
机器之心· 2025-12-12 04:31
Reported by 机器之心. Editors: 杜伟, 张倩

Who would have thought that diffusion language models (dLLMs), still a niche direction at the start of the year, have now been scaled to the hundred-billion-parameter level.

Recently, two new models appeared on HuggingFace: LLaDA2.0-mini and LLaDA2.0-flash. They come from a joint team formed by Ant Group with Renmin University of China, Zhejiang University, and Westlake University, and both adopt an MoE architecture. The former has 16B total parameters, while the latter reaches 100B total parameters, a scale never before seen in the diffusion-language-model field.

More encouragingly, the larger model really is stronger: across 47 benchmarks spanning knowledge, reasoning, coding, math, agents, and alignment, LLaDA2.0-flash scores 73.18 on average, on par with the strong AR (autoregressive) model Qwen3-30B-A3B-Instruct-2507 (73.60), with clear advantages on complex tasks such as coding (e.g., HumanEval, MBPP) and agents (BFCL).

For a long time, the autoregressive generation paradigm has dominated the large-model field; this method of generating the next token sequentially from front to back once carried high hopes. However, its inherent drawbacks have gradually become apparent: high computational cost for long-text generation, slow inference, and difficulty capturing the bidirectional ... between tokens
Runway drops a late-night bombshell: five major updates at once, and its first general world model arrives
机器之心· 2025-12-12 04:31
Core Insights
- Runway has made a significant announcement, introducing five major updates that showcase its ambitions in AI video and multimedia generation [1][3].
- The updates mark a shift from merely generating videos to simulating the physical world, a critical transition for the industry [4][34].

Group 1: Gen-4.5 Video Generation Model
- Gen-4.5 is Runway's latest flagship video generation model, featuring impressive image quality and introducing native audio generation and editing capabilities [6][9].
- The model achieves high physical accuracy and visual precision, with realistic object motion and fluid dynamics [9][10].
- Gen-4.5 supports multi-shot editing, allowing users to modify an initial scene and propagate the changes through the entire video [14][15].
- Despite these advances, Runway acknowledges that Gen-4.5 retains limitations common to video models, which are a key motivation for its world model research [15].

Group 2: General World Model (GWM-1)
- GWM-1 is Runway's first general world model, built on Gen-4.5 and using autoregressive, frame-by-frame prediction [18][19].
- The model allows user intervention depending on the application scenario, simulating future events in real time (a minimal sketch of this interaction pattern follows the summary) [19].
- GWM-1 ships in three variants: GWM Worlds for environment simulation, GWM Avatars for interactive video generation, and GWM Robotics for training robots with synthetic data [21][22].

Group 3: GWM Worlds
- GWM Worlds enables real-time environment simulation, turning static scenes into immersive, explorable spaces [23][24].
- The model maintains spatial consistency during exploration and responds accurately to user-defined physical rules [24][25].

Group 4: GWM Robotics
- GWM Robotics supports counterfactual generation, exploring different robotic trajectories and outcomes [26][27].
- It includes a Python SDK for generating videos from robot actions, augmenting training data without expensive real-world data collection [28].

Group 5: GWM Avatars
- GWM Avatars is an audio-driven interactive video generation model that simulates natural human movement and expressions [29][30].
- Its application potential spans personalized tutoring, customer support, training simulations, and interactive entertainment [31][32].

Conclusion
- Runway's updates mark a pivotal moment for the industry, transitioning from video generation to true world simulation and reflecting a deeper grasp of the physical world's underlying logic [34][35].
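Referring back to the GWM-1 bullets above: to make the "autoregressive, user-steerable" interface concrete, here is a minimal sketch of frame-by-frame rollout conditioned on per-step actions. The Frame type, the rollout helper, the action strings, and the toy predictor are illustrative assumptions only, not Runway's model or SDK.

```python
# Minimal sketch of autoregressive world rollout with per-step user intervention:
# the next frame is predicted from the frames so far plus a user action.
from typing import Callable, List

Frame = List[float]                      # placeholder for an image tensor

def rollout(predict_next: Callable[[List[Frame], str], Frame],
            first_frame: Frame,
            actions: List[str]) -> List[Frame]:
    frames = [first_frame]
    for action in actions:               # user can steer the world at every step
        frames.append(predict_next(frames, action))
    return frames

def toy_predictor(history: List[Frame], action: str) -> Frame:
    shift = 1.0 if action == "move_forward" else -1.0
    return [x + shift for x in history[-1]]   # stand-in for learned dynamics

frames = rollout(toy_predictor, first_frame=[0.0, 0.0],
                 actions=["move_forward", "turn_left"])
print(frames)
```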
The global RL+VLA paradigm: behind PI*0.6 lies technical groundwork from this Chinese company
机器之心· 2025-12-12 03:41
Core Insights
- The article discusses the significance of integrating Vision-Language-Action (VLA) models with reinforcement learning (RL) in Embodied AI, emphasizing the limitations of imitation learning and the need for more robust learning methods [1][2][4].

Group 1: Importance of VLA+RL
- VLA models aim to apply powerful Vision-Language Models (VLMs) to robot control, primarily through supervised fine-tuning (SFT) [2].
- Imitation learning alone is insufficient for handling novel situations, so RL is needed to improve robustness and persistence in task execution [4].

Group 2: Challenges in Applying RL to VLA
- Integrating RL with VLA faces three main challenges: environmental differences, model instability, and computational demands [6].
- Naively applying RL algorithms to large VLA models can cause catastrophic forgetting and training collapse, making it difficult to maintain performance [6].

Group 3: Solutions to VLA's RL Challenges
- The industry has proposed three types of solutions to these challenges, with a focus on internalizing high-value behaviors through SFT [7][13].
- The iRe-VLA model introduces a two-phase iterative learning process that alternates between online RL for exploration and supervised learning for consolidation (a schematic of this loop follows the summary) [10][15].

Group 4: iRe-VLA Model Architecture
- The iRe-VLA model consists of a VLM backbone for understanding images and instructions and an Action Head for translating features into control signals [11].
- Low-Rank Adaptation (LoRA) enables efficient training without full-model fine-tuning [12].

Group 5: Experimental Results and Analysis
- Extensive experiments in simulated environments and real-world scenarios demonstrate the effectiveness of iRe-VLA, with significant improvements in task success rates [26][30].
- iRe-VLA outperformed traditional methods, raising the success rate on benchmark tasks from 43% to 83% [30].

Group 6: Conclusion and Future Implications
- The article concludes that iRe-VLA offers a viable path to deploying large models for robot control while ensuring stability and continual learning [37][42].
- Future research directions include efficient exploration and learning of new skills under sparse rewards, as well as scalable RL algorithms for large VLA models [40].
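To clarify the alternation the summary describes, the following toy loop sketches the two phases: online RL that updates only a small action head while the backbone stays frozen, followed by supervised consolidation of the collected successes. Everything here (the ToyVLA class, the skill scalar, the update rules) is a hypothetical stand-in for the control flow, not the paper's implementation.

```python
# Schematic of an iRe-VLA-style two-phase loop with a runnable toy model.
import random

class ToyVLA:
    def __init__(self):
        self.backbone_trainable = False   # large VLM part (frozen during online RL)
        self.head_skill = 0.2             # small action head / LoRA part

    def act(self) -> bool:
        return random.random() < self.head_skill   # stand-in "success" probability

def rl_update(model: ToyVLA, succeeded: bool):
    # Phase 1: only the head moves during online RL, keeping the backbone stable.
    model.head_skill = min(1.0, model.head_skill + (0.05 if succeeded else 0.0))

def supervised_consolidation(model: ToyVLA, buffer: list):
    # Phase 2: the whole model (backbone unfrozen) internalises collected successes.
    model.backbone_trainable = True
    model.head_skill = min(1.0, model.head_skill + 0.01 * len(buffer))
    model.backbone_trainable = False

model, buffer = ToyVLA(), []
for iteration in range(3):
    for episode in range(20):                    # Phase 1: online RL exploration
        succeeded = model.act()
        rl_update(model, succeeded)
        if succeeded:
            buffer.append(episode)               # keep high-value trajectories
    supervised_consolidation(model, buffer)      # Phase 2: SFT-style consolidation
print(f"final success rate estimate: {model.head_skill:.2f}")
```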
Meta's "civil war" escalates: build "god-like AI" or defend the "social empire"?
机器之心· 2025-12-12 03:41
Core Viewpoint
- Meta is shifting its strategic focus from the "metaverse" to artificial intelligence, and is facing multiple internal challenges as a result [1].

Group 1: Internal Conflicts
- A newly formed top AI team at Meta is experiencing friction with existing core business departments over resource allocation, development goals, and cultural integration [2].
- Internal conflicts have escalated over AI priorities: long-tenured executives advocate using Instagram and Facebook data to improve social media and advertising, while the new AI team led by Alexandr Wang aims to develop advanced AI models without an immediate product focus [5][12].
- The AI team believes the existing executives' focus on social media improvements is hindering the development of cutting-edge AI models [5].

Group 2: Resource Allocation and Financials
- To support its AI ambitions, Meta is reallocating resources and significantly cutting the budget of Reality Labs, which oversees its VR, AR, and metaverse initiatives [8].
- Reality Labs has accumulated losses exceeding $70 billion since the end of 2020, and Meta plans to cut its budget by up to 30% (roughly $4 billion to $6 billion) next year, redirecting the funds to the AI team [11].
- Meta's projected AI spending this year is estimated at $66 billion to $72 billion, nearly equal to the total losses of its metaverse business in recent years [11].

Group 3: Strategic Challenges
- Meta's situation mirrors historical challenges faced by tech giants, such as Microsoft's failure to adapt to mobile operating systems, which cost it market dominance [17].
- The company is simultaneously fighting costly battles in the metaverse, the short-video market, and AI, diluting its strategic focus [19].
- The failure of Llama 4 has raised questions about whether resources devoted to the metaverse held back the AI team at a critical moment [19].

Group 4: Cultural and Organizational Dynamics
- Tensions persist between established executives and the new AI elite, with some employees arguing that resources should go to the profitable social media business [12].
- The departure of Yann LeCun over ideological differences underscores the intensity of the cultural shift as the organization changes direction [21].
- The outcome of Meta's internal struggles will determine whether it suffers a collapse similar to Google+ or reorganizes effectively and achieves a success akin to Google's Gemini project [22].
New work from NUS LV Lab | FeRA: dynamic routing based on "frequency-domain energy" breaks the static bottleneck of diffusion model fine-tuning
机器之心· 2025-12-12 03:41
Core Viewpoint
- The article introduces FeRA (Frequency-Energy Constrained Routing), a framework that addresses the limitations of existing static parameter-efficient fine-tuning (PEFT) methods for diffusion models by adding a dynamic routing mechanism based on frequency-energy principles [3][23].

Group 1: Research Background and Limitations
- Current PEFT methods such as LoRA and AdaLoRA use a static strategy, applying the same low-rank matrices at every time step; this misaligns the parameters responsible for structure with those responsible for detail and wastes compute [8][9].
- The research team identifies a pronounced low-frequency to high-frequency evolution pattern in the denoising process of diffusion models, which is not isotropic and has distinct phase characteristics [7][23].

Group 2: FeRA Framework Components
- FeRA consists of three core components (see the sketch after this summary):
  - A Frequency-Energy Indicator (FEI), which extracts frequency-energy distribution features in latent space using Gaussian difference operators [11].
  - A Soft Frequency Router, which dynamically computes the weights of different LoRA experts based on the energy signals provided by the FEI [12].
  - A Frequency-Energy Consistency Loss (FECL), which keeps the frequency-domain parameter updates aligned with the model's original residual error, improving training stability [13].

Group 3: Experimental Validation
- The team tested FeRA extensively on multiple mainstream bases, including Stable Diffusion 1.5, 2.0, 3.0, SDXL, and FLUX.1, focusing on style adaptation and subject customization tasks [19].
- On style adaptation, FeRA achieved optimal or near-optimal results in FID (image quality), CLIP Score (semantic alignment), and Style (MLLM scoring) across various style datasets [20].
- On the DreamBooth task, FeRA showed strong text controllability, executing specific prompts faithfully [21][26].

Group 4: Conclusion and Future Implications
- FeRA advances fine-tuning for diffusion models by aligning the tuning mechanism with the physical regularities of the generation process, providing a path to efficient, high-quality fine-tuning [23][27].
- The work sets new state-of-the-art (SOTA) results and offers insights for future fine-tuning on more complex tasks such as video and 3D generation [27].
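To make the routing idea concrete, here is a small sketch under stated assumptions: frequency energy is estimated with box-blur difference filters (a cheap stand-in for the Gaussian-difference operators the summary mentions), and a softmax over the two energies yields soft weights for a hypothetical structure expert and detail expert. The filter sizes, the temperature, and the two-expert split are illustrative assumptions, not the FeRA implementation.

```python
# Illustrative frequency-energy routing over two LoRA "experts".
import torch
import torch.nn.functional as F

def frequency_energy_indicator(latent: torch.Tensor) -> torch.Tensor:
    """latent: [B, C, H, W] -> [B, 2] energies (low-frequency, high-frequency)."""
    blur_small = F.avg_pool2d(latent, 3, stride=1, padding=1)   # mild smoothing
    blur_large = F.avg_pool2d(latent, 7, stride=1, padding=3)   # strong smoothing
    band = blur_small - blur_large                              # band-pass residue
    low = blur_large.pow(2).mean(dim=(1, 2, 3))                 # low-frequency energy
    high = band.pow(2).mean(dim=(1, 2, 3))                      # high-frequency energy
    return torch.stack([low, high], dim=-1)

def soft_route(latent: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
    """Convert energies into weights over [structure_expert, detail_expert]."""
    energy = frequency_energy_indicator(latent)
    return torch.softmax(energy / temperature, dim=-1)

x = torch.randn(2, 4, 32, 32)            # stand-in for a diffusion latent
weights = soft_route(x)
print(weights)                            # per-sample soft weights over the two experts
```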