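Generated videos keep producing physics bugs? VLM transfer plus token-level alignment makes combustion happen in the right place and collisions obey conservation of momentum | Accepted at CVPR 2026 with near-perfect scores
QbitAI (量子位) · 2026-03-19 07:09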
Core Viewpoint
- The article covers recent progress in generative video models, focusing on the ProPhy framework, which aims to improve physical understanding and spatial alignment in video generation, moving from visual imitation toward physical simulation [1][8][33].

Group 1: Current State of Generative Video Models
- Generative video models such as Wan and NVIDIA's Cosmos can create highly realistic dynamic scenes that appear to mimic the real world [1][2].
- Despite this visual realism, the models often lack a true understanding of physical principles, which leads to physical inconsistencies in the generated videos [3][6][10].

Group 2: Limitations of Existing Models
- Current models rely mainly on implicit learning and coarse, global physical category labels, which do not let them distinguish between different physical laws or how those laws play out over time [10].
- They also lack fine-grained spatial alignment, so they cannot accurately place physical events within the generated scene [10].

Group 3: Introduction of ProPhy
- ProPhy introduces a progressive physical alignment framework that gives video diffusion models layered physical understanding and spatial physical alignment [8][9].
- The framework lets a model determine not only which physical phenomena to present but also where in the video they should occur [8][9].

Group 4: Mechanism of ProPhy
- ProPhy uses a two-stage physical expert mechanism: the Semantic Expert Block (SEB) for macro-level understanding of physical structure, and the Refinement Expert Block (REB) for precise spatial alignment [13][14].
- SEB identifies potential physical phenomena from the text prompt, while REB dynamically assigns the most suitable physical expert to each spatial location [13][14].
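As a rough illustration of the two-stage routing described in Group 4, the following is a toy sketch and not ProPhy's actual implementation. All names (`semantic_stage`, `refinement_stage`, `expert_keys`), the three example experts, the hand-picked vectors, and the use of a log-prior to bias per-token routing are assumptions made for illustration only.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def semantic_stage(prompt_embedding, expert_keys):
    """Stage 1 (prompt level): one weight per physics expert, shared by all tokens."""
    return softmax([dot(prompt_embedding, key) for key in expert_keys])

def refinement_stage(token_features, expert_keys, prior):
    """Stage 2 (token level): pick one expert per spatial token.

    Each token's affinity to an expert is combined with the log of the
    stage-1 weight, so experts the prompt makes unlikely are penalized.
    """
    assignments = []
    for tok in token_features:
        logits = [dot(tok, key) + math.log(p) for key, p in zip(expert_keys, prior)]
        assignments.append(max(range(len(logits)), key=logits.__getitem__))
    return assignments

# Hypothetical experts and hand-picked toy vectors (assumptions, not from the paper).
EXPERTS = ["combustion", "collision", "fluid"]
expert_keys = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
prompt_embedding = [2.0, 0.5, 0.0]      # prompt is mostly about burning
token_features = [
    [3.0, 0.0, 0.0],                    # token that clearly looks like fire
    [0.0, 3.0, 0.0],                    # token that clearly looks like an impact
    [0.1, 0.0, 0.2],                    # ambiguous background token
]

prior = semantic_stage(prompt_embedding, expert_keys)
assignment = refinement_stage(token_features, expert_keys, prior)
print([EXPERTS[i] for i in assignment])
```

In this toy setup the ambiguous third token falls back to the expert the prompt favors, while tokens with strong local evidence keep their own expert. That is the intuition behind combining a prompt-level decision about what happens with a per-location decision about where it happens.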

Group 5: Experimental Results
- ProPhy shows significant improvements in physical correctness and semantic adherence, with a 19.7% increase in the joint metric on the VideoPhy2 benchmark [20][22].
- In dynamic performance evaluations, ProPhy improves the Dynamic Degree metric and overall quality scores, demonstrating its effectiveness at generating physically consistent videos [23].

Group 6: Implications and Future Directions
- ProPhy represents a shift from visual similarity to adherence to physical rules, a step toward a controllable physical world model [26][29].
- Future work may integrate continuous dynamics modeling and physics engines with generative models, potentially leading to a new form of AI capable of simulating how the world operates [34].