AAAI 2026 | Beihang University and the University of Tokyo Bridge AI's "Semantic Gap": How Does Process-Aware Video Understanding Find Its "State" Anchor?
机器之心 · 2025-12-06 01:15
Core Insights
- The article discusses TSS (Task-Step-State), a new framework for procedural video understanding developed by a team from Beihang University and the University of Tokyo, which addresses the semantic gap between abstract text instructions and concrete video actions [2][3]
- TSS introduces "State" as a visual anchor, allowing AI to better comprehend procedural activities such as cooking or repairing devices [2][3]

Data Challenges
- Existing methods for procedural video learning rely either on expensive manual annotations or on weak supervision from external knowledge bases, and both routes leave a semantic gap [2][5]
- The traditional two-layer Task-Step structure is too abstract; TSS extends it by generating a third semantic layer, "State", which provides visually grounded snapshots of each step [7][19]

Training Methodology
- TSS employs a progressive "Hierarchy Unfolding" training strategy designed to mirror cognitive processes: a U-shaped learning path that descends from Task through Step to State and then ascends back (a hypothetical sketch follows at the end of this summary) [9][10]
- This design emphasizes grounding in specific visual evidence, letting the model use detailed state information to refine its understanding of steps and, in turn, tasks [14][18]

Experimental Results
- The research team evaluated TSS on the COIN and CrossTask datasets, achieving significant performance improvements over state-of-the-art models [15][16]
- The results indicate that the introduction of the "State" layer and the progressive training strategy are the key drivers of the gains in procedural video understanding [19][21]

Conclusion
- The TSS framework demonstrates that explicitly modeling object state changes can effectively bridge the gap between natural language and the physical world, offering a new approach for building intelligent systems that understand both high-level planning and detailed execution [23]
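The article describes TSS only at a conceptual level, so the following is a minimal, hypothetical Python sketch of the two ideas summarized above: a three-layer Task → Step → State hierarchy and the U-shaped "Hierarchy Unfolding" ordering. All names here (`Task`, `Step`, `State`, `hierarchy_unfolding_schedule`) and the per-level training targets are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the Task-Step-State idea described in the article.
# Class names, the curriculum order, and the target strings are assumptions
# for illustration; the paper's actual TSS code is not shown in the article.
from dataclasses import dataclass, field
from typing import List

@dataclass
class State:
    """A visually grounded snapshot before/after a step (assumed structure)."""
    description: str            # e.g. "eggs cracked into bowl"

@dataclass
class Step:
    instruction: str            # e.g. "crack the eggs"
    states: List[State] = field(default_factory=list)

@dataclass
class Task:
    name: str                   # e.g. "make scrambled eggs"
    steps: List[Step] = field(default_factory=list)

def hierarchy_unfolding_schedule(task: Task):
    """Yield training targets in the U-shaped order the article describes:
    Task -> Step -> State (descend), then State -> Step -> Task (ascend).
    The exact objective at each level is an assumption."""
    yield ("task", task.name)                   # coarse: recognize the task
    for step in task.steps:
        yield ("step", step.instruction)        # finer: align each step
        for state in step.states:
            yield ("state", state.description)  # finest: ground states visually
    # ascend: refine step/task predictions conditioned on state evidence
    for step in reversed(task.steps):
        yield ("step_refined", step.instruction)
    yield ("task_refined", task.name)

if __name__ == "__main__":
    task = Task(
        name="make scrambled eggs",
        steps=[
            Step("crack the eggs", [State("whole eggs"), State("eggs in bowl")]),
            Step("whisk the eggs", [State("eggs in bowl"), State("beaten eggs")]),
        ],
    )
    for level, target in hierarchy_unfolding_schedule(task):
        print(f"{level:>13}: {target}")
```

Running this prints the coarse-to-fine targets on the way down and the state-conditioned refinements on the way back up, mirroring the U-shaped learning path the summary attributes to TSS.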