Weakly Supervised Dynamic Scene Graph Generation
ICCV 2025 | Weakly Supervised Dynamic Scene Graph Generation via Temporal-Enhanced Relation-Aware Knowledge Transfer
机器之心· 2025-09-03 08:33
Core Viewpoint
- The article presents a new method for weakly supervised dynamic scene graph generation. It identifies the quality of existing object detection in dynamic scenes as the main limitation and proposes a temporal-enhanced, relation-aware knowledge transfer approach that improves both detection performance and scene graph quality [2][5][8].

Method Introduction
- The proposed method, TRKT, addresses the performance bottleneck in weakly supervised dynamic scene graph generation by enhancing object detection quality through a temporal-aware, relation-sensitive knowledge transfer mechanism [5][10].
- TRKT uses attention maps generated by object and relation decoders to refine the outputs of an external object detector, thereby improving the quality of the generated scene graphs [8][10].

Knowledge Transfer Mechanism
- The method consists of two main components: relation-aware knowledge mining and a dual-stream fusion module [10][15].
- Relation-aware knowledge mining generates attention maps that highlight object and interaction regions, while the dual-stream fusion module combines these attention maps with external detection results to refine object localization and confidence scores [10][19].

Experimental Results
- The method yields significant improvements over existing methods in object detection, with gains of 13.0% in average precision (AP) and 1.3% in average recall (AR) [25].
- On dynamic scene graph generation tasks, the method outperforms baseline models, improving on all evaluation metrics [25][26].

Ablation Studies
- Ablation experiments demonstrate the effectiveness of the individual components: the confidence boosting module (CBM) and the localization refinement module (LRM) contribute average precision improvements of 1.2% and 2.0%, respectively [28].
- The integration of these modules leads to a combined average precision increase of 2.8%, indicating that enhancements in bounding box accuracy and confidence scores complement each other [28].

Visualization Results
- Visual comparisons show that the proposed method generates more complete and accurate scene graphs than the baseline models, benefiting from the introduced temporal-enhanced relation-aware knowledge and the dual-stream fusion module [31].
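The dual-stream fusion idea described above can be sketched at a high level. The snippet below is a minimal illustrative example, not the paper's actual implementation: the function name, the fusion weights `alpha` and `beta`, and the attention-centroid shift heuristic are all assumptions standing in for what TRKT's confidence boosting (CBM) and localization refinement (LRM) modules do in spirit, i.e. raise the scores of boxes that overlap high-attention regions and nudge box locations toward those regions.

```python
import numpy as np

def fuse_attention_with_detections(boxes, scores, attn, alpha=0.5, beta=0.3):
    """Hypothetical sketch of a dual-stream fusion step.

    boxes:  (N, 4) array of [x1, y1, x2, y2] in pixel coordinates.
    scores: (N,) external-detector confidence scores in [0, 1].
    attn:   (H, W) relation-aware attention map in [0, 1] that highlights
            object and interaction regions.
    """
    H, W = attn.shape
    new_boxes = boxes.astype(float).copy()
    new_scores = scores.astype(float).copy()
    for i, (x1, y1, x2, y2) in enumerate(boxes):
        x1, y1 = max(int(x1), 0), max(int(y1), 0)
        x2, y2 = min(int(x2), W), min(int(y2), H)
        patch = attn[y1:y2, x1:x2]
        if patch.size == 0:
            continue
        # Confidence boosting (CBM-like): blend the detector score with the
        # mean attention inside the box, rewarding high-attention overlap.
        new_scores[i] = (1 - alpha) * new_scores[i] + alpha * patch.mean()
        # Localization refinement (LRM-like): shift the box toward the
        # attention centroid inside it.
        ys, xs = np.mgrid[y1:y2, x1:x2]
        total = patch.sum()
        if total > 0:
            cx = (xs * patch).sum() / total
            cy = (ys * patch).sum() / total
            bx, by = (x1 + x2) / 2, (y1 + y2) / 2
            dx, dy = beta * (cx - bx), beta * (cy - by)
            new_boxes[i] += [dx, dy, dx, dy]
    return new_boxes, new_scores
```

In this toy form, a box that partially covers a bright attention blob gains confidence and drifts toward the blob's centroid; the real modules operate on learned decoder attention and are trained end to end.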