VideoCoF
VideoCoF: Bringing "Temporal Reasoning" into Video Editing, with Mask-Free High-Precision Editing and Long-Video Extrapolation
机器之心· 2025-12-23 04:15
Core Insights
- The article presents VideoCoF, a video editing framework that aims for high-precision edits without relying on masks, a dependency that limits existing models [2][4][28]
- VideoCoF uses a "See-Reason-Edit" approach inspired by large language models (LLMs) and reaches state-of-the-art (SOTA) results with only 50k training samples [5][14][28]

Group 1: Pain Points and Innovations
- Existing video editing models trade precision against generality: expert models are accurate but require masks, while general models need no masks but lack accuracy [3][7]
- VideoCoF introduces the Chain of Frames (CoF) mechanism, which restructures video editing into three stages (Seeing, Reasoning, and Editing) and helps the model link editing instructions to the video regions they refer to [6][8]

Group 2: Technical Mechanisms
- The framework uses a RoPE (Rotary Position Embedding) alignment strategy that lets the model handle videos at inference time that are longer than those seen in training, while keeping motion smooth and avoiding artifacts [11][16]
- VideoCoF is data-efficient: it achieves superior performance with only 50k video pairs, whereas baseline models require significantly larger datasets [12][17]

Group 3: Experimental Validation
- In experiments, VideoCoF achieved an instruction-following score of 8.97, ahead of ICVE (7.79) and VACE (7.47), indicating a better understanding of user instructions [14][19]
- VideoCoF's success ratio reached 76.36%, significantly higher than commercial models such as Lucy Edit (29.64%) and ICVE (57.76%) [18][19]

Group 4: Reasoning Frame Design
- The design of the reasoning frame is crucial: experiments showed that a progressive gray mask significantly improved instruction-following scores compared to static masks [21][26]
- Introducing the CoF mechanism and the RoPE design led to notable improvements in both fidelity and temporal consistency for long-video extrapolation [24]

Group 5: Practical Applications
- VideoCoF supports a range of editing tasks, including multi-instance removal, object addition, instance replacement, and localized style transfer [29]
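
To make the "progressive gray mask" from Group 4 more concrete, here is a minimal Python sketch of one way such reasoning frames could be built. It is not the VideoCoF implementation: the summary does not describe how the mask is rendered, so the linear opacity ramp, the gray value, the `max_alpha` parameter, and the function names `progressive_gray_frames` and `static_gray_frames` are all assumptions made for illustration.

```python
from typing import List, Tuple

Pixel = Tuple[float, float, float]   # RGB values in [0, 1]
Frame = List[List[Pixel]]            # H x W grid of pixels
Mask = List[List[bool]]              # True where the edit should happen

GRAY: Pixel = (0.5, 0.5, 0.5)        # assumed overlay colour


def blend(pixel: Pixel, overlay: Pixel, alpha: float) -> Pixel:
    """Alpha-blend `overlay` onto `pixel` (alpha=0 keeps the pixel unchanged)."""
    return tuple((1.0 - alpha) * p + alpha * o for p, o in zip(pixel, overlay))


def overlay_region(frame: Frame, mask: Mask, alpha: float) -> Frame:
    """Gray out the masked region of a frame with the given opacity."""
    return [
        [blend(px, GRAY, alpha) if m else px for px, m in zip(row, mask_row)]
        for row, mask_row in zip(frame, mask)
    ]


def progressive_gray_frames(
    frame: Frame, mask: Mask, num_frames: int, max_alpha: float = 0.75
) -> List[Frame]:
    """Hypothetical 'progressive' variant: opacity ramps up linearly per frame."""
    if num_frames < 1:
        raise ValueError("num_frames must be at least 1")
    return [
        overlay_region(frame, mask, max_alpha * k / num_frames)
        for k in range(1, num_frames + 1)
    ]


def static_gray_frames(
    frame: Frame, mask: Mask, num_frames: int, alpha: float = 0.75
) -> List[Frame]:
    """Hypothetical 'static' baseline: every frame uses the same opacity."""
    if num_frames < 1:
        raise ValueError("num_frames must be at least 1")
    return [overlay_region(frame, mask, alpha) for _ in range(num_frames)]
```

In this sketch, the progressive variant gives each successive reasoning frame a slightly stronger highlight of the region to be edited, while the static variant repeats one fixed overlay. Which rendering the authors actually used, and how many reasoning frames they generate, is not stated in the summary.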