Think video can't be "photoshopped"? CVPR research from Boxin Shi's team at Peking University and 贝式计算: easily swap outfits or add a corgi in a video
机器之心·2025-06-24 09:31

Core Viewpoint
- The article discusses advances in video editing enabled by the VIRES method, which combines sketch and text guidance for video instance repainting, significantly improving editing efficiency and accuracy in complex scenes [2][10][31].

Group 1: VIRES Methodology
- VIRES supports editing operations such as repainting, replacing, generating, and removing video subjects, ensuring temporal consistency by leveraging prior knowledge from text-to-video models [2][16].
- The method incorporates a Sequential ControlNet with a standardized adaptive scaling mechanism to effectively extract structural layouts and capture high-contrast sketch details [2][11].
- The research team introduced a sketch attention mechanism within the DiT backbone to interpret fine-grained sketch semantics and inject them into the video editing process [2][14].

Group 2: Performance and Comparisons
- VIRES outperforms existing state-of-the-art (SOTA) models across multiple metrics, including visual quality (PSNR), spatial structure consistency (SSIM), frame motion accuracy (WE), inter-frame consistency (FC), and text description consistency (TC) [22][24].
- The research team compared VIRES against five advanced methods, and it achieved the best results in both objective evaluations and user studies [23][24].

Group 3: Dataset and Training
- A large-scale video instance dataset named VireSet was created, containing 86,000 video segments with continuous video masks, detailed sketch sequences, and high-quality text descriptions to support precise video instance repainting [6][8].
- The team improved the mask consistency of existing datasets by using pre-trained models to annotate intermediate frames, raising the masks' frame rate from 6 FPS to 24 FPS [8][12].
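The 6 FPS to 24 FPS mask upsampling described above can be sketched as follows. This is a minimal illustration, not the team's actual pipeline: the article says intermediate frames are annotated by pre-trained models, whereas this sketch simply repeats the nearest source frame as a stand-in to show how the timeline is densified by a 4x factor. The function name `upsample_mask_fps` is a hypothetical helper, not from the paper.

```python
import numpy as np

def upsample_mask_fps(masks: np.ndarray, src_fps: int = 6, dst_fps: int = 24) -> np.ndarray:
    """Densify a per-frame mask sequence from src_fps to dst_fps.

    `masks` has shape (T, H, W). The VIRES team reports annotating the
    intermediate frames with pre-trained models; here we just repeat the
    nearest source frame as a placeholder, which keeps the timeline
    aligned without inventing new mask content.
    """
    if dst_fps % src_fps != 0:
        raise ValueError("dst_fps must be an integer multiple of src_fps")
    factor = dst_fps // src_fps  # e.g. 24 // 6 = 4
    # np.repeat duplicates each frame `factor` times along the time axis
    return np.repeat(masks, factor, axis=0)

# Toy example: 3 mask frames at 6 FPS -> 12 frames at 24 FPS
low_fps = np.zeros((3, 4, 4), dtype=np.uint8)
high_fps = upsample_mask_fps(low_fps)
print(high_fps.shape)  # (12, 4, 4)
```

In the actual dataset construction, the repeated-frame placeholder would be replaced by model-predicted masks for each intermediate frame, which is what yields temporally consistent 24 FPS annotations.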
Group 4: Future Directions
- The research team is also exploring panoramic video generation with a new method called PanoWan, which aims to extend pre-trained text-to-video models to panoramic contexts while maintaining high-quality outputs [31].
