VideoCoF: Bringing Temporal Reasoning into Video Editing, Achieving Mask-Free High-Precision Editing and Long-Video Extrapolation!
机器之心· 2025-12-23 04:15
Core Insights
- The article discusses the innovative video editing framework VideoCoF, which addresses the dilemma of achieving high precision without relying on masks, a common limitation in existing models [2][4][28]
- VideoCoF utilizes a "See-Reason-Edit" approach inspired by large language models (LLMs), allowing for effective video editing with only 50k training samples and achieving state-of-the-art (SOTA) results [5][14][28]

Group 1: Pain Points and Innovations
- Existing video editing models face a trade-off between high precision and general applicability: expert models require masks, while general models lack accuracy [3][7]
- VideoCoF introduces the Chain of Frames (CoF) mechanism, restructuring the video editing process into three stages (Seeing, Reasoning, and Editing), which strengthens the model's ability to relate editing instructions to video regions [6][8]

Group 2: Technical Mechanisms
- The framework incorporates a unique RoPE (Rotary Position Embedding) alignment strategy, enabling the model to handle longer videos during inference while maintaining smooth motion and avoiding artifacts [11][16]
- VideoCoF demonstrates remarkable data efficiency, achieving superior performance with only 50k video pairs, compared to baseline models that require significantly larger datasets [12][17]

Group 3: Experimental Validation
- In experiments, VideoCoF achieved an instruction-following score of 8.97, outperforming models such as ICVE (7.79) and VACE (7.47), indicating a superior understanding of user instructions [14][19]
- The success ratio of VideoCoF reached 76.36%, significantly higher than commercial models such as Lucy Edit (29.64%) and ICVE (57.76%) [18][19]

Group 4: Reasoning Frame Design
- The design of the reasoning frame is crucial; experiments showed that a progressive gray mask significantly improved instruction-following scores compared to static masks [21][26]
- The introduction of the CoF mechanism and the RoPE design led to notable improvements in both fidelity and temporal consistency for long-video extrapolation [24]

Group 5: Practical Applications
- VideoCoF showcases versatile editing capabilities, including multi-instance removal, object addition, instance replacement, and localized style transfer, demonstrating its potential for various video editing tasks [29]
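The RoPE alignment strategy mentioned above builds on rotary position embedding, whose key property is that attention scores depend only on *relative* position offsets; this is what makes position extrapolation to longer videos plausible. A minimal generic sketch of that property (illustrative only, not VideoCoF's actual implementation; `rope_rotate` is a hypothetical name):

```python
import math

def rope_rotate(vec, pos, base=10000.0):
    """Apply rotary position embedding to a vector at position `pos`.

    Each pair of dimensions (2i, 2i+1) is rotated by the angle
    pos / base**(i/d), so a relative offset between two positions
    becomes a fixed rotation difference. Generic sketch, not
    VideoCoF's code.
    """
    d = len(vec)
    out = [0.0] * d
    for i in range(0, d, 2):
        theta = pos / (base ** (i / d))
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out[i] = x * c - y * s
        out[i + 1] = x * s + y * c
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Same relative offset (2 positions apart) at different absolute
# positions yields the same dot product, which is why shifting or
# extending positions at inference can preserve local relationships.
q0 = rope_rotate([1.0, 0.0, 1.0, 0.0], pos=3)
k0 = rope_rotate([1.0, 0.0, 1.0, 0.0], pos=1)
q1 = rope_rotate([1.0, 0.0, 1.0, 0.0], pos=13)
k1 = rope_rotate([1.0, 0.0, 1.0, 0.0], pos=11)
assert abs(dot(q0, k0) - dot(q1, k1)) < 1e-9
```

How VideoCoF aligns these positions between reasoning frames and output frames is specific to the paper; the sketch only shows the relative-position invariance that such alignment strategies exploit.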
The Multimodal AI "Internal Friction" Problem Identified by Zhang Xiangyu: Peking University Has Found a Solution
36Kr· 2025-09-19 10:52
Core Insights
- The main issue in multimodal AI training is the internal conflict between understanding and generation capabilities, which often leads to performance degradation in one area when the other is improved [1][5]
- A new framework called UAE has been proposed to address the fundamental problem of conflicting training objectives between understanding and generation tasks, suggesting a unified approach instead of separate KPIs [3][5]

Group 1: Challenges in Multimodal AI
- Zhang Xiangyu highlighted that in unified multimodal model training, visual understanding and generation can coexist but rarely collaborate, leading to internal friction [1]
- Image generation requires intricate spatial planning, physical knowledge, and semantic reasoning, which a Transformer model struggles to handle in a single forward pass [1]
- The traditional approach of decoupling understanding and generation has produced models that coexist without true synergy or effective collaboration [9]

Group 2: The UAE Framework
- The UAE framework proposes a radical shift: eliminating separate KPIs and establishing a unified pipeline with a single quality-control standard [10]
- The framework draws inspiration from the classic auto-encoder, likening the understanding task to encoding and the generation task to decoding [11][15]
- UAE aims to ensure that the output image is a near-perfect reconstruction of the original input, thus aligning the objectives of the understanding and generation modules [17][18]

Group 3: Training Methodology
- UAE introduces a three-phase training strategy called Unified-GRPO, which emphasizes a "left-right loop, two-way reinforcement" approach to enhance collaboration between the understanding and generation modules [20]
- The first phase focuses on establishing basic communication between the two modules, ensuring that the generation module can reconstruct images from the understanding module's outputs [22][23]
- Subsequent phases involve specialized training for each module: the understanding module learns to generate detailed descriptions, and the generation module learns to execute complex instructions based on those descriptions [24][29]

Group 4: Performance Outcomes
- The UAE model has demonstrated significant improvements in generating detailed and accurate descriptions compared to other models, achieving higher scores across evaluation metrics [36][37]
- On the GenEval benchmark, UAE achieved a comprehensive score of 0.86, ranking first among unified models and particularly excelling in tasks that require precise understanding [38]
- The results indicate that with the right objectives and training methods, AI systems can discover more effective information representation and transmission strategies [38][39]
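The auto-encoder analogy at the heart of UAE can be made concrete with a toy: one set of weights plays the "understanding" encoder (input to compact code, like an image to a caption), another plays the "generation" decoder (code back to input), and both are trained against a single shared reconstruction objective rather than separate KPIs. This is a generic linear auto-encoder sketch under that analogy, not the UAE model itself; all variable names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_code = 8, 3  # code is smaller: the "caption" must compress
W_enc = rng.normal(scale=0.1, size=(d_code, d_in))  # "understanding" module
W_dec = rng.normal(scale=0.1, size=(d_in, d_code))  # "generation" module

X = rng.normal(size=(64, d_in))  # toy inputs standing in for images

def recon_loss():
    # The single unified objective: how well does decode(encode(X))
    # reproduce X?
    recon = (X @ W_enc.T) @ W_dec.T
    return float(np.mean((recon - X) ** 2))

init_loss = recon_loss()
lr = 0.05
for _ in range(500):
    code = X @ W_enc.T            # encode ("understand")
    err = code @ W_dec.T - X      # decode ("generate") minus target
    # Gradients of the shared mean-squared reconstruction loss flow
    # into BOTH modules, so improving one cannot silently hurt the other:
    g_dec = (err.T @ code) / len(X)
    g_enc = ((err @ W_dec).T @ X) / len(X)
    W_dec -= lr * g_dec
    W_enc -= lr * g_enc

final_loss = recon_loss()  # lower than init_loss: the modules cooperate
```

UAE's actual modules are large multimodal networks trained with the Unified-GRPO reinforcement procedure rather than plain gradient descent; the toy only illustrates why a shared reconstruction objective forces the encoder and decoder to cooperate.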
An End-to-End Model! GraphCoT-VLA: A VLA Model for Manipulation Tasks with Ambiguous Instructions
具身智能之心· 2025-08-13 00:04
Core Viewpoint
- The article introduces GraphCoT-VLA, an advanced end-to-end model designed to enhance robot operations under ambiguous instructions and in open-world conditions, significantly improving task success rates and response times compared to existing methods [3][15][37]

Group 1: Introduction and Background
- The VLA (Vision-Language-Action) model has become a key paradigm in robotic operations, integrating perception, understanding, and action to interpret and execute natural language commands [5]
- Existing VLA models struggle with ambiguous language instructions and unknown environmental states, limiting their effectiveness in real-world applications [3][8]

Group 2: GraphCoT-VLA Model
- GraphCoT-VLA addresses the limitations of current VLA models by incorporating a structured Chain-of-Thought (CoT) reasoning module, which enhances understanding of ambiguous instructions and improves task planning [3][15]
- The model features a real-time updatable 3D pose-object graph that captures the spatial configuration of robot joints and the topological relationships of objects in three-dimensional space, allowing for better interaction modeling [3][9]

Group 3: Key Contributions
- The novel CoT architecture enables dynamic observation analysis, interpretation of ambiguous instructions, generation of failure feedback, and prediction of future object states and robot actions [15][19]
- The model integrates a dropout-based mixed reasoning strategy to balance rapid inference and deep reasoning, ensuring real-time performance [15][27]

Group 4: Experimental Results
- Experiments demonstrate that GraphCoT-VLA significantly outperforms existing methods in task success rates and action fluidity, particularly in scenarios with ambiguous instructions [37][40]
- In the "food preparation" task, GraphCoT-VLA improved accuracy by 10% over the best baseline, while in the "outfit selection" task it outperformed the leading model by 18.33% [37][38]

Group 5: Ablation Studies
- The introduction of the pose-object graph improved success rates by up to 18.33%, enhancing the model's accuracy and action-generation fluidity [40]
- The CoT module significantly improved the model's ability to interpret and respond to ambiguous instructions, demonstrating enhanced task planning and future-action prediction capabilities [41]
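The real-time updatable 3D pose-object graph described above can be pictured as a small graph data structure: nodes carry 3D positions for robot joints and scene objects, and spatial relations are re-derived as those positions change. The sketch below is a hypothetical illustration of that idea; the class name, the single "near" relation, and the distance-threshold rule are assumptions for clarity, not the paper's actual structure:

```python
import math

class PoseObjectGraph:
    """Toy 3D pose-object graph: nodes are joints/objects with 3D
    positions; edges are spatial relations recomputed on demand.
    Illustrative sketch, not GraphCoT-VLA's implementation."""

    def __init__(self, near_threshold=0.3):
        self.nodes = {}                    # name -> (x, y, z)
        self.near_threshold = near_threshold

    def update(self, name, position):
        """Insert or move a node; supports real-time updates as the
        robot and objects move."""
        self.nodes[name] = position

    def edges(self):
        """Derive 'near' relations from the current 3D positions."""
        names = sorted(self.nodes)
        rel = []
        for i, a in enumerate(names):
            for b in names[i + 1:]:
                if math.dist(self.nodes[a], self.nodes[b]) < self.near_threshold:
                    rel.append((a, "near", b))
        return rel

g = PoseObjectGraph()
g.update("gripper", (0.10, 0.00, 0.50))
g.update("cup", (0.12, 0.05, 0.48))
g.update("plate", (0.90, 0.40, 0.00))
relations = g.edges()  # the gripper is near the cup, not the plate
```

In the actual model this topological information is fed to the CoT reasoning module so that an ambiguous instruction like "pick it up" can be resolved against what is currently reachable; the toy only shows the graph bookkeeping.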