Chain-of-Thought (思维链)
VideoCoF: Bringing "Temporal Reasoning" into Video Editing, Achieving Mask-Free High-Precision Editing and Long-Video Extrapolation!
机器之心· 2025-12-23 04:15
The model and code are already open-sourced, the method edits a video in 4 steps, and the VideoCoF-50k training data is expected to be released this week. The first author is 杨向鹏, a PhD student at UTS whose research focuses on video generation and world models; the second author is 谢集, a fourth-year undergraduate at Zhejiang University working on unified multimodal large models and video generation. The corresponding author is Prof. 吴强, whose research interests are computer vision and pattern recognition.

Existing video editing models often face a "can't have it both ways" dilemma: expert models are precise but depend on masks, while general-purpose models are mask-free but localize edits poorly. A research team from the University of Technology Sydney and Zhejiang University proposes VideoCoF, a new video editing framework. Inspired by LLM Chain-of-Thought, it follows a "see - reason - edit" pipeline (sketched below), reaches SOTA results on multiple tasks with only 50k training samples, and supports long-video extrapolation.

Pain point: the dilemma between precision and generality. In the AIGC era, video editing has made great strides, but one clear pain point remains:

Paper: https://arxiv.org/abs/2512.07469
Project page: https://videocof.github.io/
Code / models: https://github.com/knightyxp/VideoCoF De ...
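The digest describes the "see - reason - edit" pipeline only at a high level, so here is a minimal, hypothetical sketch of what such a chain-of-thought editing loop could look like. All names (`encode_video`, `reason`, `generate`, `EditPlan`) are illustrative assumptions, not the released VideoCoF API; see the repository above for the actual interface.

```python
# Hypothetical sketch of a "see - reason - edit" video-editing loop.
# Function and field names are assumptions for illustration only;
# consult https://github.com/knightyxp/VideoCoF for the real interface.
from dataclasses import dataclass


@dataclass
class EditPlan:
    target_region: str    # textual description of where to edit (no mask required)
    frame_range: tuple    # (start_frame, end_frame) the edit applies to
    instruction: str      # the concrete edit to perform


def edit_video(frames, instruction, model):
    # 1. See: encode the source frames into latent video tokens.
    video_tokens = model.encode_video(frames)
    # 2. Reason: emit an explicit edit plan (the "chain of thought"),
    #    localizing the edit in space and time instead of using a mask.
    plan: EditPlan = model.reason(video_tokens, instruction)
    # 3. Edit: condition generation on both the tokens and the plan.
    return model.generate(video_tokens, plan)
```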
Peking University Finds a Solution to the Multimodal AI "Internal Friction" Problem Identified by Zhang Xiangyu
36Ke · 2025-09-19 10:52
Core Insights
- The main issue in multimodal AI training is the internal conflict between understanding and generating capabilities, which often leads to performance degradation in one area when the other is improved [1][5]
- A new framework called UAE has been proposed to address the fundamental problem of conflicting training objectives between understanding and generating tasks, suggesting a unified approach instead of separate KPIs [3][5]

Group 1: Challenges in Multimodal AI
- Zhang Xiangyu highlighted that in unified multimodal model training, visual understanding and generation can coexist but rarely collaborate, leading to internal strife [1]
- The complexity of image generation requires intricate spatial planning, physical knowledge, and semantic reasoning, which the Transformer model struggles to handle in a single forward pass [1]
- The traditional approach of decoupling understanding and generation has led to a lack of true synergy, resulting in models that coexist without effective collaboration [9]

Group 2: The UAE Framework
- The UAE framework proposes a radical shift by eliminating separate KPIs and establishing a unified pipeline with a single quality control standard [10]
- This framework draws inspiration from the classic auto-encoder model, where the understanding task is likened to encoding and the generation task to decoding (see the sketch after this summary) [11][15]
- The UAE framework aims to ensure that the output image is a near-perfect reconstruction of the original input, thus aligning the objectives of both understanding and generating modules [17][18]

Group 3: Training Methodology
- UAE introduces a three-phase training strategy called Unified-GRPO, which emphasizes a "left-right loop, two-way reinforcement" approach to enhance collaboration between understanding and generating modules [20]
- The first phase focuses on establishing basic communication between the two modules, ensuring that the generation module can reconstruct images from the understanding module's outputs [22][23]
- Subsequent phases involve specialized training for each module, where the understanding module learns to generate detailed descriptions, and the generation module learns to execute complex instructions based on those descriptions [24][29]

Group 4: Performance Outcomes
- The UAE model has demonstrated significant improvements in generating detailed and accurate descriptions compared to other models, achieving higher scores in various evaluation metrics [36][37]
- In the GenEval benchmark, UAE achieved a comprehensive score of 0.86, ranking first among unified models, particularly excelling in tasks requiring precise understanding [38]
- The results indicate that with the right objectives and training methods, AI systems can discover more effective information representation and transmission strategies [38][39]
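To make the auto-encoder analogy concrete: the understanding module acts as the encoder (image to detailed description) and the generation module as the decoder (description back to image), with reconstruction quality as the single shared objective. The sketch below is a hedged, simplified illustration of that loop using a plain reconstruction loss; the module names are assumptions, and UAE's actual Unified-GRPO training uses reinforcement-style phases rather than this supervised step.

```python
# Hedged sketch of the auto-encoder view of a unified multimodal model:
# the understanding module "encodes" an image into a description and the
# generation module "decodes" it back; one reconstruction objective
# supervises both, so they are rewarded only when they cooperate.
# Illustrative only; not the actual UAE / Unified-GRPO implementation.
import torch
import torch.nn.functional as F


def unified_reconstruction_step(image, understanding, generation, optimizer):
    # Encode: image -> detailed description (e.g. text-token embeddings).
    description = understanding(image)
    # Decode: description -> reconstructed image.
    reconstruction = generation(description)
    # Single shared objective: the output should reconstruct the input.
    loss = F.mse_loss(reconstruction, image)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```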
An End-to-End Model! GraphCoT-VLA: A VLA Model for Manipulation Tasks with Ambiguous Instructions
具身智能之心· 2025-08-13 00:04
Core Viewpoint
- The article introduces GraphCoT-VLA, an advanced end-to-end model designed to enhance robot operations under ambiguous instructions and in open-world conditions, significantly improving task success rates and response times compared to existing methods [3][15][37]

Group 1: Introduction and Background
- The VLA (Vision-Language-Action) model has become a key paradigm in robotic operations, integrating perception, understanding, and action to interpret and execute natural language commands [5]
- Existing VLA models struggle with ambiguous language instructions and unknown environmental states, limiting their effectiveness in real-world applications [3][8]

Group 2: GraphCoT-VLA Model
- GraphCoT-VLA addresses the limitations of current VLA models by incorporating a structured Chain-of-Thought (CoT) reasoning module, which enhances understanding of ambiguous instructions and improves task planning [3][15]
- The model features a real-time updatable 3D pose-object graph that captures the spatial configuration of robot joints and the topological relationships of objects in three-dimensional space, allowing for better interaction modeling (see the sketch after this summary) [3][9]

Group 3: Key Contributions
- The introduction of a novel CoT architecture enables dynamic observation analysis, interpretation of ambiguous instructions, generation of failure feedback, and prediction of future object states and robot actions [15][19]
- The model integrates a dropout-based mixed reasoning strategy to balance rapid inference and deep reasoning, ensuring real-time performance [15][27]

Group 4: Experimental Results
- Experiments demonstrate that GraphCoT-VLA significantly outperforms existing methods in task success rates and action fluidity, particularly in scenarios with ambiguous instructions [37][40]
- In the "food preparation" task, GraphCoT-VLA improved accuracy by 10% over the best baseline, while in the "outfit selection" task, it outperformed the leading model by 18.33% [37][38]

Group 5: Ablation Studies
- The introduction of the pose-object graph improved success rates by up to 18.33%, enhancing the model's accuracy and action generation fluidity [40]
- The CoT module significantly improved the model's ability to interpret and respond to ambiguous instructions, demonstrating enhanced task planning and future action prediction capabilities [41]
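The summary describes the 3D pose-object graph only conceptually, so the following is a speculative sketch of how such a graph might be represented: nodes for robot joints and scene objects with 3D positions, edges for their spatial or topological relations, updated every perception step. All class and field names are assumptions for illustration, not the paper's actual data structures.

```python
# Speculative sketch of a 3D pose-object graph: nodes are robot joints and
# scene objects with 3D positions, edges encode spatial/topological relations.
# Names and relation labels are illustrative, not GraphCoT-VLA's code.
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    position: tuple      # (x, y, z) in the robot's base frame
    kind: str            # "joint" or "object"


@dataclass
class PoseObjectGraph:
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)   # (name_a, name_b, relation)

    def update_node(self, name, position, kind):
        # Real-time update: overwrite the node's 3D position each perception step.
        self.nodes[name] = Node(name, position, kind)

    def add_relation(self, a, b, relation):
        # e.g. ("gripper", "cup", "near") or ("cup", "table", "on_top_of")
        self.edges.append((a, b, relation))
```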