Video Reasoning

Sweeping 6 Major Benchmarks! TW-GRPO Raises the Ceiling for Video Reasoning, with CLEVRER Accuracy Breaking 50.4%!
机器人大讲堂 · 2025-07-06 05:23
Core Viewpoint
- The rapid development of multi-modal large language models (MLLMs) is significantly enhancing video reasoning capabilities, with reinforcement learning (RL) serving as a key engine of this advance [1]

Group 1: TW-GRPO Framework Introduction
- The TW-GRPO framework is proposed to address challenges in reasoning quality and reward granularity in video reasoning tasks, building on the traditional GRPO framework [2]
- TW-GRPO combines focused thinking with a multi-level soft reward mechanism for multi-choice QA tasks [3]

Group 2: Key Improvements in TW-GRPO
- The framework improves both information weighting and reward design, transferring a soft reward mechanism from video temporal grounding to video reasoning tasks [4]
- A dynamic weighting mechanism prioritizes tokens with high information density, improving reasoning accuracy and efficiency by focusing on key content (see the token-weighting sketch after this summary) [4]
- The multi-level reward mechanism redefines rewards to give credit for partially correct answers, improving training stability and efficiency (see the soft-reward sketch after this summary) [5]

Group 3: Data Augmentation and Training Efficiency
- TW-GRPO introduces question-answer inversion (QAI), a data augmentation technique that converts single-choice tasks into multi-choice formats, effectively expanding the training data pool (see the QAI sketch after this summary) [6]
- Together these changes break with the traditional equal treatment of tokens, improving training efficiency and reasoning performance through differentiated information processing [6]

Group 4: Experimental Validation
- Extensive experiments demonstrate TW-GRPO's effectiveness on video reasoning and general understanding tasks, outperforming Video-R1 by 18.8%, 1.8%, and 1.6% on different benchmarks [12][15]
- The framework converges faster and trains more stably than traditional GRPO, and its shorter output sequences indicate more efficient reasoning [11][17]

Group 5: Qualitative Analysis of Reasoning Paths
- A qualitative comparison of reasoning paths between T-GRPO and TW-GRPO shows clear gains in accuracy and efficiency on dynamic visual cue reasoning tasks [22]
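To make the dynamic token weighting concrete, here is a minimal PyTorch sketch. The summary above does not spell out TW-GRPO's exact weighting formula; measuring per-position disagreement across the G sampled rollouts of a GRPO group and using it as an information-density weight is our assumption, and both function names are hypothetical.

```python
import torch


def token_weights_from_group_disagreement(group_logits: torch.Tensor) -> torch.Tensor:
    """Estimate per-position information density from a group of rollouts.

    group_logits: (G, T, V) logits for G >= 2 sampled completions of length T
    over a vocabulary of size V (sequences padded to a common length).
    Positions where the G rollouts disagree most are treated as information-
    dense and receive larger weights. Using the variance of token
    distributions across the group is a simplified stand-in for whatever
    density estimate TW-GRPO actually uses.
    """
    probs = group_logits.softmax(dim=-1)         # (G, T, V) token distributions
    disagreement = probs.var(dim=0).sum(dim=-1)  # (T,) variance across the group
    # Normalize so the mean weight is ~1 and the overall loss scale is kept.
    return disagreement / disagreement.mean().clamp_min(1e-8)


def weighted_policy_loss(per_token_loss: torch.Tensor, weights: torch.Tensor) -> torch.Tensor:
    """Apply the weights to a standard per-token GRPO/PPO-style loss."""
    return (per_token_loss * weights).mean()
```

The normalization matters for stability: it shifts the relative emphasis across tokens without changing the magnitude of the gradient signal.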
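The multi-level reward for multi-choice QA can be sketched as partial credit over option sets. The IoU-style overlap below is modeled on the soft rewards from video temporal grounding that the summary says TW-GRPO adapts; the exact scoring function is an assumption, and `soft_multi_choice_reward` is a hypothetical name.

```python
def soft_multi_choice_reward(predicted: set[str], gold: set[str]) -> float:
    """Partial-credit reward for multi-choice QA.

    Instead of the binary 0/1 reward of vanilla GRPO, a prediction that
    recovers some gold options (without too many wrong picks) earns an
    intermediate reward. The IoU-style overlap below is an assumption
    modeled on soft rewards from video temporal grounding.
    """
    if not predicted or not gold:
        return 0.0
    overlap = predicted & gold        # correctly chosen options
    union = predicted | gold          # every option touched by either set
    return len(overlap) / len(union)  # 1.0 only for an exact match


# Example: the gold answer is {A, C}; choosing only {A} earns partial credit.
print(soft_multi_choice_reward({"A"}, {"A", "C"}))       # 0.5
print(soft_multi_choice_reward({"A", "C"}, {"A", "C"}))  # 1.0
print(soft_multi_choice_reward({"B", "D"}, {"A", "C"}))  # 0.0
```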
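Question-answer inversion can be illustrated in a few lines. The assumption here, which is our reading rather than something stated above, is that QAI negates the question and takes the complement of the gold option, so a single-choice item with answer B becomes a multi-choice item whose answers are A, C, and D; the negation template and field names are hypothetical.

```python
def question_answer_inversion(question: str, options: dict[str, str], answer: str) -> dict:
    """Turn a single-choice QA pair into a multi-choice one.

    Sketch of QAI as described in the summary: the question is negated and
    the new answer set is the complement of the original gold option. The
    negation template below is a simplified, hypothetical stand-in for
    whatever rewriting the authors actually use.
    """
    inverted_question = question.rstrip("?") + "? (Select all options that are INCORRECT.)"
    inverted_answers = sorted(set(options) - {answer})  # complement of the gold option
    return {
        "question": inverted_question,
        "options": options,
        "answers": inverted_answers,  # now a multi-choice answer set
    }


sample = question_answer_inversion(
    "What causes the collision in the video?",
    {"A": "the red cube", "B": "the green ball", "C": "the cyan cylinder", "D": "gravity"},
    answer="B",
)
print(sample["answers"])  # ['A', 'C', 'D']
```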
The "Sherlock Holmes Test" for Video Reasoning: Every Large Model Fails | Paper and Code Open-Sourced
量子位 · 2025-05-29 07:19
Jin Lei, compiled from 凹非寺
量子位 | WeChat official account QbitAI

A new benchmark has left large models failing across the board at complex video reasoning!

It is Video-Holmes, newly released by Tencent ARC Lab and City University of Hong Kong. As the name suggests, it is the "Sherlock Holmes test" of video reasoning: it puts multi-modal large models through high-difficulty reasoning tasks such as "deducing the murderer" and "analyzing the criminal's intent" to probe the limits of their complex video reasoning abilities.

Video-Holmes also sidesteps a pain point of existing benchmarks in the field: their video sources and questions are too simple to reveal the gap between reasoning and non-reasoning models.

Notably, a "one-click evaluation kit" for the benchmark is already live on GitHub and HuggingFace, so anyone working on video reasoning can go take up the challenge (links at the end of the article).

A new benchmark that wipes out every large model

As mentioned above, existing video reasoning benchmarks (such as VCR-Bench and MVBench) mainly evaluate a model's visual perception and grounding abilities.

Here is an example. To determine the man's true cause of death, the model must actively decide which visual information to attend to and logically connect multiple related … scattered across different video segments …