Video Reasoning
Large models have learned to drag the progress bar when watching videos: Alibaba's new research rids video reasoning of guesswork and enables evidence-chain thinking
36Kr · 2026-01-29 09:29
Core Insights
- The research team from Alibaba's Future Life Lab highlights that the effectiveness of models in video reasoning tasks is significantly influenced by how they are taught to "think," in contrast with mathematical reasoning, where reinforcement learning (RL) alone yields substantial performance improvements [1][11]

Group 1: ReWatch Dataset
- The ReWatch dataset consists of 10,000 videos, 170,000 question-answer pairs, and 135,000 reasoning chains, addressing three main issues in existing training data: rough video descriptions, overly simplistic Q&A, and a heavy reliance on textual common sense rather than video content [2][4]
- Key features of the ReWatch dataset include high-fidelity temporal captions, high-difficulty video Q&A whose answers require detailed video content, and video-grounded reasoning chains that simulate human-like "rewatch and confirm" behavior [2][4]

Group 2: ReWatch-R1 Model
- The ReWatch-R1 model employs an SFT+RL paradigm with an innovative reward mechanism that emphasizes the reasoning process itself, rather than just the final answer [6][8]
- The process reward is computed from observation and reasoning rewards, ensuring that the model learns to derive answers from accurate observations and effective reasoning actions [8]

Group 3: Experimental Results
- ReWatch-R1 achieved state-of-the-art (SOTA) performance across five mainstream video reasoning benchmarks, significantly outperforming all comparable open-source models and validating the proposed methodology [9]
- A critical insight from the experiments is that while supervised fine-tuning (SFT) alone does not surpass the direct-answering mode, the RL phase produces a remarkable performance leap for the "thinking mode," underscoring the necessity of explicit, evidence-based reasoning in complex video tasks [11]

Group 4: Conclusion
- The work on ReWatch-R1 contributes valuable insights and resources to the field of video understanding, addressing the core bottleneck of high-quality video reasoning data and successfully teaching models to engage in deep thinking grounded in video evidence [13]
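The observation/reasoning reward split described above can be sketched in a few lines. This is a hypothetical illustration only: the set-based fact matching, the exact-match reasoning check, and the equal 0.5/0.5 weighting are all assumptions standing in for whatever judge models the authors actually use.

```python
# Hypothetical sketch of a process reward in the spirit of ReWatch-R1:
# blend an observation score (are the model's stated observations backed
# by the ground-truth captions?) with a reasoning score (does the answer
# follow from those observations?). Names and weights are assumptions.

def observation_reward(observed: set[str], caption_facts: set[str]) -> float:
    """Fraction of the model's stated observations that appear among the
    high-fidelity caption facts (a stand-in for a learned judge)."""
    if not observed:
        return 0.0
    return len(observed & caption_facts) / len(observed)

def reasoning_reward(final_answer: str, answer_from_observations: str) -> float:
    """1.0 if the final answer matches what the model's own observations
    support (approximated here by exact match), else 0.0."""
    return 1.0 if final_answer == answer_from_observations else 0.0

def process_reward(observed, caption_facts, final_answer,
                   answer_from_observations, w_obs=0.5, w_reason=0.5):
    """Weighted blend of the two components; the real shaping may differ."""
    return (w_obs * observation_reward(observed, caption_facts)
            + w_reason * reasoning_reward(final_answer,
                                          answer_from_observations))
```

A response whose observations are fully grounded and whose answer follows from them scores 1.0; hallucinated observations pull the reward down even when the final answer happens to be right, which is the point of supervising the process rather than only the outcome.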
Large models have learned to drag the progress bar when watching videos! Alibaba's new research rids video reasoning of guesswork and enables evidence-chain thinking | ICLR 2026
量子位 · 2026-01-29 08:27
Core Insights
- The research team from Alibaba's Future Life Lab highlights that the effectiveness of models in video reasoning tasks is significantly influenced by how they are taught to "think" [1]
- They propose a high-quality video reasoning dataset called ReWatch and a state-of-the-art model named ReWatch-R1, which can "rewatch" videos like humans to enhance its reasoning capabilities [1]

Group 1: ReWatch Dataset
- The ReWatch dataset consists of 10,000 videos, 170,000 question-answer pairs, and 135,000 reasoning chains, addressing three main issues in existing training data: rough video descriptions, overly simplistic Q&A, and a heavy reliance on textual common sense rather than video content [2][4]
- Key features of the ReWatch dataset include:
  1. High-fidelity temporal captions that provide detailed event descriptions with precise timestamps, forming a solid factual basis for complex reasoning [2]
  2. High-difficulty video Q&A that ensures questions depend on video details, preventing models from relying on guessing or common sense [2]
  3. Video-grounded reasoning chains that simulate the human behavior of "rewatching and confirming" through a multi-agent framework, ensuring reasoning steps are closely tied to video content [2]

Group 2: ReWatch-R1 Model
- The training of ReWatch-R1 employs an SFT+RL paradigm with an innovative reward mechanism that emphasizes the importance of the reasoning process [6]
- The core of the training method is the process reward mechanism (GRPO with O&R Reward), which supervises and rewards the model's intermediate reasoning steps rather than just the final answer [6][8]
- The process reward is calculated from:
  1. An Observation Reward, which evaluates the accuracy of the model's observations against the high-fidelity captions [8]
  2. A Reasoning Reward, which assesses the effectiveness of the model's reasoning actions based solely on its own observations [8]

Group 3: Experimental Results and Insights
- ReWatch-R1 achieved state-of-the-art performance across five mainstream video reasoning benchmarks, significantly outperforming all comparable open-source models [9]
- A key insight is that reinforcement learning (RL) is crucial for unlocking the "thinking" potential of models, enabling a substantial performance leap in the reasoning mode over the direct-answering mode [11][12]
- The study emphasizes that explicit, step-by-step reasoning supported by evidence is vital for tackling complex video tasks, with RL being the key to fostering this capability [12][14]
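Both write-ups name GRPO as the RL algorithm behind the O&R Reward. The core of GRPO-style training is normalizing each sampled response's reward against its own group, so no separate value network is needed. A minimal sketch follows; the epsilon value and the use of the population standard deviation are implementation assumptions, not details from the article.

```python
# Group-relative advantage as used in GRPO-style RL: sample a group of
# responses per prompt, score each one, and normalize rewards within the
# group. The eps constant is an assumption for numerical stability.
from statistics import mean, pstdev

def group_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """advantage_i = (r_i - mean(rewards)) / (std(rewards) + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Responses scored above their group's average receive positive advantage and are reinforced. With a process reward such as O&R, a correct answer reached through ungrounded observations can still fall below the group average and be discouraged.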
Computer Industry Weekly: Xiaohongshu's Video-Thinker breaks tool dependency, DeepSeek launches mHC (2026-01-06)
Huaxin Securities · 2026-01-06 12:34
Investment Rating
- The report maintains a "Buy" rating for several companies in the AI and computing sectors, including Weike Technology (301196.SZ), Nengke Technology (603859.SH), Hehe Information (688615.SH), and Maixinlin (688685.SH) [9]

Core Insights
- The report highlights Xiaohongshu's Video-Thinker model, which breaks the dependency on external tools for video reasoning, achieving state-of-the-art (SOTA) performance with a 7B-parameter version [3][22]
- DeepSeek's new architecture, mHC, shows significant performance improvements with only a 6.7% increase in training time, marking a breakthrough in model efficiency [31][32]
- Kimi, a Chinese AI startup, completed a $500 million Series C funding round at a post-money valuation of $4.3 billion, focusing on the development of its K3 model and talent incentives for 2026 [4][44]

Summary by Sections
1. Computing Dynamics
- The report notes stable pricing in computing-power leasing, with specific rates for various configurations [21]
- Xiaohongshu's Video-Thinker model integrates key capabilities such as temporal grounding and visual description, setting new benchmarks in video reasoning [22][23]
- The model's training paradigm is a two-stage process that enhances its reasoning capabilities while reducing reliance on external tools [26][27]
2. AI Application Dynamics
- Character.AI experienced an 8.32% increase in weekly traffic, indicating growing interest in AI applications [30]
- DeepSeek's mHC architecture addresses traditional bottlenecks in model efficiency, providing a robust framework for enhancing model capabilities [31][32]
3. AI Financing Trends
- Kimi's recent funding round will support the development of its K3 model and the expansion of its talent pool, following significant technological advances in 2025 [4][44]
- Meta's acquisition of Manus for $4-5 billion underscores the strategic importance of AI applications and the integration of advanced AI capabilities into its ecosystem [5][6]
4. Market Performance
- The report provides comparative performance metrics for various AI models, showcasing the advances Video-Thinker makes over existing solutions [28][29]
- Overall market sentiment remains positive, with a focus on the long-term growth potential of AI applications and computing technologies [7]
Letting the model find key frames and visual cues on its own: Xiaohongshu's Video-Thinker cracks the video reasoning dilemma
机器之心 · 2026-01-02 03:12
Core Insights
- The article discusses advances in video reasoning brought by the "Thinking with Videos" paradigm, specifically the Video-Thinker model, which enhances a model's ability to autonomously navigate and understand temporal sequences in videos [2][6][10]

Group 1: Model Development and Methodology
- Video-Thinker integrates "temporal grounding" and "visual captioning" into the model's cognitive chain, eliminating reliance on external tools and enabling the model to autonomously identify key frames and extract visual cues [2][10]
- The research team constructed the Video-Thinker-10K dataset of 10,000 high-quality samples and employed a two-phase "supervised fine-tuning + reinforcement learning" training strategy to strengthen the model's self-exploration and self-correction capabilities [3][10]
- The 7-billion-parameter model achieved state-of-the-art (SOTA) performance on several challenging video reasoning benchmarks, significantly surpassing existing baselines [3][22]

Group 2: Data Quality and Training Process
- Because high-quality training data is crucial for developing complex reasoning capabilities, six major datasets were integrated into Video-Thinker-10K, combining precise temporal annotations with detailed visual descriptions [12][13]
- The training process uses a structured thinking paradigm in which the model learns to output specific labels such as <time> and <caption>, enforcing a rigorous "locate - perceive - reason" sequence [16][18]
- The reinforcement learning phase, using Group Relative Policy Optimization (GRPO), lets the model explore and optimize its reasoning strategies, leading to emergent cognitive behaviors akin to human metacognition [19][22]

Group 3: Performance Evaluation
- Video-Thinker-7B demonstrated significant advantages across various video reasoning benchmarks, establishing a new SOTA among 7-billion-parameter models [25][29]
- Performance was evaluated through both in-domain and out-of-domain assessments, showcasing the model's ability to generalize to unseen scenarios [24][29]
- The model achieved 43.22% accuracy on the Video-Holmes benchmark and 80.69% on VRBench, outperforming previous models by notable margins [29][30]

Group 4: Key Findings and Implications
- The model's success is attributed to its internal grounding and captioning capabilities, which were quantitatively assessed and found superior to those of baseline models [32][36]
- The findings indicate that relying on external tools can hinder performance: experiments showed that simple plug-and-play tools degraded, rather than enhanced, the model's reasoning [34][35]
- The article concludes that Video-Thinker's integration of core internal capabilities, rather than dependence on large parameter counts and datasets, represents a new paradigm in video reasoning with potential applications across industries [39]
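The structured labels described in Group 2 can be checked mechanically. The sketch below assumes the tags are serialized literally as `<time>...</time>` and `<caption>...</caption>` in the model's output text; the actual format Video-Thinker emits may differ.

```python
import re

# Extract Video-Thinker-style structured reasoning steps. The literal
# <time>/<caption> serialization is an assumption based on the article's
# description of the "locate - perceive - reason" output paradigm.
TAG_RE = re.compile(r"<(time|caption)>(.*?)</\1>", re.DOTALL)

def extract_steps(reasoning: str) -> list[tuple[str, str]]:
    """Return (tag, content) pairs in order of appearance, so a verifier
    can confirm the model located a segment before describing it."""
    return [(m.group(1), m.group(2).strip())
            for m in TAG_RE.finditer(reasoning)]
```

For example, a trace like `<time>00:12-00:15</time><caption>a man opens the door</caption> so the answer is B` yields one time step followed by one caption step, making the locate-then-perceive ordering easy to audit or reward.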
A well-known tech fund manager's latest moves!
券商中国 · 2025-10-28 23:33
Core Viewpoint
- The article discusses the strong performance of overseas computing-power sectors, represented by optical modules and PCBs, which have delivered substantial returns for heavily invested funds but have also seen increased divergence after their large price gains [1]

Summary by Sections
Fund Performance
- On October 28, the third-quarter report of well-known Caitong Fund manager Jin Zicai was released, showing that the net-value growth rate of the Caitong Growth Preferred A share class reached 90.4% in Q3, outperforming the benchmark by over 80 percentage points [2][3]
Portfolio Adjustments
- Jin Zicai made significant adjustments to his holdings, drastically reducing positions in leading optical-module companies such as NewEase and Tianfu Communication while increasing investments in core PCB industry players such as Shenzhen South Circuit, Shengyi Technology, and Huitian Technology [2][3]
- After the adjustments, the fund's top five holdings were Industrial Fulian, Shenzhen South Circuit, Shengyi Technology, Huitian Technology, and Zhongji Xuchuang [3]
Market Insights
- Jin Zicai noted that the market's understanding of the optical communication sector has improved, leading to a reduction in the fund's holdings in this area; he believes the PCB industry may see unexpected price increases due to structural supply-demand imbalances by 2026 [3]
- Despite reducing exposure to optical modules, Jin Zicai continues to heavily overweight the overseas computing-power sector, arguing that the growth certainty of overseas AI has increased and that demand for computing power will grow rapidly in 2026 and 2027 [4][5]
Investment Strategy
- The fund's assets under management rose from 4.618 billion to 6.525 billion yuan, with a focus on maintaining research coverage of other sectors and on proactive, replicable investments in high-quality companies aligned with industry trends [5]
Sweeping 6 major benchmarks! TW-GRPO raises the ceiling for video reasoning, with CLEVRER accuracy breaking 50.4%!
机器人大讲堂 · 2025-07-06 05:23
Core Viewpoint
- The rapid development of multi-modal large language models (MLLMs) is significantly enhancing video reasoning capabilities, with reinforcement learning (RL) serving as a key engine of this technological shift [1]

Group 1: TW-GRPO Framework Introduction
- The TW-GRPO framework is proposed to address challenges in reasoning quality and reward granularity in video reasoning tasks, building on the traditional GRPO framework [2]
- TW-GRPO integrates focused thinking with multi-level soft reward mechanisms for multi-choice QA tasks [3]

Group 2: Key Improvements in TW-GRPO
- The framework improves information weighting and reward design, adapting a soft reward mechanism from video localization to video reasoning tasks [4]
- A dynamic weighting mechanism prioritizes tokens with high information density, improving reasoning accuracy and efficiency by focusing on key content [4]
- The multi-level reward mechanism redefines rewards to allow partial correctness in answers, improving training stability and efficiency [5]

Group 3: Data Augmentation and Training Efficiency
- TW-GRPO introduces a question-answer inversion (QAI) data augmentation technique that converts single-choice tasks into multi-choice formats, effectively expanding the training data pool [6]
- This approach departs from the traditional equal treatment of tokens, enhancing training efficiency and reasoning performance through differentiated information processing [6]

Group 4: Experimental Validation
- Extensive experiments demonstrate TW-GRPO's effectiveness in video reasoning and general understanding tasks, outperforming Video-R1 by 18.8%, 1.8%, and 1.6% on various benchmarks [12][15]
- The framework shows faster convergence and a more stable learning process than traditional GRPO, with shorter output sequences indicating more efficient reasoning [11][17]

Group 5: Qualitative Analysis of Reasoning Paths
- A qualitative comparison of the reasoning paths of T-GRPO and TW-GRPO illustrates significant improvements in accuracy and efficiency on dynamic visual-cue reasoning tasks [22]
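The multi-level soft reward from Group 2 can be illustrated with a set-overlap score over answer options. The Jaccard-style formula below is an assumed stand-in for TW-GRPO's actual reward, chosen only to show how partial correctness earns partial credit instead of the usual all-or-nothing 0/1.

```python
def soft_choice_reward(pred: set[str], gold: set[str]) -> float:
    """Partial credit for multi-choice QA: |pred & gold| / |pred | gold|.
    A Jaccard-style stand-in for TW-GRPO's multi-level reward; missing a
    correct option or adding a wrong one both reduce the score, but a
    partially right answer set no longer collapses to zero reward."""
    if not pred or not gold:
        return 0.0
    return len(pred & gold) / len(pred | gold)
```

Under an all-or-nothing reward, predicting {A} against gold {A, B} scores 0; here it scores 0.5, giving the policy a denser learning signal, which is the training-stability benefit the article describes.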
The "Sherlock Holmes test" for video reasoning: all large models fail across the board | Paper and code open-sourced
量子位 · 2025-05-29 07:19
Core Viewpoint
- The Video-Holmes benchmark, introduced by Tencent ARC Lab and City University of Hong Kong, reveals that current large models perform inadequately on complex video reasoning tasks, highlighting significant gaps in their reasoning capabilities [1][7]

Group 1: Benchmark Overview
- Video-Holmes is a new benchmark for evaluating complex video reasoning, designed to address the shortcomings of existing benchmarks, whose video sources and questions are often overly simplistic [1][8]
- The benchmark comprises 270 short films, each 1-5 minutes long, and poses seven types of high-reasoning-demand questions that compel models to extract and connect multiple key pieces of information scattered throughout the videos [9]

Group 2: Model Performance
- All tested large models performed poorly, with none passing the benchmark, indicating a widespread deficiency in reasoning capability [5][6]
- Average scores across reasoning categories (e.g., Social Reasoning, Intent and Motive Chain, Time Causal Inference) were notably low, with the highest average score being 51.3, achieved by Gemini-2.5-Pro [6]

Group 3: Reasoning Process Analysis
- Analysis of reasoning traces showed that while models could correctly perceive visual information, they struggled significantly to link clues and often overlooked critical visual details [18]
- Specific reasoning errors were noted in which models misinterpreted interactions or failed to accurately assess relationships based on video content [15][16]

Group 4: Accessibility and Tools
- The Video-Holmes benchmark and its associated resources, including evaluation code and model-integration tools, are open-source and available on platforms such as GitHub and HuggingFace [19][20]
- Users interested in testing their models against the benchmark can access a comprehensive guide and the necessary setup and evaluation commands [19][20]