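Large models have learned to scrub through the video timeline: new Alibaba research helps video reasoning move past guesswork toward evidence-chain thinking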

Core Insights
- The research team at Alibaba's Future Life Lab finds that, in video reasoning tasks, a model's effectiveness depends heavily on how it is taught to "think." This contrasts with mathematical reasoning, where reinforcement learning (RL) yields substantial performance improvements [1][11]

Group 1: ReWatch Dataset
- The ReWatch dataset consists of 10,000 videos, 170,000 question-answer pairs, and 135,000 reasoning chains. It addresses three main issues in existing training data: coarse video descriptions, overly simplistic Q&A, and heavy reliance on textual common sense rather than video content [2][4]
- Its key features are high-fidelity temporal subtitles, high-difficulty video Q&A that can only be answered from detailed video content, and video-grounded reasoning chains that simulate human-like review and confirmation behaviors [2][4]

Group 2: ReWatch-R1 Model
- The ReWatch-R1 model follows an SFT+RL paradigm with a reward mechanism that scores the reasoning process, not just the final answer [6][8]
- The process reward is computed from an observation reward and a reasoning reward, so that the model learns to derive answers from accurate observations and effective reasoning actions [8]

Group 3: Experimental Results
- ReWatch-R1 achieves state-of-the-art (SOTA) performance on five mainstream video reasoning benchmarks, significantly outperforming all comparable open-source models and validating the proposed methodology [9]
- A critical insight from the experiments is that supervised fine-tuning (SFT) alone does not surpass the direct-answering mode, whereas the RL phase gives the "thinking mode" a remarkable performance leap. This underscores the need for explicit, evidence-based reasoning in complex video tasks [11]

Group 4: Conclusion
- The work on ReWatch-R1 contributes insights and resources to the field of video understanding. It addresses the core bottleneck of high-quality video reasoning data and teaches models to think deeply on the basis of video evidence [13]
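
To make the reward design under Group 2 concrete, here is a minimal Python sketch of how a process reward built from an observation reward and a reasoning reward might be combined with answer correctness. It is an illustration under stated assumptions, not the authors' implementation: the `Rollout` fields, the equal averaging of the two process terms, and the weights `w_answer` and `w_process` are all hypothetical, since the summary does not give the actual formulas.

```python
from dataclasses import dataclass


@dataclass
class Rollout:
    """One sampled response. Every field is a hypothetical stand-in."""
    answer_correct: bool        # final answer matches the reference
    observations_total: int     # observations the model claimed about the video
    observations_verified: int  # claimed observations confirmed against the video
    reasoning_sufficient: bool  # answer is derivable from the stated observations


def observation_reward(r: Rollout) -> float:
    """Fraction of claimed observations that were verified (0.0 if none)."""
    if r.observations_total == 0:
        return 0.0
    return r.observations_verified / r.observations_total


def reasoning_reward(r: Rollout) -> float:
    """1.0 if the reasoning steps suffice to reach the answer, else 0.0."""
    return 1.0 if r.reasoning_sufficient else 0.0


def total_reward(r: Rollout, w_answer: float = 1.0, w_process: float = 0.5) -> float:
    """Answer reward plus a weighted process reward (weights are assumptions)."""
    process = 0.5 * (observation_reward(r) + reasoning_reward(r))
    return w_answer * float(r.answer_correct) + w_process * process


# Two rollouts with the same correct answer but different evidence quality.
grounded = Rollout(True, observations_total=4, observations_verified=4,
                   reasoning_sufficient=True)
guessed = Rollout(True, observations_total=4, observations_verified=1,
                  reasoning_sufficient=False)
```

In this sketch, `grounded` scores higher than `guessed` even though both answers are correct, which is the behavior a process-aware reward is meant to encourage.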