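Large Models Learn to Scrub the Progress Bar While Watching Video! New Alibaba Research Frees Video Reasoning from Guesswork and Enables Evidence-Chain Thinking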

Core Insights
- The research team from Alibaba's Future Life Lab highlights that how well models perform on video reasoning tasks depends significantly on how they are taught to "think" [1]
- They propose a high-quality video reasoning dataset called ReWatch and a state-of-the-art model named ReWatch-R1, which can "rewatch" videos as humans do to enhance its reasoning capabilities [1]

Group 1: ReWatch Dataset
- The ReWatch dataset consists of 10,000 videos, 170,000 question-answer pairs, and 135,000 reasoning chains, and addresses three main issues in existing training data: coarse video descriptions, overly simplistic Q&A, and a heavy reliance on textual common sense rather than video content [2][4]
- Key features of the ReWatch dataset include:
  1. High-fidelity temporal captions that provide detailed event descriptions with precise timestamps, forming a solid factual basis for complex reasoning [2]
  2. High-difficulty video Q&A that ensures questions depend on video details, preventing models from relying on guessing or common sense [2]
  3. Video-grounded reasoning chains that simulate the human behavior of "rewatching and confirming" through a multi-agent framework, ensuring reasoning steps are closely tied to video content [2]

Group 2: ReWatch-R1 Model
- The ReWatch-R1 model is trained with an SFT+RL paradigm and an innovative reward mechanism that emphasizes the importance of the reasoning process [6]
- The core of the training method is the process reward mechanism (GRPO with O&R Reward), which supervises and rewards the model's intermediate reasoning steps rather than just the final answer [6][8]
- The process reward is calculated based on:
  1. Observation Reward, which evaluates the accuracy of the model's observations against the high-fidelity captions [8]
  2. Reasoning Reward, which assesses the effectiveness of the model's reasoning actions based solely on its observations [8]

Group 3: Experimental Results and Insights
- ReWatch-R1 achieves state-of-the-art performance across five mainstream video reasoning benchmarks, significantly outperforming all comparable open-source models [9]
- A key insight from the research is that reinforcement learning (RL) is crucial for unlocking the "thinking" potential of models: after RL, performance in the reasoning mode leaps substantially ahead of the direct answering mode [11][12]
- The study emphasizes that explicit, step-by-step reasoning supported by evidence is vital for tackling complex video tasks, with RL being the key to fostering this capability [12][14]
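
The process reward described under Group 2 can be illustrated with a rough Python sketch. This is not the authors' implementation: the summary does not give the actual formulas, so the word-overlap matching, the 0.5 threshold, the additive combination with equal weights, and every function name here (`observation_reward`, `reasoning_reward`, `total_reward`, `group_relative_advantages`, and the `answer_from_observations` callback) are assumptions made purely for illustration. The only part taken from standard practice is the group-relative normalization that GRPO applies to rewards within a group of sampled responses.

```python
from statistics import mean, pstdev
from typing import Callable, List, Sequence


def observation_reward(observations: Sequence[str], captions: Sequence[str]) -> float:
    """Fraction of model observations supported by some reference caption.

    ASSUMPTION: "supported" means at least half of the observation's words
    appear in one caption. The real matching method is not given in the source.
    """
    if not observations:
        return 0.0
    caption_words = [set(c.lower().split()) for c in captions]
    hits = 0
    for obs in observations:
        words = set(obs.lower().split())
        if words and any(len(words & cw) / len(words) >= 0.5 for cw in caption_words):
            hits += 1
    return hits / len(observations)


def reasoning_reward(
    observations: Sequence[str],
    gold_answer: str,
    answer_from_observations: Callable[[List[str]], str],
) -> float:
    """1.0 if the gold answer is recoverable from the observations alone.

    ASSUMPTION: `answer_from_observations` is a hypothetical callback (for
    example a text-only judge model) that sees the observations but not the video.
    """
    predicted = answer_from_observations(list(observations))
    return 1.0 if predicted.strip().lower() == gold_answer.strip().lower() else 0.0


def total_reward(
    answer_correct: bool,
    obs_r: float,
    reas_r: float,
    w_obs: float = 0.5,   # ASSUMPTION: illustrative weight
    w_reas: float = 0.5,  # ASSUMPTION: illustrative weight
) -> float:
    """Final-answer reward plus weighted observation and reasoning rewards."""
    return float(answer_correct) + w_obs * obs_r + w_reas * reas_r


def group_relative_advantages(rewards: Sequence[float]) -> List[float]:
    """GRPO-style advantage: normalize each reward by its group's mean and std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0:
        # All responses scored the same, so none is preferred over another.
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]
```

Under these assumptions, a response that reaches the right answer but cites observations absent from the captions scores lower than one whose observations are grounded, which is the behavior the process reward is meant to encourage.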