Peking University and ByteDance open-source the first spatio-temporal reasoning video model: fully transparent thinking process, performance surpassing GPT-4o
量子位 (QbitAI) · 2025-11-05 07:56

Core Insights
- The article covers the launch of Open-o3 Video, an open-source model from a joint Peking University–ByteDance team that integrates explicit spatio-temporal evidence into video reasoning, so the model not only answers questions but also indicates when and where the supporting events occur [2][8].

Group 1: Model Capabilities
- Open-o3 Video uses a non-agent architecture, completing the "see-think-evidence-answer" loop in a single response, without complex tool calls or multi-round reasoning [4].
- Across video reasoning benchmarks, Open-o3 Video achieved up to a 24.2% performance improvement, surpassing models such as GPT-4o and Gemini-2-Flash [5][46].

Group 2: Research Background
- Video understanding is among the most complex tasks for multimodal large language models (MLLMs): the model must recognize objects and actions while also determining when and where they occur [8][10].
- Existing models such as Video-R1 and VideoRFT have improved the logical consistency of video understanding but still cannot ground their answers in visual evidence [10][11].

Group 3: Data Construction
- The team built the first unified corpus for explicit spatio-temporal reasoning, STGR (Spatio-Temporal Grounded Reasoning), consisting of STGR-CoT-30k for supervised fine-tuning and STGR-RL-36k for reinforcement learning [18][20].
- The data covers four task types: temporal localization, spatial localization, spatio-temporal localization, and video question answering [20].

Group 4: Training Process
- Open-o3 Video uses a two-stage training scheme: cold-start pre-training followed by reinforcement learning based on GSPO [26][28].
- The cold-start stage teaches the model to generate structured responses with spatio-temporal annotations; the reinforcement learning stage then optimizes the alignment of that spatio-temporal evidence [30][31].
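To make the "see-think-evidence-answer" loop concrete, here is a minimal sketch of how a structured response with spatio-temporal annotations might be parsed. The tag names (`<think>`, `<obs>`, `<answer>`) and the evidence syntax are assumptions for illustration only; the article does not specify Open-o3 Video's exact output format.

```python
import re

# Hypothetical evidence syntax: <obs>t=12.5s, box=[x1, y1, x2, y2]</obs>
EVIDENCE_RE = re.compile(
    r"<obs>\s*t=(?P<t>[\d.]+)s\s*,\s*box=\[(?P<box>[\d.,\s]+)\]\s*</obs>"
)

def parse_response(text):
    """Extract the final answer and any (timestamp, bounding-box) evidence
    from a single structured response."""
    answer = re.search(r"<answer>(.*?)</answer>", text, re.S)
    evidence = [
        (float(m.group("t")),
         [float(v) for v in m.group("box").split(",")])
        for m in EVIDENCE_RE.finditer(text)
    ]
    return (answer.group(1).strip() if answer else None), evidence

resp = ("<think>The dog appears near the door.</think>"
        "<obs>t=12.5s, box=[0.10, 0.20, 0.45, 0.80]</obs>"
        "<answer>The dog enters at around 12.5 seconds.</answer>")
ans, ev = parse_response(resp)
# ans → the textual answer; ev → [(12.5, [0.1, 0.2, 0.45, 0.8])]
```

The key point is that the answer and its spatio-temporal evidence arrive in one response, so a downstream checker can verify the claim against the video directly.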
Group 5: Experimental Results
- Open-o3 Video shows significant gains in temporal IoU and visual IoU, with overall mAM up 14.4% and mLGM up 24.2%, outperforming large closed-source models [46][47].
- Because its answers come with verifiable evidence, the model is more interpretable and reliable, pairing accuracy with a higher standard of explanation [48].

Group 6: Ablation Studies
- Ablations confirm the importance of the two-stage training scheme: combining supervised fine-tuning with reinforcement learning significantly improves performance [54][57].
- The adaptive temporal-proximity and temporal-gating mechanisms improve the model's accuracy and reliability in spatio-temporal reasoning [58][60].

Group 7: Future Directions
- The team plans to further refine the spatio-temporal reasoning data and post-training mechanisms to support question answering over longer videos and more complex scenarios [81].
- Open-o3 Video's open-source release invites community engagement and further exploration of video multimodal models [82].
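The evaluation metrics mentioned above can be sketched under their standard definitions: temporal IoU over 1-D time intervals and visual IoU over axis-aligned boxes. The Gaussian shape used here for the temporal-proximity reward is an assumption about how such a soft reward could look, not the article's stated formula.

```python
import math

def temporal_iou(a, b):
    """IoU of two (start, end) time intervals, in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def box_iou(a, b):
    """IoU of two (x1, y1, x2, y2) axis-aligned boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def temporal_proximity(t_pred, t_gt, sigma=1.0):
    """Hypothetical soft reward that decays smoothly with distance from
    the ground-truth timestamp, rather than scoring all misses as zero."""
    return math.exp(-((t_pred - t_gt) ** 2) / (2 * sigma ** 2))
```

A soft proximity reward like this gives the policy a useful gradient signal when a predicted timestamp is close but not exact, which is the motivation the article gives for the adaptive temporal-proximity mechanism.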
