CVPR 2026 | Streamo: Turning Large Models into Real-Time Streaming Interactive Assistants

Core Insights
- Current video large models fall short in real-time interactive scenarios: they need a way to handle unbounded video streams and to decide when to respond [4][6][19]
- Streamo, developed by Hong Kong Baptist University in collaboration with Tencent Youtu Lab, unifies response-timing decisions and content generation in a single end-to-end training framework [2][7][19]

Problem Analysis
- Current video large models such as Qwen2-VL and LLaVA-Video excel in offline scenarios but struggle with real-time interaction because they rely on complete video segments for inference [4][6]
- In real-world streaming scenarios, a model must judge from the frames seen so far, without being able to "see the future," which makes it hard to decide when to respond [4][6]

Streamo Framework
- Streamo turns the question of "when to respond" into a token that the model predicts, and organizes streaming video into a multi-turn dialogue format [9][10]
- At each second, the model predicts one of three response-state tokens, which lets it decide when to generate output as the context evolves [9][10]

Training Data and Methodology
- The training dataset, Streamo-Instruct-465K, contains approximately 465,000 instruction samples drawn from 135,875 video segments and is designed to give clear temporal boundaries for model responses [12][13]
- The dataset supports tasks including real-time narration, event captioning, and time-sensitive question answering, all under a unified temporal supervision framework [13][14]

Experimental Results
- Streamo-7B outperformed the baseline model Dispider by 13.83 percentage points on OVO-Bench, demonstrating superior real-time perception, backward tracing, and forward active responding capabilities [16]
- Trained at 1fps and evaluated at 2fps, the model showed a 4.66% performance improvement, indicating strong generalization ability [16]

Conclusion
- Streamo addresses critical bottlenecks in current video large models and offers a reusable technical pathway for converting static perception models into dynamic interactive agents [19]
- The framework improves the accuracy and coherence of responses in real-time scenarios, paving the way for advances in streaming video understanding [20]
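
The per-second decision loop described under Streamo Framework can be illustrated with a minimal sketch. It is not the authors' implementation: the state names `WAIT`, `READY`, and `RESPOND`, the `StreamSession` class, and the `toy_*` functions are hypothetical placeholders, and a simple rule over frame captions stands in for the model's token prediction. The sketch only shows the shape of the idea: each second of video becomes one dialogue turn, the model emits a state for that turn, and text is generated only when the state calls for a response.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

# Placeholder state names (assumptions, not the tokens Streamo actually uses).
WAIT, READY, RESPOND = "WAIT", "READY", "RESPOND"

Turn = Tuple[str, str]  # (role, content)


@dataclass
class StreamSession:
    """Holds the stream as a growing multi-turn dialogue."""
    question: str
    turns: List[Turn] = field(default_factory=list)

    def step(
        self,
        second: int,
        frame: str,
        predict_state: Callable[[str, List[Turn]], str],
        generate: Callable[[str, List[Turn]], str],
    ) -> Tuple[str, Optional[str]]:
        # Each second becomes one user turn carrying the newest frame.
        self.turns.append(("user", f"[t={second}s] {frame}"))
        # The state is chosen from past and current context only.
        state = predict_state(self.question, self.turns)
        reply = generate(self.question, self.turns) if state == RESPOND else None
        # The assistant turn records the state, plus the reply if any.
        self.turns.append(("assistant", state if reply is None else f"{state} {reply}"))
        return state, reply


def toy_predict_state(question: str, turns: List[Turn]) -> str:
    """Hypothetical stand-in for the model's state prediction."""
    latest = turns[-1][1]
    if "door opens" in latest:
        return RESPOND
    if "hand on door" in latest:
        return READY
    return WAIT


def toy_generate(question: str, turns: List[Turn]) -> str:
    """Hypothetical stand-in for response generation."""
    return "The door has just opened."


def run_stream(frames: List[str], question: str) -> List[Tuple[int, str, Optional[str]]]:
    session = StreamSession(question)
    outputs = []
    for second, frame in enumerate(frames):  # one frame caption per second
        state, reply = session.step(second, frame, toy_predict_state, toy_generate)
        outputs.append((second, state, reply))
    return outputs


if __name__ == "__main__":
    captions = ["empty hallway", "person walks in", "hand on door", "door opens"]
    for second, state, reply in run_stream(captions, "Tell me when the door opens."):
        print(second, state, reply)
```

Because the state is predicted from the dialogue history alone, the loop never needs the full video, which is the property the Problem Analysis section says offline models lack.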