Core Viewpoint
- Keye-VL 1.5, an advanced multimodal model developed by Kuaishou, has been open-sourced, showing significant improvements in video understanding and reasoning over its predecessor [1][4][6].

Group 1: Model Capabilities
- Keye-VL 1.5 offers enhanced temporal localization, pinpointing when specific objects appear in a video with 0.1-second precision [10][8].
- The model introduces a Slow-Fast dual encoding mechanism that supports a 128k-token context window while balancing temporal coverage against per-frame detail [5][8].
- It posts strong benchmark results, scoring 73.0 on the Video-MME short-video benchmark and leading in multiple other evaluation scenarios [6][18].

Group 2: Benchmark Performance
- Keye-VL 1.5 outperforms models such as Qwen2.5-VL 7B across several benchmarks, including OpenCompass and MMBench, achieving top scores in its class [19][21].
- On human-annotated metrics, Keye-VL 1.5 achieved an average score of 3.53, an improvement of 0.51 points over the preview version and ahead of competing models [24][25].

Group 3: Model Architecture
- Keye-VL 1.5 follows a "Vision Transformer (ViT) + MLP projector + language decoder" structure designed to capture global spatial relationships in video frames [27][28]; a minimal sketch of this wiring appears after this summary.
- The model is trained with a four-stage progressive pre-training pipeline over more than 1 trillion tokens drawn from diverse data sources to strengthen its multimodal capabilities [39][41].

Group 4: Research and Development
- The Keye team has presented multiple research findings at top conferences, including advances in multimodal reinforcement learning and governance frameworks for visual language models [51][54].
- The team's work focuses on integrating visual, linguistic, and behavioral data to improve cognition and decision-making in AI applications [50].
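The "ViT + MLP projector + language decoder" layout named in Group 3, together with the Slow-Fast dual encoding from Group 1, follows the familiar pattern of projecting visual tokens into a language decoder's embedding space. The sketch below illustrates that pattern only: the SlowFastVLM class, all layer sizes, the frame-sampling strides, and the pooling of the fast pathway are assumptions made for this example and do not reflect Keye-VL 1.5's actual implementation.

```python
# Minimal PyTorch sketch of a "ViT + MLP projector + language decoder" pipeline with a
# Slow-Fast style frame split. Layer sizes, strides, and pooling are illustrative
# assumptions, not Keye-VL 1.5 internals; real components would be pretrained models.
import torch
import torch.nn as nn


def split_slow_fast(frames, slow_stride=30, fast_stride=2):
    """Sparse 'slow' frames keep full patch detail; dense 'fast' frames give temporal
    coverage and are pooled later so the sequence fits a long context budget."""
    return frames[::slow_stride], frames[::fast_stride]


class SlowFastVLM(nn.Module):
    def __init__(self, vit_dim=256, llm_dim=512, vocab_size=32000):
        super().__init__()
        # Stand-in for a pretrained Vision Transformer producing patch tokens.
        self.vit = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=vit_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        # MLP projector: maps visual features into the decoder's embedding space.
        self.projector = nn.Sequential(
            nn.Linear(vit_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim)
        )
        self.embed = nn.Embedding(vocab_size, llm_dim)
        # Stand-in for an autoregressive language decoder (a real LLM would use causal
        # attention over a context of up to 128k tokens).
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=llm_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.lm_head = nn.Linear(llm_dim, vocab_size)

    def encode_video(self, frames):
        slow, fast = split_slow_fast(frames)
        slow_tok = self.projector(self.vit(slow))                       # (Ts, P, llm_dim)
        # Pool the fast pathway's patch tokens to one token per frame to stay cheap.
        fast_tok = self.projector(self.vit(fast).mean(dim=1, keepdim=True))
        return torch.cat([slow_tok.flatten(0, 1), fast_tok.flatten(0, 1)], dim=0)

    def forward(self, frames, text_ids):
        vis = self.encode_video(frames)                                 # (Nv, llm_dim)
        txt = self.embed(text_ids)                                      # (Nt, llm_dim)
        seq = torch.cat([vis, txt], dim=0).unsqueeze(0)                 # (1, Nv+Nt, llm_dim)
        hidden = self.decoder(seq)
        return self.lm_head(hidden[:, vis.size(0):])                    # logits for text positions


# Toy usage: 64 pre-patchified frames of 16 tokens each, plus a 12-token text prompt.
frames = torch.randn(64, 16, 256)
text_ids = torch.randint(0, 32000, (12,))
logits = SlowFastVLM()(frames, text_ids)   # shape (1, 12, 32000)
```

The intent of the split is that the slow pathway preserves spatial detail on a handful of frames while the pooled fast pathway covers many frames cheaply, which is one way such a design can fit dense video plus text inside a 128k-token window.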
A new benchmark for video understanding: Kuaishou open-sources its multimodal reasoning model, with a 128k context window, 0.1-second video localization, and cross-modal reasoning