Kuaishou team releases the 8B Kwai Keye-VL! A quick look at the technical report
自动驾驶之心·2025-07-07 12:17

Core Insights

- The article covers the release of Kwai Keye-VL, an 8-billion-parameter multimodal large language model (MLLM) designed to improve understanding of short-video content, addressing the limitations of existing models on dynamic, information-dense media [2][3].

Group 1: Model Development

- Kwai Keye-VL is trained on a large-scale corpus of over 600 billion tokens, weighted toward high-quality video data, and employs an innovative training strategy [2][4].
- Training comprises a four-stage pre-training phase followed by a two-stage post-training phase, designed to align visual and language features effectively [4][18]; a hypothetical sketch of such a staged schedule appears after this summary.

Group 2: Training Methodology

- The first post-training stage optimizes foundational capabilities such as instruction following through supervised fine-tuning (SFT) and mixed preference optimization (MPO) [5]; see the MPO-style loss sketch below.
- The second stage strengthens reasoning through a five-mode "cold start" data-mixing strategy that combines diverse reasoning tasks with high-quality video data [6][12]; see the data-mixing sketch below.

Group 3: Performance Evaluation

- Keye-VL achieves strong results on public benchmarks and outperforms other leading models of similar size in user-experience evaluations [3][27].
- Its capabilities were validated through extensive evaluation experiments, including KC-MMBench, a new benchmark built for real-world short-video scenarios [3][28].

Group 4: Technical Innovations

- Training efficiency rests on a hybrid parallelism strategy that combines data parallelism with sequence parallelism to balance memory usage and computational efficiency [22][23]; see the sequence-sharding sketch below.
- A dynamic load-balancing mechanism counteracts the compute imbalance across multimodal samples, significantly improving training speed [24]; see the bin-packing sketch below.
- A sample-level auto-resume mechanism improves training stability by recovering automatically from interruptions [25]; see the resume sketch below.
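To make the staged recipe concrete, here is a purely illustrative Python sketch of a four-stage pre-training plus two-stage post-training schedule. Every stage name, unfrozen-module list, and data label below is an assumption for illustration only; the actual stage contents are defined in the technical report.

```python
# Illustrative staged-training schedule; all names and settings are assumptions.
from dataclasses import dataclass

@dataclass
class Stage:
    name: str
    trainable: tuple  # which modules are unfrozen in this stage (assumed)
    data: str         # dominant data mixture for this stage (assumed)

PRETRAIN = [
    Stage("pretrain-1-align",     ("projector",),               "image-text pairs"),
    Stage("pretrain-2-multitask", ("projector", "vit"),         "captioning/OCR/grounding"),
    Stage("pretrain-3-video",     ("projector", "vit", "llm"),  "high-quality video"),
    Stage("pretrain-4-anneal",    ("projector", "vit", "llm"),  "curated high-quality mix"),
]

POSTTRAIN = [
    Stage("posttrain-1-sft-mpo",  ("llm",), "instruction + preference data"),
    Stage("posttrain-2-reasoning", ("llm",), "five-mode cold-start mix"),
]

for stage in PRETRAIN + POSTTRAIN:
    print(f"{stage.name}: unfreeze {stage.trainable}, train on {stage.data}")
```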
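The SFT-plus-MPO objective from Group 2 can be sketched as follows, assuming an MPO variant that blends a DPO-style preference term with a generation (SFT) term; the weights `w_pref`/`w_gen` and the exact formulation are assumptions, not the report's loss.

```python
# Minimal sketch of a mixed-preference-optimization-style objective.
import torch
import torch.nn.functional as F

def mpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp,
             sft_nll, beta=0.1, w_pref=1.0, w_gen=1.0):
    """policy_*_logp: summed log-probs of chosen/rejected responses under the
    policy; ref_*_logp: the same under a frozen reference model; sft_nll:
    negative log-likelihood of the chosen response (the SFT term)."""
    # DPO-style preference term: make the policy prefer chosen over rejected
    # responses relative to the reference model.
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    pref = -F.logsigmoid(margin).mean()
    # Generation term keeps the policy anchored to high-quality responses.
    return w_pref * pref + w_gen * sft_nll.mean()

# Toy usage with random stand-in log-probabilities:
lp = lambda: torch.randn(4)
print(mpo_loss(lp(), lp(), lp(), lp(), torch.rand(4)))
```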
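The five-mode "cold start" mixing strategy can be read as weighted sampling over data modes; the mode names and weights below are hypothetical placeholders, not the report's actual categories or ratios.

```python
# Sketch of a five-mode cold-start data mixer with assumed modes and weights.
import random

MODES = {
    "non_reasoning":       0.30,
    "short_cot":           0.25,
    "long_cot":            0.20,
    "cot_with_reflection": 0.15,
    "video_reasoning":     0.10,
}

def sample_mode(rng=random):
    """Draw the data mode for the next batch according to the mixture."""
    modes, weights = zip(*MODES.items())
    return rng.choices(modes, weights=weights, k=1)[0]

# Each training batch picks its source mode from the mixture:
print([sample_mode() for _ in range(8)])
```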
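The sequence-parallel half of the hybrid strategy can be illustrated by how a long token sequence is sharded across the ranks of a sequence-parallel group, while data parallelism gives each group a different batch. This is a conceptual sketch, not Keye-VL's implementation.

```python
# Conceptual sequence-parallel sharding: each rank owns one contiguous slice
# of the sequence, so activation memory scales down with group size.
def sequence_shard(tokens, sp_rank, sp_world_size):
    """Return the contiguous slice of `tokens` owned by `sp_rank`."""
    n = len(tokens)
    chunk = (n + sp_world_size - 1) // sp_world_size  # ceil division
    start = sp_rank * chunk
    return tokens[start:start + chunk]

# A 10-token sequence split across a 4-way sequence-parallel group:
seq = list(range(10))
for rank in range(4):
    print(rank, sequence_shard(seq, rank, 4))
```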
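The dynamic load-balancing idea can be sketched as greedy bin packing: estimate each sample's compute cost (here an assumed proxy such as its visual-token count) and assign the costliest samples to the least-loaded worker first, so multimodal batches with very different frame counts still produce similar per-rank workloads.

```python
# Greedy load balancer: longest-processing-time-first assignment of samples
# to workers by estimated cost. Interface and cost model are assumptions.
import heapq

def balance(sample_costs, num_workers):
    """sample_costs: list of (sample_id, estimated_cost). Returns one bucket
    of sample ids per worker, balanced greedily by total cost."""
    heap = [(0.0, w) for w in range(num_workers)]   # (current load, worker)
    heapq.heapify(heap)
    buckets = [[] for _ in range(num_workers)]
    for sid, cost in sorted(sample_costs, key=lambda x: -x[1]):
        load, w = heapq.heappop(heap)               # least-loaded worker
        buckets[w].append(sid)
        heapq.heappush(heap, (load + cost, w))
    return buckets

print(balance([("a", 9), ("b", 7), ("c", 4), ("d", 3), ("e", 1)], 2))
```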
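Finally, a sketch of sample-level auto-resume, assuming a hypothetical sidecar file that records the global sample cursor alongside each checkpoint so a restart skips already-consumed samples instead of replaying the epoch from the beginning.

```python
# Sample-level auto-resume sketch; file layout and field names are assumptions.
import json, os

STATE_PATH = "resume_state.json"   # hypothetical checkpoint sidecar

def save_state(global_step, sample_index):
    tmp = STATE_PATH + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"step": global_step, "sample_index": sample_index}, f)
    os.replace(tmp, STATE_PATH)    # atomic swap guards against mid-write crashes

def load_state():
    if not os.path.exists(STATE_PATH):
        return {"step": 0, "sample_index": 0}
    with open(STATE_PATH) as f:
        return json.load(f)

state = load_state()
for i in range(state["sample_index"], 10):   # skip samples already consumed
    save_state(state["step"] + 1, i + 1)     # persist progress per sample
    state = load_state()
print(load_state())
```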