InternVL 3.5 is here! Shanghai AI Lab's latest open-source release takes on GPT-5 head-on while mastering efficiency
自动驾驶之心· 2025-08-27 23:33
Core Viewpoint
- Shanghai AI Lab has released the open-source multimodal model InternVL 3.5, which substantially improves on the generality, reasoning ability, and inference efficiency of earlier InternVL models [2].

Model Architecture
- InternVL 3.5 consists of three core components: a text tokenizer, an InternViT visual encoder (with dynamic high-resolution image input), and a connector that integrates the visual and language modalities [5].
- Training follows a two-stage paradigm: a large-scale pre-training phase and a multi-stage post-training phase [5][6].

Training Objectives
- The pre-training phase learns general visual-language representations from a large-scale multimodal corpus of roughly 1.16 billion samples, corresponding to about 250 billion tokens [7].
- The post-training strategy proceeds in three stages: Supervised Fine-Tuning (SFT), Cascade Reinforcement Learning (Cascade RL), and Visual Consistency Learning (ViCO) [9].

Performance Metrics
- InternVL 3.5 posts strong results across benchmarks, including MMStar, MMVet, and MMBench V1.1 [14].
- Its performance is competitive with top commercial models such as GPT-5, with marked gains on multimodal reasoning and mathematical tasks [14][15].

Testing and Deployment
- A test-time scaling method strengthens reasoning, particularly on complex tasks that require multi-step reasoning [11].
- The Decoupled Vision-Language Deployment (DvD) framework reduces hardware cost and allows new vision modules to be integrated without modifying the language-server deployment [12].
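A quick sanity check of the pre-training scale cited above: 250 billion tokens spread over roughly 1.16 billion samples works out to about 215 tokens per sample on average.

```python
samples = 1.16e9   # ~1.16 billion pre-training samples [7]
tokens = 250e9     # ~250 billion tokens [7]

tokens_per_sample = tokens / samples
print(f"~{tokens_per_sample:.0f} tokens per sample on average")
```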
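The test-time scaling idea can be illustrated with a best-of-N sketch: sample several candidate reasoning chains, score each with a verifier, and keep the highest-scoring one. The candidate generator and reward function below are hypothetical stand-ins for illustration, not the model's actual components.

```python
import random

def generate_candidates(prompt: str, n: int = 4, seed: int = 0) -> list[str]:
    # Stand-in for sampling N reasoning chains from the model.
    rng = random.Random(seed)
    return [f"{prompt} -> chain-{i} (quality {rng.random():.2f})" for i in range(n)]

def reward(candidate: str) -> float:
    # Stand-in scorer; in practice this could be a learned reward model.
    return float(candidate.split("quality ")[1].rstrip(")"))

def best_of_n(prompt: str, n: int = 4) -> str:
    # Spend extra compute at inference: more samples, then pick the best.
    candidates = generate_candidates(prompt, n)
    return max(candidates, key=reward)
```

Increasing `n` trades inference cost for answer quality, which is the essence of test-time scaling for multi-step reasoning.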
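The DvD idea described above can be sketched as two independent services: a vision side that turns images into embeddings, and a language side that consumes embeddings without knowing how they were produced. The class names and the toy embedding below are hypothetical illustrations of the decoupling pattern, not the actual framework API.

```python
class VisionServer:
    """Hosts the visual encoder + connector; scaled or swapped independently."""

    def encode(self, image_bytes: bytes) -> list[float]:
        # Toy stand-in for InternViT + connector: any image -> embedding map.
        return [b / 255.0 for b in image_bytes[:8]]

class LanguageServer:
    """Consumes pre-computed visual embeddings; its deployment is untouched
    when the vision side is upgraded -- the embedding interface is the contract."""

    def answer(self, prompt: str, visual_embedding: list[float]) -> str:
        signal = sum(visual_embedding)
        return f"{prompt} [conditioned on visual signal {signal:.2f}]"

# Decoupled pipeline: the two servers communicate only through embeddings.
vision = VisionServer()
language = LanguageServer()
emb = vision.encode(b"\x10\x20\x30\x40")
print(language.answer("Describe the image:", emb))
```

Because only the embedding format is shared, the vision server can be replaced with a new encoder, or placed on cheaper hardware, without redeploying the language server.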