Xiaohongshu Open-Sources Multimodal Large Model dots.vlm1: Unlocking New Capabilities in Image-Text Understanding and Math Problem Solving
Sohu Finance · 2025-08-07 10:31

Core Insights
- Xiaohongshu's hi lab has open-sourced its latest multimodal model, dots.vlm1, which is built on DeepSeek V3 and features a self-developed 1.2-billion-parameter visual encoder, NaViT, showcasing strong multimodal understanding and reasoning capabilities [1][6]
- dots.vlm1 has demonstrated performance close to leading models such as Gemini 2.5 Pro and Seed-VL1.5 across a range of visual evaluation benchmarks, excelling in particular on tasks such as MMMU, MathVision, and OCR Reasoning [1][4]

Model Performance
- On text reasoning tasks, dots.vlm1 performs comparably to DeepSeek-R1-0528, indicating a degree of generality in its mathematical and coding capabilities, although there is room for improvement on more diverse reasoning tasks such as GPQA [4]
- dots.vlm1's overall performance is notable, especially its visual multimodal capabilities, which approach state-of-the-art levels [4]

Benchmark Comparisons
- dots.vlm1's scores on selected benchmarks include:
  - MMMU: 80.11
  - MathVision: 69.64
  - OCR Reasoning: 66.23
  - General visual tasks: 90.85 on m3gia (cn) [5]

Model Architecture
- dots.vlm1 consists of three core components: a 1.2-billion-parameter NaViT visual encoder, a lightweight MLP adapter, and the DeepSeek V3 MoE large language model [5]
- Training proceeded in three stages: pre-training of the visual encoder, pre-training of the VLM, and post-training of the VLM, which together strengthen the model's perception and generalization capabilities [5]

Open Source and Future Plans
- dots.vlm1 has been uploaded to the open-source platform Hugging Face, where users can try the model for free [6]
- hi lab plans to further improve performance by expanding the scale and diversity of cross-modal translation data, refining the visual encoder structure, and exploring more effective neural network architectures and loss function designs [6]
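To make the three-component design described under Model Architecture more concrete, the sketch below shows how a vision encoder, an MLP adapter, and a decoder-only language model are typically chained in this class of VLM. It is a minimal illustration only: the class names, dimensions, and forward-pass interface are assumptions, not the actual dots.vlm1 implementation, and the encoder and LLM are passed in as placeholders standing in for NaViT and DeepSeek V3.

```python
import torch
import torch.nn as nn


class MLPAdapter(nn.Module):
    """Lightweight projector mapping visual features into the LLM embedding space."""

    def __init__(self, vision_dim: int, llm_dim: int, hidden_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, llm_dim),
        )

    def forward(self, visual_features: torch.Tensor) -> torch.Tensor:
        return self.proj(visual_features)


class ToyVLM(nn.Module):
    """Illustrative wiring: vision encoder -> adapter -> language model.

    `vision_encoder` stands in for a NaViT-style encoder and `language_model`
    for a MoE LLM such as DeepSeek V3; both are placeholder modules here.
    """

    def __init__(self, vision_encoder: nn.Module, adapter: MLPAdapter, language_model: nn.Module):
        super().__init__()
        self.vision_encoder = vision_encoder
        self.adapter = adapter
        self.language_model = language_model

    def forward(self, pixel_values: torch.Tensor, text_embeddings: torch.Tensor) -> torch.Tensor:
        # Encode the image into patch-level features, project them into the
        # LLM embedding space, then prepend them to the text token embeddings.
        visual_features = self.vision_encoder(pixel_values)          # (B, N_patches, vision_dim)
        visual_tokens = self.adapter(visual_features)                # (B, N_patches, llm_dim)
        inputs = torch.cat([visual_tokens, text_embeddings], dim=1)  # (B, N_patches + T, llm_dim)
        return self.language_model(inputs)
```

Because the weights are published on Hugging Face, a model of this kind is commonly pulled with the transformers library roughly as shown below. The repository ID and loading classes are assumptions for illustration only; the official model card should be consulted for the actual usage instructions.

```python
# Minimal sketch of loading an open-weights VLM from Hugging Face with the
# `transformers` library. The repository ID below is hypothetical.
from transformers import AutoModelForCausalLM, AutoProcessor

repo_id = "hi-lab/dots.vlm1"  # hypothetical ID; check the official model card

processor = AutoProcessor.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    trust_remote_code=True,   # custom VLM architectures usually ship their own modeling code
    torch_dtype="auto",
    device_map="auto",
)
```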