Open-source multimodal paradigm
LLaVA-OneVision-1.5 is fully open-sourced end to end: pre-training the 8B model takes only 4 days and $16,000
机器之心· 2025-10-13 06:37
Core Insights
- LLaVA marked a milestone in democratizing multimodal capability: by efficiently aligning an open-source visual encoder with a large language model, it made the "see-understand-converse" loop possible within a fully open ecosystem [2][5].

Group 1: LLaVA Development and Features
- LLaVA-1.5 improved understanding with larger, cleaner datasets and higher-resolution inputs, while LLaVA-NeXT extended coverage to OCR, mathematics, and multi-scenario tasks [5].
- The LLaVA-OneVision framework unifies images, documents, charts, and videos in a single model while balancing effectiveness and efficiency [5][7].
- The project stresses reproducibility along the open-source path, drawing a clear line between merely open weights and a fully reproducible model with open data, tools, and training recipe [5][6].

Group 2: Performance Metrics
- LLaVA-OV-1.5 matches or outperforms Qwen2.5-VL on several benchmarks, showing competitive or superior results across a range of multimodal tasks [7][25].
- Averaged benchmark scores confirm LLaVA-OV-1.5's strength, with notably high results on General VQA and OCR & Chart tasks [6][19].

Group 3: Data and Training Strategies
- Training follows a three-stage process: language-image alignment, high-quality knowledge injection, and visual instruction alignment, using roughly 85 million pre-training samples and 22 million instruction samples in total (a schedule sketch follows this summary) [20][25].
- Data construction relies on a concept-balancing strategy to counter sparse long-tail concepts and noisy original captions, which markedly improves benchmark scores (see the sampling sketch below) [12][13].
- Offline parallel data packing raises token utilization and cuts padding waste, reducing padding tokens by up to 11x (see the packing sketch below) [21][22].

Group 4: Engineering Optimizations
- Training combines mixed parallelism with a native-resolution strategy that preserves structural detail in text-dense regions (see the token-count sketch below) [23][24].
- The whole pipeline is designed to be straightforward to reproduce: all data, tools, scripts, and configurations are openly released [26].
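
Below is a minimal sketch of the three-stage curriculum from Group 3, written as a plain Python data structure. The stage names and the 85M pre-training / 22M instruction sample budgets come from the article; which modules are unfrozen at each stage, and how the 85M pre-training samples split across the first two stages, are assumptions for illustration, not the released recipe.

```python
# Sketch of the three-stage curriculum. Stage names and the 85M/22M sample
# budgets follow the article; per-stage trainable modules are assumptions.
TRAINING_STAGES = [
    {
        "name": "stage 1: language-image alignment",
        # typically only the vision-to-language projector is trained here (assumed)
        "trainable": ["projector"],
    },
    {
        "name": "stage 2: high-quality knowledge injection",
        # consumes the bulk of the ~85M pre-training samples (assumed split)
        "trainable": ["projector", "language_model"],
    },
    {
        "name": "stage 3: visual instruction alignment",
        # ~22M instruction samples, per the article
        "trainable": ["vision_encoder", "projector", "language_model"],
    },
]
```

Each stage would reuse the same model and progressively unfreeze more modules; the released configurations are the authoritative source for the actual settings.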
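The article does not spell out the concept-balancing formula, so the sketch below shows one generic way such balancing is often done: resample captions with weights inversely proportional to the frequency of the concepts they mention. The function name, the "rarest concept" heuristic, and the `alpha` smoothing exponent are illustrative assumptions, not the paper's exact method.

```python
import random
from collections import Counter
from typing import Dict, List

def concept_balanced_weights(samples: List[Dict], alpha: float = 0.5) -> List[float]:
    """Assign each sample a weight ~ 1 / (concept frequency)**alpha.

    Each sample carries a "concepts" list (e.g. entities detected in its
    caption). Rare long-tail concepts are up-weighted, head concepts are
    down-weighted; alpha < 1 softens the correction. Generic scheme, not
    the exact LLaVA-OneVision-1.5 procedure.
    """
    freq = Counter(c for s in samples for c in s["concepts"])
    weights = []
    for s in samples:
        if not s["concepts"]:
            weights.append(1.0)
            continue
        # treat a sample as being as rare as its rarest concept
        rarest = min(freq[c] for c in s["concepts"])
        weights.append(1.0 / (rarest ** alpha))
    return weights

# usage: draw a concept-balanced epoch by weighted sampling with replacement
samples = [
    {"caption": "a photo of a dog", "concepts": ["dog"]},
    {"caption": "a dog on grass", "concepts": ["dog", "grass"]},
    {"caption": "an axolotl in a tank", "concepts": ["axolotl", "tank"]},
]
weights = concept_balanced_weights(samples)
balanced_epoch = random.choices(samples, weights=weights, k=len(samples))
```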
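Offline data packing concatenates variable-length samples into fixed-length sequences before training, so padding is largely eliminated. A toy greedy first-fit-decreasing packer is sketched below; the production pipeline runs offline and in parallel over the whole corpus and keeps per-sample attention boundaries so packed samples do not attend to each other. The 11x figure from the article depends on the real length distribution; the toy numbers here only illustrate the mechanism.

```python
from typing import List

def pack_sequences(lengths: List[int], max_len: int) -> List[List[int]]:
    """Greedy first-fit-decreasing packing of sample lengths into bins of max_len.

    Returns, for each packed sequence, the indices of the samples placed in it.
    """
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    bins: List[List[int]] = []   # sample indices per packed sequence
    used: List[int] = []         # tokens already used per packed sequence
    for i in order:
        length = lengths[i]
        for b, u in enumerate(used):
            if u + length <= max_len:
                bins[b].append(i)
                used[b] += length
                break
        else:
            bins.append([i])
            used.append(length)
    return bins

# padding before vs. after packing, for a toy length distribution
lengths = [900, 300, 250, 120, 2048, 1500, 600, 80]
max_len = 2048
pad_unpacked = sum(max_len - L for L in lengths)
packed = pack_sequences(lengths, max_len)
pad_packed = sum(max_len - sum(lengths[i] for i in b) for b in packed)
print(f"padding tokens: {pad_unpacked} -> {pad_packed}")
```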
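The native-resolution strategy keeps images near their original size rather than resizing them to a fixed square, so dense text retains its structure at the cost of a variable number of vision tokens. The back-of-the-envelope estimator below assumes a ViT-style 14-pixel patch and a 2x2 patch-merge step, both common defaults rather than LLaVA-OneVision-1.5's confirmed settings.

```python
import math

def vision_token_count(height: int, width: int, patch: int = 14, merge: int = 2) -> int:
    """Estimate vision tokens for a native-resolution image.

    A ViT-style encoder emits one token per patch; many recent VLMs then
    merge each merge x merge block of patches into a single token before
    the language model. Both defaults are assumptions.
    """
    h_patches = math.ceil(height / patch)
    w_patches = math.ceil(width / patch)
    return math.ceil(h_patches / merge) * math.ceil(w_patches / merge)

# a dense document page keeps far more tokens at native resolution
# than when squashed into a fixed low-resolution input
print(vision_token_count(1400, 1000))  # native-resolution page
print(vision_token_count(336, 336))    # fixed 336x336 resize
```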