Multimodal Alignment

LLaVA-OneVision-1.5 open-sources the full pipeline; pretraining the 8B model takes only 4 days and $16,000
机器之心· 2025-10-13 06:37
LLaVA started by bridging a vision encoder and a large language model through low-cost alignment; LLaVA-1.5 strengthened understanding with larger, cleaner data and higher-resolution inputs; LLaVA-NeXT extended coverage to OCR, math, and multi-scenario tasks. The line then branched into LLaVA-NeXT-Video for temporal video and multi-frame reasoning, and LLaVA-NeXT-Interleave for interleaved image-text inputs and cross-image joint reasoning, before converging in LLaVA-OneVision as a unified interface covering images, documents, charts, multi-image inputs, and video, balancing quality and efficiency.

LLaVA, proposed in 2023, efficiently connected open-source vision encoders with large language models via low-cost alignment, bringing "look at an image, understand it, converse about it" multimodal capability to the open ecosystem. It markedly narrowed the gap with top closed-source models and marked a milestone for the open-source multimodal paradigm (a minimal sketch of this connector-style alignment follows below).

Although the interfaces and architectures for multimodal alignment are converging, a truly reproducible open-source path still differs from "weights-only" releases. Qwen2.5-VL and InternVL3.5 set strong baselines on OCR, document understanding, math, and cross-image reasoning, but their full data inventories, cleaning and mixing ratios, and alignment, sampling, and training schedules are only partially disclosed, making end-to-end reproduction difficult. Molmo, with a cleaner data pipeline and more careful design, in multiple evaluations ...
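As context for the low-cost alignment described above, here is a minimal, hypothetical sketch of a LLaVA-style connector: patch features from a frozen vision encoder pass through a small MLP projector into the LLM's embedding space and are concatenated with the text embeddings. The class name, dimensions, and two-layer projector are illustrative assumptions, not the released implementation.

```python
# Minimal sketch of a LLaVA-style vision-language connector (hypothetical shapes,
# not the official code): a frozen vision encoder produces patch features, a small
# MLP projector maps them into the LLM's embedding space, and the projected visual
# tokens are prepended to the text token embeddings.
import torch
import torch.nn as nn

class VisionLanguageConnector(nn.Module):
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        # Two-layer MLP projector mapping vision features to LLM embedding width.
        self.projector = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vision_dim) from a frozen vision encoder
        # text_embeds:    (batch, num_text_tokens, llm_dim) from the LLM's embedding table
        visual_tokens = self.projector(patch_features)          # (batch, num_patches, llm_dim)
        return torch.cat([visual_tokens, text_embeds], dim=1)   # joint multimodal sequence

# Usage with dummy tensors standing in for real encoder/LLM outputs.
connector = VisionLanguageConnector()
patches = torch.randn(2, 576, 1024)   # e.g. 24x24 patches from a ViT
text = torch.randn(2, 32, 4096)
sequence = connector(patches, text)   # (2, 608, 4096), fed to the LLM
```

In the alignment stage only the projector is typically trained, which is what keeps the cost low; the vision encoder and LLM weights stay frozen.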
Insta360's new panoramic survey: challenges, methods, and the future of panoramic vision
机器之心· 2025-10-04 03:38
Core Insights
- The article discusses the transition from perspective vision to panoramic vision, framing the "perspective-panorama gap" as the central lens for understanding the challenges and opportunities in this field [6][19].
- It argues that a systematic upgrade across data, models, and applications is needed to make panoramic vision technologies practically usable [16][19].

Research Background and Motivation
- The paper, "One Flight Over the Gap: A Survey from Perspective to Panoramic Vision", systematically analyzes the differences between perspective and panoramic vision, covering over 300 papers and 20 representative tasks [4][19].
- It organizes the challenges of panoramic vision into three main gaps: geometric distortion, non-uniform sampling, and boundary continuity [6][9].

Strategies Overview
- The survey identifies the main strategies for adapting tasks to panoramic vision, each targeting one of three core difficulties (a padding sketch addressing the third follows after this summary):
  1. **Geometric Distortion**: projecting spherical images onto a plane distorts shapes [7].
  2. **Non-uniform Sampling**: pixel density varies sharply across regions, affecting effective resolution [7].
  3. **Boundary Continuity**: cutting the sphere open into a 2D image breaks continuity at the image boundaries, which models must learn to handle [7].
- A cross-method comparison clarifies which strategies apply to which tasks [9][15].

Task Toolbox
- Over 20 tasks are grouped into four areas: enhancement and assessment, understanding, multi-modal, and generation, each with representative methods and key papers [12][15].
- New paradigms such as diffusion and generative models are emerging rapidly, particularly in text-to-image/video and novel view synthesis [15].

Future Directions
- Moving from "usable" to "user-friendly" requires progress on data, model paradigms, and downstream applications [16][21].
- Key challenges include:
  1. **Data Bottlenecks**: the lack of large-scale, diverse, high-quality 360° datasets limits general training and reproducible evaluation [21].
  2. **Model Paradigms**: robust models must adapt from perspective to panoramic vision while maintaining performance across tasks [21].
  3. **Downstream Applications**: spatial intelligence, XR, 3D reconstruction, and various industry sectors require effective deployment and compliance [21][22].
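The boundary-continuity gap noted above is commonly handled with wrap-around padding along the longitude axis of an equirectangular panorama, so that convolutions see the left and right image edges as the continuous sphere they represent. The sketch below is a generic illustration of that idea, not code from the survey; the layer name, kernel size, and tensor shapes are assumptions.

```python
# A minimal sketch of one common boundary-continuity strategy: circular padding
# along the longitude (width) axis of an equirectangular image before convolution,
# with ordinary zero padding along latitude (height), where no wrap-around exists.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CircularPadConv2d(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, kernel_size: int = 3):
        super().__init__()
        self.pad = kernel_size // 2
        # The convolution itself uses no padding; we pad manually, per axis.
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size, padding=0)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, height, width) equirectangular panorama
        # Wrap-around padding in width keeps the 0°/360° seam continuous.
        x = F.pad(x, (self.pad, self.pad, 0, 0), mode="circular")
        # Plain zero padding in height (the poles do not wrap).
        x = F.pad(x, (0, 0, self.pad, self.pad), mode="constant", value=0.0)
        return self.conv(x)

layer = CircularPadConv2d(3, 16)
pano = torch.randn(1, 3, 256, 512)   # 2:1 aspect ratio typical of equirectangular images
out = layer(pano)                     # (1, 16, 256, 512), seam-aware features
```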
The state of core technologies in China's multimodal large model industry in 2025: the keys are representation, translation, alignment, fusion, and collaborative learning [with charts]
Qian Zhan Wang· 2025-06-03 05:12
Core Insights
- The article discusses the core technologies of multimodal large models, focusing on representation learning, translation, alignment, fusion, and collaborative learning [1][2][7][11][14].

Representation Learning
- Representation learning is fundamental for multimodal tasks, addressing challenges such as combining heterogeneous data and handling varying noise levels across modalities [1].
- Before Transformers, different modalities required distinct representation models, such as CNNs for computer vision (CV) and LSTMs for natural language processing (NLP) [1].
- Transformers made it possible to unify multiple modalities and cross-modal tasks, leading to a surge of multimodal pre-training models after 2019 [1].

Translation
- Cross-modal translation maps a source modality to a target modality, such as generating descriptive sentences from images or vice versa [2].
- Syntactic templates allow structured predictions, with specific slots filled in based on detected attributes [2].
- Encoder-decoder architectures encode source-modality data into latent features, which a decoder then turns into the target modality [2].

Alignment
- Alignment establishes correspondences between different data modalities to improve understanding of complex scenarios [7].
- Explicit alignment categorizes instances with multiple components and measures similarity between them, using both unsupervised and supervised methods [7][8].
- Implicit alignment leverages latent representations for tasks without strict alignment, improving performance in applications such as visual question answering (VQA) and machine translation [8].

Fusion
- Fusion combines multimodal data or features for unified analysis and decision-making, improving task performance by integrating information from several modalities (a minimal early/late fusion sketch follows at the end of this summary) [11].
- Early fusion merges features at the feature level, late fusion combines outputs at the decision level, and hybrid fusion incorporates both [11][12].
- The choice of fusion method depends on the task and data; neural networks have become a popular approach for multimodal fusion [12].

Collaborative Learning
- Collaborative learning uses data from one modality to improve the model of another, and is categorized into parallel, non-parallel, and hybrid methods [14][15].
- Parallel learning requires direct associations between observations from different modalities, while non-parallel learning relies on overlapping categories [15].
- Hybrid methods connect modalities through shared datasets, allowing one modality to influence the training of another across a range of tasks [15].
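To make the early/late fusion distinction concrete, the following sketch contrasts feature-level and decision-level fusion for a toy two-modality classifier. The module names, dimensions, and equal-weight averaging of decisions are illustrative assumptions, not a reference implementation.

```python
# A minimal sketch contrasting early (feature-level) and late (decision-level)
# fusion, as described in the Fusion section above. Shapes are illustrative.
import torch
import torch.nn as nn

class EarlyFusionClassifier(nn.Module):
    """Concatenate modality features first, then decide from the joint vector."""
    def __init__(self, img_dim: int = 512, txt_dim: int = 256, num_classes: int = 10):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(img_dim + txt_dim, 256),
            nn.ReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, img_feat: torch.Tensor, txt_feat: torch.Tensor) -> torch.Tensor:
        # Feature-level fusion: a single head sees both modalities jointly.
        return self.head(torch.cat([img_feat, txt_feat], dim=-1))

class LateFusionClassifier(nn.Module):
    """Decide per modality first, then combine the decision-level scores."""
    def __init__(self, img_dim: int = 512, txt_dim: int = 256, num_classes: int = 10):
        super().__init__()
        self.img_head = nn.Linear(img_dim, num_classes)
        self.txt_head = nn.Linear(txt_dim, num_classes)

    def forward(self, img_feat: torch.Tensor, txt_feat: torch.Tensor) -> torch.Tensor:
        # Decision-level fusion: each modality votes, votes are averaged.
        return 0.5 * (self.img_head(img_feat) + self.txt_head(txt_feat))

img, txt = torch.randn(4, 512), torch.randn(4, 256)
early_logits = EarlyFusionClassifier()(img, txt)   # (4, 10) from the joint feature
late_logits = LateFusionClassifier()(img, txt)     # (4, 10) from averaged decisions
```

A hybrid design would simply combine both: use the joint-feature head alongside the per-modality heads and merge their scores.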