Visual Foundation Models
Memory footprint cut by up to 75%: U.S. Department of Energy scientists propose D-CHAG, a cross-channel hierarchical aggregation method that lets extreme-scale models run on multi-channel datasets
36Kr · 2026-02-11 09:17
Core Insights
- The article discusses D-CHAG, a distributed cross-channel hierarchical aggregation method introduced by scientists at Oak Ridge National Laboratory, aimed at enabling large-scale models to process multi-channel datasets [1][2][4].

Group 1: Methodology and Performance
- D-CHAG distributes the tokenization process and applies a hierarchical strategy for channel aggregation, allowing large-scale models to operate efficiently on multi-channel datasets [2][4].
- In evaluations on hyperspectral imaging and weather-prediction tasks on the Frontier supercomputer, D-CHAG combined with tensor parallelism and model sharding reduced memory usage by up to 75% and achieved over 2x throughput on up to 1,024 AMD GPUs [2][4].
- The method addresses the memory bottlenecks and computational-efficiency limits of multi-channel foundation-model training, cutting memory by up to 70% compared with tensor parallelism alone [4].

Group 2: Data Utilization
- The research used two representative multi-channel datasets: hyperspectral images of poplar trees (494 images with 500 spectral channels) and the ERA5 high-resolution reanalysis dataset for weather prediction, with 80 input channels derived from atmospheric and surface variables [5][6].
- The hyperspectral dataset supports biomass research and plant phenotyping, while the weather-prediction dataset was adapted for model training through regridding [5][6].

Group 3: Technical Advantages
- D-CHAG combines distributed tokenization with hierarchical channel aggregation, reducing per-layer memory usage by processing fewer channels in each cross-channel attention layer [9][11].
- This makes it practical to train larger models on high-channel datasets, supporting configurations that standard tensor parallelism previously could not handle [25].

Group 4: Comparative Analysis
- Compared against baseline methods, D-CHAG matched the training loss on hyperspectral image applications while delivering significant improvements across weather-prediction metrics [20][21].
- For models with 1.7 billion parameters, D-CHAG configurations improved performance by roughly 60% on 1,024-channel data while remaining efficient on 512-channel data [15][25].
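The memory argument behind hierarchical channel aggregation can be sketched in a few lines of NumPy. Mean pooling stands in for the paper's aggregation operator and the group size is arbitrary (both are assumptions for illustration); the point is that the largest cross-channel attention matrix shrinks from (C, C) to (C/g, C/g).

```python
import numpy as np

def cross_channel_attention(x):
    """Naive cross-channel attention: one token per channel.

    The attention-weight matrix is (channels, channels), so memory
    grows quadratically with the channel count.
    """
    scores = x @ x.T / np.sqrt(x.shape[1])
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ x

def hierarchical_aggregation(x, group_size):
    """Two-level aggregation in the spirit of D-CHAG.

    Channels are first aggregated within small groups (mean pooling
    here, purely illustrative), then attention runs over the much
    smaller set of group tokens: (C/g, C/g) instead of (C, C).
    """
    c, d = x.shape
    groups = x.reshape(c // group_size, group_size, d).mean(axis=1)
    return cross_channel_attention(groups)

rng = np.random.default_rng(0)
x = rng.standard_normal((512, 64))
full = cross_channel_attention(x)        # internally a 512x512 attention matrix
coarse = hierarchical_aggregation(x, 8)  # internally a 64x64 attention matrix
```

With 512 channels and group size 8, the dominant attention buffer shrinks by a factor of 64, which is the kind of saving that compounds with tensor parallelism and sharding.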
A vision model that both understands semantics and restores detail: NTU and SenseTime propose the Prism Hypothesis
机器之心· 2026-01-13 10:04
Core Insights
- The article introduces the Prism Hypothesis and Unified Autoencoding (UAE), which aim to harmonize semantic and pixel representations by resolving the conflict between semantic understanding and detail reconstruction [2][5][10].

Background
- Achieving both semantic understanding and detail restoration in visual foundation models remains hard; many systems are forced to combine two separate representations, lowering training efficiency and causing interference [3][4].

Key Concepts
- The Prism Hypothesis posits that representations of the world must allow for shared semantics while retaining fine-grained detail [4][5].
- Semantic encoders (e.g., DINOv2, CLIP) excel at abstract information, while pixel encoders (e.g., the SD-series VAE) are better at reconstructing details such as textures and edges [5][10].

Methodology
- Unified Autoencoding (UAE) synthesizes both representations by structuring the learning of multi-frequency latent variables, separating the roles of semantics and details [11][13]. The method involves:
  1. **Unified Encoder**: initialized from a semantic model and adapted into a unified latent space [14].
  2. **Residual Split Flow**: employs the FFT for frequency-band projection and iterative residual splitting to decompose latent variables into multiple frequency bands [15].
  3. **Frequency Band Modulator**: perturbs only the high-frequency details and integrates them for the decoder [16].
  4. **Semantic-wise Loss**: applies semantic constraints only to the lowest frequency band, leaving higher bands free to learn detail [17].

Experimental Results
- UAE demonstrates strong reconstruction quality, achieving PSNR = 33.08, SSIM = 0.94, and rFID = 0.16 on ImageNet, and PSNR = 32.84, SSIM = 0.94, and rFID = 0.17 on MS-COCO [19][20].
- Compared with the RAE baseline, UAE shows higher PSNR/SSIM and cuts rFID by over 90% [20].
- In conditional generation on ImageNet, UAE achieves gFID = 1.68 and IS = 301.6 [25].
- For semantic understanding, UAE reaches 83.0% Top-1 accuracy on ImageNet-1K, matching RAE's performance [26][27].
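The Residual Split Flow step above can be sketched with NumPy's FFT: each round, a low-pass mask extracts the coarsest remaining band and the residual is passed on. Evenly spaced cutoffs and a 1D latent are assumptions for illustration; the real method operates on learned latent tensors.

```python
import numpy as np

def split_bands(z, num_bands):
    """Iterative residual frequency split, loosely following UAE's
    Residual Split Flow description.

    z: 1D latent vector. Returns num_bands components, ordered from
    lowest to highest frequency, that sum back to z (up to
    floating-point error).
    """
    n = z.shape[0]
    freqs = np.fft.fftfreq(n)
    # Evenly spaced cutoffs between 0 and the Nyquist frequency
    # (an illustrative choice, not the paper's schedule).
    cutoffs = np.linspace(0.0, 0.5, num_bands + 1)[1:-1]
    bands, residual = [], z.astype(float)
    for cut in cutoffs:
        spec = np.fft.fft(residual)
        low = np.fft.ifft(np.where(np.abs(freqs) <= cut, spec, 0)).real
        bands.append(low)          # coarsest remaining band
        residual = residual - low  # pass the rest to the next split
    bands.append(residual)         # highest-frequency remainder
    return bands

z = np.sin(np.linspace(0, 8 * np.pi, 256))
z += 0.1 * np.random.default_rng(1).standard_normal(256)
bands = split_bands(z, 4)
```

Because each band is subtracted before the next split, the decomposition is exactly additive, which is what lets a semantic loss act on the lowest band without disturbing the detail carried by the higher ones.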
ICCV 2025 Highlight | A new paradigm for 3D ground-truth generation: automated semantic occupancy annotation for open driving scenes!
机器之心· 2025-08-29 00:15
Core Viewpoint
- The article presents AutoOcc, an innovative framework for automatic open-ended 3D semantic occupancy annotation that surpasses existing methods without requiring human labeling and shows excellent generalization [5][11][26].

Introduction
- AutoOcc was developed by the VDIG laboratory at Peking University, led by researchers Zhou Xiaoyu and Wang Yongtao, and has been recognized in top computer-vision conferences and competitions [2][4].

Problem Statement
- Generating accurate and complete semantic occupancy annotations from raw sensor data at low cost remains a significant challenge in autonomous driving and embodied intelligence [5][8].

Methodology
- AutoOcc uses a vision-language model (VLM) to create semantic attention maps for scene description and to dynamically expand the semantic list, while a self-estimating optical-flow module identifies and processes dynamic objects in temporal rendering [5][11][17].

Key Innovations
- The framework introduces a 3D Gaussian representation (VL-GS) that effectively models complete 3D geometry and semantics in driving scenarios, with superior representation efficiency, accuracy, and perception capability [6][17].

Experimental Results
- Extensive experiments indicate that AutoOcc outperforms existing automated 3D semantic occupancy annotation methods and exhibits remarkable zero-shot generalization across datasets [7][21][22].

Comparison with Existing Methods
- Against traditional pipelines that rely on human labeling and extensive post-processing, AutoOcc stands out for its speed and open-ended semantic annotation [14][21].

Performance Metrics
- The framework achieves state-of-the-art performance both on specific semantic categories and across datasets, with clear advantages in robustness and open semantic labeling [20][21].

Efficiency Evaluation
- AutoOcc notably reduces computational cost while improving annotation performance, balancing efficiency and flexibility without relying on human annotations [24][25].

Conclusion
- AutoOcc represents a significant advance in automated open-semantic 3D occupancy annotation, integrating vision-language-model guidance with differentiable 3D Gaussian techniques [26].
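The output format such pipelines target, a voxel grid where each cell holds a semantic label or marks free space, can be illustrated with a toy voxelizer. AutoOcc itself derives labels by rendering optimized 3D Gaussians rather than binning raw points; the function, its parameters, and the labels below are purely illustrative.

```python
import numpy as np

def voxelize_semantic(points, labels, voxel_size, grid_shape):
    """Toy semantic-occupancy grid: drop each labeled 3D point into
    its voxel, skipping out-of-range points. Label 0 marks free
    space; collisions resolve last-write-wins in this sketch."""
    grid = np.zeros(grid_shape, dtype=int)
    for (x, y, z), lab in zip(points, labels):
        i = int(x // voxel_size)
        j = int(y // voxel_size)
        k = int(z // voxel_size)
        if 0 <= i < grid_shape[0] and 0 <= j < grid_shape[1] and 0 <= k < grid_shape[2]:
            grid[i, j, k] = lab
    return grid

# Two labeled points, e.g. a "vehicle" (label 3) and "vegetation" (label 5).
pts = np.array([[0.1, 0.1, 0.1], [1.2, 0.1, 0.1]])
grid = voxelize_semantic(pts, [3, 5], voxel_size=1.0, grid_shape=(4, 4, 4))
```

Producing such grids automatically, with an open label set supplied by a VLM instead of a fixed taxonomy, is what distinguishes AutoOcc from fixed-category annotation pipelines.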
What exactly is a large model? Which technical areas does it cover? An in-depth primer for beginners!
自动驾驶之心· 2025-08-05 23:32
Core Insights
- The article provides a comprehensive overview of large language models (LLMs): their definitions, architectures, capabilities, and notable developments [3][6][12].

Group 1: Definition and Characteristics of LLMs
- Large language models are deep learning models trained on vast amounts of text data, capable of understanding and generating natural language [3][6].
- Key features of modern LLMs include large-scale parameters (e.g., GPT-3's 175 billion), the Transformer architecture, pre-training followed by fine-tuning, and multi-task adaptability [6][12].

Group 2: LLM Development and Architecture
- The Transformer architecture, introduced by Google in 2017, is the foundational technology behind LLMs and consists of an encoder and a decoder [9].
- Encoder-only architectures such as BERT excel at text-understanding tasks, while decoder-only architectures such as GPT are optimized for text generation [10][11].

Group 3: Core Capabilities of LLMs
- LLMs can generate coherent text, assist with coding, answer factual questions, and perform multi-step reasoning [12][13].
- They also handle text understanding and conversion tasks such as summarization and sentiment analysis [13].

Group 4: Notable LLMs and Their Features
- OpenAI's GPT series is a key player in LLM development, known for strong general capabilities and continuous innovation [15][16].
- Meta's Llama series emphasizes open-source development and multi-modal capabilities, with significant impact on the AI community [17][18].
- Alibaba's Qwen series focuses on comprehensively open-source models with strong support for Chinese and multilingual tasks [18].

Group 5: Visual Foundation Models
- Visual foundation models process visual inputs, bridging visual data and LLMs [25].
- They use architectures such as Vision Transformers (ViT) and CNN-Transformer hybrids for tasks including image classification and cross-modal understanding [26][27].

Group 6: Speech Large Models
- Speech large models handle a range of speech-related tasks and are trained on large-scale speech data [31].
- They primarily use Transformer architectures to capture long-range dependencies in speech, enabling tasks such as speech recognition and translation [32][36].

Group 7: Multi-Modal Large Models (MLLMs)
- Multi-modal large models can process and understand multiple data types, such as text, images, and audio, enabling complex interactions [39].
- Their architecture typically pairs pre-trained modal encoders with a large language model and a modal decoder for generating outputs [40].

Group 8: Reasoning Large Models
- Reasoning large models strengthen LLM reasoning through optimized prompting and external knowledge integration [43][44].
- They improve the accuracy and controllability of complex tasks without fundamentally altering the model structure [45].
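The encoder/decoder distinction in Group 2 comes down to the attention mask. A decoder-only model like GPT applies a lower-triangular mask so each position sees only itself and earlier positions. The sketch below omits the query/key/value projections of a real Transformer layer (an illustrative simplification).

```python
import numpy as np

def causal_self_attention(x):
    """Single-head causal self-attention, the core operation of
    decoder-only LLMs: a lower-triangular mask blocks attention to
    future positions. Projections omitted for brevity."""
    t, d = x.shape
    scores = x @ x.T / np.sqrt(d)
    mask = np.tril(np.ones((t, t), dtype=bool))
    scores = np.where(mask, scores, -np.inf)  # future positions get zero weight
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ x

tokens = np.random.default_rng(0).standard_normal((6, 16))
out = causal_self_attention(tokens)
```

Because position 0 can attend only to itself, its output equals its input here; an encoder-only model such as BERT simply drops the mask, letting every position attend to the full sequence.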
Clever! One classic technique gives a homegrown Chinese visual foundation model a big score boost
量子位· 2025-05-23 06:14
Core Viewpoint
- The article highlights significant advances in domestic AI, focusing on the Glint-MVT model developed by GeLing Deep Vision, which outperforms international visual foundation models such as CLIP and OpenCLIP [1][2].

Performance Evaluation
- Linear probing of the pre-trained model shows the domestic visual foundation model beating OpenCLIP by 2.3% and CLIP by 1.1% in average accuracy across 26 classification test sets [2].

Application Effectiveness
- Glint-MVT excels in downstream tasks such as image understanding and segmentation, accurately segmenting complex images and identifying objects even when partially occluded [4][8][12].

Technical Innovations
- Glint-MVT is a margin-based pretrained Vision Transformer (MVT) that introduces a Margin Softmax loss function, improving generalization by reducing the impact of data noise [13][26].
- The model constructs virtual categories by clustering large datasets such as LAION-400M into one million virtual classes, improving data-scale efficiency [28].

Model Variants
- Glint-RefSeg, built on Glint-MVT, achieves state-of-the-art (SOTA) referring-expression segmentation without extensive training data [14].
- MVT-VLM demonstrates strong image understanding, accurately identifying details such as the color and number on athletes' jerseys [15][16].

Broader Applications
- Glint-RefSeg also applies to video segmentation, maintaining accuracy in dynamic scenes, as demonstrated on a video of Bruno Mars [19][21].
- The model extends to embodied-intelligence scenarios, effectively answering contextual questions about object placement [22][25].
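Margin-based softmax losses, the family Glint-MVT's Margin Softmax belongs to, subtract a margin from the target-class logit before normalization, forcing the model to separate the target class from the rest by at least that gap. The additive form and the margin value below are assumptions for illustration, not the model's exact formulation.

```python
import numpy as np

def softmax_ce(logits, targets):
    """Plain softmax cross-entropy, for comparison."""
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

def margin_softmax_ce(logits, targets, margin=0.3):
    """Additive-margin softmax cross-entropy sketch: penalize the
    target-class logit by a fixed margin before the softmax, so the
    target logit must exceed the others by at least the margin to
    achieve the same loss as the plain version."""
    adj = logits.astype(float).copy()
    adj[np.arange(len(targets)), targets] -= margin
    return softmax_ce(adj, targets)

rng = np.random.default_rng(0)
logits = rng.standard_normal((8, 10))
targets = rng.integers(0, 10, size=8)
plain = softmax_ce(logits, targets)
with_margin = margin_softmax_ce(logits, targets, margin=0.3)
```

The margin makes the loss strictly harsher on the target class, which is why such losses widen inter-class gaps and, as the article notes, dampen the effect of noisy labels on the learned representation.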
Company Development
- GeLing Deep Vision has been a pioneer in computer vision since 2013, focusing on practical AI applications that address industry pain points, as exemplified by Glint-MVT [36][37].
- The company emphasizes balancing technical innovation with practical application rather than pursuing purely academic metrics [38][39].

Community Engagement
- GeLing Deep Vision adopts an open-source approach while maintaining a focus on innovation, aiming to foster a collaborative ecosystem that encourages community contributions [40].
- Its leadership, including the director of the algorithm research institute, emphasizes youthful thinking and practical experience as drivers of technological progress [41][42].

Industry Perspective
- The article suggests that AI development is transitioning from general exploration to specialized application, with companies like GeLing Deep Vision playing a crucial role in this evolution [44].