Model Quantization
A billion-parameter 3D model fits on a phone for the first time: 4-bit quantization, 2.5× speedup, 3.7× lower memory, 98% accuracy | ICLR'26
量子位· 2026-03-08 04:26
Core Insights
- The article discusses QuantVGGT, a quantization framework designed to compress and accelerate the Visual Geometry Grounded Transformer (VGGT) model, which has over 1 billion parameters, while maintaining high accuracy and performance [2][5][58].

Group 1: Quantization Framework
- QuantVGGT uses 4-bit quantization to achieve a 2.5× speedup and a 3.7× memory reduction while preserving 98% of the full-precision model's reconstruction accuracy [2][5][7].
- The framework introduces two main technical contributions: Dual-Smoothed Fine-Grained Quantization (DSFQ) and Noise-Filtered Diverse Sampling (NFDS) [5][9].

Group 2: Challenges in Quantization
- VGGT's unique properties, such as its data-independent special tokens and the inherent complexity of 3D data, pose significant challenges for quantization [11][12].
- The data-independent tokens produce a heavy-tailed activation distribution, complicating quantization and increasing the risk of information loss [11][12].

Group 3: Technical Contributions
- DSFQ combines a pre-global Hadamard rotation with post-local channel smoothing to mitigate the heavy-tailed distribution and inter-channel variance [5][9][30].
- NFDS uses deep-layer statistics to filter out noisy samples and build frame-aware, diverse calibration clusters, stabilizing the quantization range [5][9][40].

Group 4: Experimental Results
- Extensive experiments show that QuantVGGT outperforms existing quantization methods across benchmark datasets and bit widths [5][13][59].
- In camera pose estimation, QuantVGGT retains 99.9% of full-precision performance at 8-bit quantization and reaches an AUC@30 of 88.2 at 4-bit quantization, significantly outperforming other methods [47][50].
Group 5: Efficiency and Deployment
- The proposed framework adds minimal overhead, with only a 0.2% increase in latency while largely retaining model performance [56][58].
- The results indicate that QuantVGGT is well suited for deployment in resource-constrained environments, demonstrating its practical advantages [5][58].
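The digest describes DSFQ's global Hadamard rotation, which spreads heavy-tailed activation outliers evenly across channels so a shared 4-bit scale wastes less range on a single extreme value. A minimal numpy sketch of that general idea, not the paper's implementation; the `hadamard` and `quantize_int4` helpers are illustrative names:

```python
import numpy as np

def hadamard(n):
    # Sylvester construction of an orthonormal Hadamard matrix; n must be a power of two
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def quantize_int4(x):
    # symmetric per-tensor 4-bit quantization: integer levels in [-8, 7]
    scale = np.abs(x).max() / 7.0
    q = np.clip(np.round(x / scale), -8, 7)
    return q * scale  # dequantized values

rng = np.random.default_rng(0)
d = 64
x = rng.normal(size=d)
x[3] = 40.0  # inject one heavy-tailed outlier channel

# Direct quantization: the outlier stretches the scale, crushing normal channels
direct_err = np.abs(quantize_int4(x) - x).mean()

# Rotate first: the orthonormal Hadamard transform spreads the outlier's energy
H = hadamard(d)
rotated = H @ x
rot_err = np.abs(H.T @ quantize_int4(rotated) - x).mean()
print(f"direct: {direct_err:.3f}  rotated: {rot_err:.3f}")
```

Because the rotation is orthonormal, it can be folded into adjacent weight matrices at no runtime cost, which is why rotation-based schemes (e.g. QuaRot-style methods) add little latency.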
After five years, Transformers v5 finally arrives
自动驾驶之心· 2025-12-04 03:03
Core Insights
- The article discusses the release of Transformers v5.0.0rc0, marking a significant evolution of the AI infrastructure library after a five-year development cycle from v4 to v5 [3].
- Since the v4 release in November 2020, daily downloads have grown from 20,000 to over 3 million, and total installations have surpassed 1.2 billion [3].
- The new version focuses on four key dimensions: simplicity, a transition from fine-tuning to pre-training, interoperability with high-performance inference engines, and making quantization a core feature [3].

Simplification
- The team's primary focus is simplicity, aiming for a clean and clear integration path for models that enhances standardization, versatility, and community support [5][6].
- The library has adopted a modular design, easing maintenance, speeding integration, and promoting collaboration within the community [10].

Model Updates
- Transformers serves as a toolbox of model architectures, with the goal of including all the latest models and becoming the trusted source for model definitions [7].
- Over the past five years, an average of one to three new models has been added weekly [8].

Model Conversion Tools
- Hugging Face is developing tools that identify similarities between new models and existing architectures, aiming to automate conversion into the Transformers format [13][14].

Training Enhancements
- v5 emphasizes pre-training support, with redesigned model initialization and broader compatibility with optimization operators [20].
- Hugging Face continues to collaborate with fine-tuning tools in the Python ecosystem and is ensuring compatibility with tools in the JAX ecosystem [21].

Inference Improvements
- Inference is a key optimization area in v5, with dedicated kernels, cleaner default settings, new APIs, and enhanced support for inference engines [22][25].
- The goal is not to replace specialized inference engines but to interoperate with them [25].

Local Deployment
- The team collaborates with popular inference engines so that models added to Transformers are immediately available and can leverage those engines' advantages [27].
- Hugging Face is also working on local inference capabilities, allowing models to run directly on devices, with expanding support for multimodal models [28].

Quantization
- Quantization is becoming a standard part of modern model development, with many state-of-the-art models released in low-precision formats such as 8-bit and 4-bit [29].
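The 8-bit weight formats mentioned above typically rely on symmetric per-channel quantization: each output channel stores int8 integers plus one float scale, cutting storage 4× versus float32. A minimal numpy sketch of the general technique, not any specific library's implementation; the helper names are illustrative:

```python
import numpy as np

def quantize_per_channel_int8(W):
    # Symmetric per-output-channel int8 quantization of a weight matrix.
    # Each row gets its own scale so one large channel cannot degrade the rest.
    scales = np.abs(W).max(axis=1, keepdims=True) / 127.0
    scales = np.where(scales == 0, 1.0, scales)  # guard all-zero rows
    q = np.round(W / scales).astype(np.int8)
    return q, scales.astype(np.float32)

def dequantize(q, scales):
    return q.astype(np.float32) * scales

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 32)).astype(np.float32)

q, s = quantize_per_channel_int8(W)
W_hat = dequantize(q, s)

# int8 payload is 4x smaller; round-to-nearest error is bounded by scale/2 per element
max_err = np.abs(W - W_hat).max()
print(f"storage ratio: {W.nbytes / q.nbytes:.0f}x  max error: {max_err:.5f}")
```

4-bit formats follow the same recipe with smaller groups (e.g. one scale per 32 or 128 weights) to keep the coarser grid accurate.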