Low-Bit Quantization
NeurIPS 2025 Spotlight | NYU proposes QSVD: compression by pure math makes models lighter, faster, and more stable
机器之心· 2025-11-15 09:23
Core Insights
- The article discusses QSVD, a framework for efficient compression of Vision-Language Models (VLMs) that combines singular value decomposition (SVD) with quantization, aiming to reduce computational cost while maintaining model performance [3][29].

Group 1: Background and Motivation
- Vision-Language Models serve as a crucial engine connecting visual understanding and language generation, enabling applications such as image description and visual question answering [2].
- Their large parameter counts, often in the billions, impose heavy memory and compute demands and make practical deployment difficult [2][6].

Group 2: QSVD Framework
- QSVD applies a joint SVD over the Query-Key-Value (QKV) projection matrices, yielding a unified low-rank approximation that reduces both storage and computation (a minimal factorization sketch follows this summary) [10][24].
- It also introduces cross-layer rank allocation, which distributes the rank budget according to the importance of each layer, optimizing the compression process [13][14].

Group 3: Technical Innovations
- QSVD integrates low-bit quantization with outlier smoothing to improve hardware efficiency while preserving accuracy through the quantization step (a smoothing sketch also follows) [15][18].
- Because K and V share the low-rank representation, only that shared latent needs to be cached, roughly halving the KV-cache memory footprint during inference [12][19].

Group 4: Experimental Results
- Evaluations on models including LLaVA-v1.5 and SmolVLM show QSVD reaching over 10% higher accuracy than existing SVD-based methods such as ASVD and SVD-LLM [20][22].
- The results indicate that QSVD not only compresses models but can also improve accuracy, with inference speed-ups of up to 13x [23][19].

Group 5: Conclusion and Future Directions
- QSVD marks a significant advance in efficient VLM compression, focusing on the self-attention layers to improve inference efficiency with minimal accuracy loss [29].
- Future work aims to extend the approach to cross-module joint compression and adaptive optimization, further improving the deployability and accessibility of large models [29].
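The joint factorization at the heart of the summary above can be pictured with a short sketch. Everything below is an illustrative assumption layered on that description (the function name, the use of torch.linalg.svd, and the fixed rank are mine, not the paper's exact algorithm): the Q, K, and V projection weights of one attention layer are stacked and decomposed once, so all three share a single low-rank down-projection, and the shared latent is what gets cached instead of separate K/V tensors.

```python
# Illustrative sketch of joint low-rank factorization over the Q/K/V projections
# of one attention layer, in the spirit of the "Joint SVD" described above.
import torch

def joint_qkv_svd(w_q: torch.Tensor, w_k: torch.Tensor, w_v: torch.Tensor, rank: int):
    """Factor the stacked [W_q; W_k; W_v] matrix into a shared down-projection
    and one up-projection per original matrix.

    w_q, w_k, w_v: (d_out, d_in) projection weights of one attention layer.
    rank:          retained rank r << d_in.
    Returns (down, up_q, up_k, up_v) with down: (rank, d_in), up_x: (d_out, rank),
    so that  W_x ≈ up_x @ down  for x in {q, k, v}.
    """
    stacked = torch.cat([w_q, w_k, w_v], dim=0)           # (3*d_out, d_in)
    u, s, vh = torch.linalg.svd(stacked, full_matrices=False)
    u_r = u[:, :rank] * s[:rank]                          # absorb singular values
    down = vh[:rank, :]                                   # shared low-rank basis
    d_out = w_q.shape[0]
    up_q, up_k, up_v = u_r.split(d_out, dim=0)
    return down, up_q, up_k, up_v

# At inference, the shared latent  z = x @ down.T  is computed once per token;
# Q, K, V are then cheap rank-r expansions of z. Caching z instead of separate
# K and V tensors is what roughly halves the KV-cache footprint described above.
x = torch.randn(4, 1024)                                  # 4 tokens, d_in = 1024 (illustrative sizes)
w_q, w_k, w_v = (torch.randn(1024, 1024) for _ in range(3))
down, up_q, up_k, up_v = joint_qkv_svd(w_q, w_k, w_v, rank=256)
z = x @ down.T                                            # (4, 256): the shared, cacheable latent
q, k, v = z @ up_q.T, z @ up_k.T, z @ up_v.T              # (4, 1024) each
```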
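Group 3 mentions outlier smoothing ahead of low-bit quantization. A minimal sketch of the general idea, in the spirit of SmoothQuant-style scale migration; the alpha exponent and the calibration statistics are illustrative assumptions, and QSVD's exact scheme may differ:

```python
# Per-channel activation outliers are migrated into the weights by a diagonal
# rescaling, so both tensors become easier to quantize; the product is unchanged.
import torch

def smooth_outliers(w: torch.Tensor, act_absmax: torch.Tensor, alpha: float = 0.5):
    """w: (d_out, d_in) weight; act_absmax: (d_in,) per-channel activation |max| from calibration.
    Returns (w_smoothed, inv_scale) with  y = (x * inv_scale) @ w_smoothed.T  unchanged."""
    w_absmax = w.abs().amax(dim=0).clamp(min=1e-5)          # per-input-channel weight range
    scale = (act_absmax.clamp(min=1e-5) ** alpha) / (w_absmax ** (1 - alpha))
    w_smoothed = w * scale                                   # scale folded into the weights
    inv_scale = 1.0 / scale                                  # folded into activations (or the previous layer)
    return w_smoothed, inv_scale

w = torch.randn(1024, 1024)
act_absmax = torch.rand(1024) * 20 + 0.1                     # pretend calibration stats with outliers
w_s, inv_s = smooth_outliers(w, act_absmax)
x = torch.randn(4, 1024)
assert torch.allclose(x @ w.T, (x * inv_s) @ w_s.T, atol=1e-3)   # numerically the same layer
```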
Some thoughts on trends in chip design for on-device large models...
自动驾驶之心· 2025-10-23 00:04
Core Insights
- The article discusses the evolution of algorithms from the perspective of the chip-design industry, focusing on advances in attention mechanisms and their implications for future chip designs [2][4].

Group 1: Attention Mechanism Evolution
- The Transformer architecture dominates the large-model field, but its self-attention mechanism poses significant computational challenges, particularly in the compute and power requirements of the prefill and decode phases [4].
- Various improvements to the Transformer structure have been proposed, such as Performer, Reformer, and Informer, but none has seen widespread adoption in the absence of strong demand [4].
- Linear attention mechanisms aim to reduce computational complexity to linear in sequence length, with models such as RWKV and Mamba following this approach [5].

Group 2: Dynamic Sparsity and MoE Technology
- Dynamic sparsity, particularly through Mixture-of-Experts (MoE) technology, has gained traction: only a subset of experts is activated for each token during inference, yielding better performance at lower computational cost (a minimal routing sketch follows this summary) [8].
- The trend toward higher sparsity in MoE models, such as Ant Group's recent releases, marks a significant shift for the industry and raises memory-capacity and bandwidth requirements [9].

Group 3: Low-Bit Quantization
- Low-bit quantization techniques such as FP8 training open new avenues for model efficiency, with weight-only quantization used to relieve the bandwidth bottleneck (a group-wise weight-only sketch also follows) [11].
- The article highlights the importance of fine-grained quantization and the potential of mixed quantization strategies for optimizing model performance, especially in MoE models [12].

Group 4: Token Compression
- Token compression has emerged as a key lever for reducing the computational burden of large models, particularly for visual tokens, which exhibit high redundancy (a pruning sketch follows as well) [14].
- The article notes a surge in research on token-compression techniques, which could significantly affect chip design by lowering the deployment barrier for large models [14].

Group 5: Future Implications for Chip Design
- Advances in attention mechanisms, dynamic sparsity, low-bit quantization, and token compression are expected to substantially shape the design of future edge chips, which have lagged behind the development of large models [14].
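The "only a subset of experts is activated" behaviour in Group 2 comes from top-k routing. A minimal sketch under assumed hyperparameters (8 experts, k = 2, softmax re-weighting over the selected experts); this is a generic illustration, not any particular production MoE:

```python
# Top-k expert routing: each token runs through only k of n_experts expert MLPs,
# which is the dynamic sparsity discussed above.
import torch
import torch.nn as nn

class TopKMoE(nn.Module):
    def __init__(self, d_model: int, d_ff: int, n_experts: int = 8, k: int = 2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:      # x: (n_tokens, d_model)
        logits = self.gate(x)                                 # (n_tokens, n_experts)
        top_vals, top_idx = logits.topk(self.k, dim=-1)
        weights = top_vals.softmax(dim=-1)                    # renormalize over the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = top_idx[:, slot] == e                  # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(1) * expert(x[mask])
        return out

moe = TopKMoE(d_model=512, d_ff=2048)
y = moe(torch.randn(16, 512))                                 # only 2 of 8 experts run per token
```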
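Group 3's weight-only quantization trades a small amount of accuracy for a large cut in the bytes moved per matmul. Below is a minimal sketch of symmetric, group-wise 4-bit quantization; the group size, bit width, and int8 storage of the codes are illustrative assumptions rather than a specific vendor recipe:

```python
# Weight-only, group-wise low-bit quantization: store small integer codes plus one
# scale per (row, group); dequantize just before the matmul.
import torch

def quantize_weight_groupwise(w: torch.Tensor, bits: int = 4, group_size: int = 128):
    """Symmetric per-group quantization along the input dimension.
    w: (d_out, d_in) weights; d_in must be divisible by group_size."""
    qmax = 2 ** (bits - 1) - 1                       # e.g. 7 for 4-bit symmetric
    d_out, d_in = w.shape
    groups = w.reshape(d_out, d_in // group_size, group_size)
    scales = (groups.abs().amax(dim=-1, keepdim=True) / qmax).clamp(min=1e-8)
    codes = torch.clamp(torch.round(groups / scales), -qmax - 1, qmax).to(torch.int8)
    return codes, scales.to(torch.float16)           # what actually crosses the memory bus

def dequantize(codes: torch.Tensor, scales: torch.Tensor) -> torch.Tensor:
    """Reconstruct an approximate floating-point weight tile."""
    d_out, n_groups, group_size = codes.shape
    return (codes.float() * scales.float()).reshape(d_out, n_groups * group_size)

w = torch.randn(4096, 4096)
codes, scales = quantize_weight_groupwise(w)         # ~4 bits/weight plus small scale overhead
w_hat = dequantize(codes, scales)
print((w - w_hat).abs().mean())                      # per-element quantization error, typically small
```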
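Group 4's visual-token compression can be illustrated by dropping the image tokens that receive the least attention from the text tokens. The keep ratio and the mean-attention importance score below are illustrative assumptions, not a specific published method:

```python
# Prune redundant visual tokens: keep only the fraction that receives the most
# attention from the query/text tokens, shrinking the sequence fed to later layers.
import torch

def prune_visual_tokens(visual_tokens: torch.Tensor,
                        attn_to_visual: torch.Tensor,
                        keep_ratio: float = 0.25):
    """visual_tokens: (n_vis, d); attn_to_visual: (n_query, n_vis) attention weights.
    Keeps the keep_ratio fraction of visual tokens with the highest average attention."""
    scores = attn_to_visual.mean(dim=0)                       # (n_vis,) importance per token
    n_keep = max(1, int(keep_ratio * visual_tokens.shape[0]))
    keep_idx = scores.topk(n_keep).indices.sort().values      # preserve original token order
    return visual_tokens[keep_idx], keep_idx

vis = torch.randn(576, 1024)                                  # e.g. 24x24 patch tokens
attn = torch.rand(32, 576).softmax(dim=-1)                    # attention from 32 text tokens
kept, idx = prune_visual_tokens(vis, attn)                    # 576 -> 144 tokens fed onward
```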