NeurIPS 2025 Spotlight | NYU Proposes QSVD: Purely Mathematical Compression Makes Models Lighter, Faster, and More Stable
机器之心 · 2025-11-15 09:23
Core Insights
- The article introduces QSVD, a framework for efficiently compressing Vision-Language Models (VLMs) that combines singular value decomposition (SVD) with quantization, aiming to reduce computational cost while preserving model performance [3][29].

Group 1: Background and Motivation
- Vision-Language Models (VLMs) are a key engine connecting visual understanding and language generation, powering applications such as image captioning and visual question answering [2].
- The large parameter counts of these models, often in the billions, impose heavy memory and compute demands, making practical deployment challenging [2][6].

Group 2: QSVD Framework
- QSVD applies a joint SVD over the Query-Key-Value (QKV) projection matrices, yielding a unified low-rank approximation that reduces both storage and computation (a minimal sketch follows this summary) [10][24].
- The framework introduces cross-layer rank allocation, which assigns rank budgets according to the importance of each layer, optimizing the overall compression (see the allocation sketch below) [13][14].

Group 3: Technical Innovations
- QSVD integrates low-bit quantization with outlier smoothing to improve hardware efficiency while maintaining high accuracy through the quantization process (see the smoothing sketch below) [15][18].
- By caching only a shared low-rank representation of the K/V values, the method roughly halves the KV-cache memory footprint during inference [12][19].

Group 4: Experimental Results
- Evaluations on models including LLaVA-v1.5 and SmolVLM show that QSVD achieves over 10% higher accuracy than existing SVD-based methods such as ASVD and SVD-LLM [20][22].
- The results indicate that QSVD not only compresses models but can also preserve or even improve their capability, with inference speedups of up to 13x [23][19].

Group 5: Conclusion and Future Directions
- QSVD represents a significant advance in efficient VLM compression, focusing on the self-attention layers to improve inference efficiency while minimizing accuracy loss [29].
- Future research aims to extend the approach to cross-module joint compression and adaptive optimization, further improving the deployability and accessibility of powerful models [29].
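The joint QKV factorization described in Group 2 can be illustrated with a small PyTorch sketch. This is a minimal illustration under assumed square (d_model x d_model) projection weights, not the paper's exact procedure: the three projections are stacked, decomposed once, and truncated to a shared rank so that a single low-rank activation serves Q, K, and V.

```python
# Minimal sketch of joint low-rank factorization over the Q/K/V projections.
# Shapes and the plain truncated SVD are illustrative assumptions; per-head
# handling and calibration follow the paper, not this sketch.
import torch

def joint_qkv_svd(w_q, w_k, w_v, rank):
    """Factor the stacked [W_q; W_k; W_v] into a shared down-projection (V_r)
    and a combined up-projection (U_r) of the given rank."""
    w_qkv = torch.cat([w_q, w_k, w_v], dim=0)            # (3*d_model, d_model)
    U, S, Vh = torch.linalg.svd(w_qkv, full_matrices=False)
    U_r = U[:, :rank] * S[:rank]                          # (3*d_model, rank)
    V_r = Vh[:rank, :]                                    # (rank, d_model)
    return U_r, V_r

d = 512
w_q, w_k, w_v = (torch.randn(d, d) for _ in range(3))
U_r, V_r = joint_qkv_svd(w_q, w_k, w_v, rank=128)

x = torch.randn(4, d)                                     # a batch of token states
shared = x @ V_r.T                                        # one shared low-rank activation
q, k, v = (shared @ U_r[i * d:(i + 1) * d, :].T for i in range(3))
```

Because K and V are recovered from the same shared activation, only that shared tensor needs to be cached at inference time, which is the source of the KV-cache reduction noted in Group 3 [12][19].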
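Cross-layer rank allocation can likewise be sketched as spending a global rank budget where it buys the most. The importance criterion used below (marginal squared singular-value energy) is an assumed proxy for illustration; QSVD's actual allocation rule is defined in the paper.

```python
# Hedged sketch of cross-layer rank allocation: layers whose spectra decay
# slowly receive more rank. The energy-based gain is an illustrative proxy,
# not the paper's criterion.
import torch

def allocate_ranks(singular_values_per_layer, total_budget, min_rank=8):
    """Greedily spend a global rank budget on the layer whose next singular
    value contributes the most energy."""
    n_layers = len(singular_values_per_layer)
    ranks = [min_rank] * n_layers
    budget = total_budget - min_rank * n_layers
    while budget > 0:
        # Marginal gain of granting one more rank to each layer.
        gains = [
            (sv[r] ** 2).item() if r < len(sv) else 0.0
            for sv, r in zip(singular_values_per_layer, ranks)
        ]
        best = max(range(n_layers), key=lambda i: gains[i])
        if gains[best] == 0.0:
            break
        ranks[best] += 1
        budget -= 1
    return ranks

# Example: 4 layers with different spectra, 256 total ranks to distribute.
spectra = [torch.sort(torch.rand(512), descending=True).values for _ in range(4)]
print(allocate_ranks(spectra, total_budget=256))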
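The outlier-smoothing step paired with low-bit quantization (Group 3) can be sketched in a SmoothQuant-style form: per-channel activation outliers are migrated into the weights so both tensors quantize well. The alpha exponent and the symmetric int8 fake-quantizer below are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch of outlier smoothing before low-bit quantization (y = x @ w convention).
# alpha and the int8 fake-quantizer are assumptions for illustration.
import torch

def smooth_and_quantize(x, w, alpha=0.5, n_bits=8):
    """Migrate per-channel activation outliers into the weights, then apply
    symmetric fake-quantization to both tensors."""
    act_max = x.abs().amax(dim=0).clamp(min=1e-5)         # per input-channel activation range
    w_max = w.abs().amax(dim=1).clamp(min=1e-5)           # per input-channel weight range
    scale = (act_max ** alpha) / (w_max ** (1 - alpha))   # smoothing factors, shape (d_in,)
    x_s = x / scale                                        # activations become easier to quantize
    w_s = w * scale[:, None]                               # weights absorb the scale: x_s @ w_s == x @ w

    def fake_quant(t):
        qmax = 2 ** (n_bits - 1) - 1
        s = t.abs().max() / qmax
        return (t / s).round().clamp(-qmax, qmax) * s      # quantize, then dequantize

    return fake_quant(x_s), fake_quant(w_s)

# Activations with a few outlier channels, as is typical in transformer layers.
x = torch.randn(16, 512) * (torch.rand(512) * 10)
w = torch.randn(512, 512)
x_q, w_q = smooth_and_quantize(x, w)
print((x @ w - x_q @ w_q).abs().mean())                   # residual error after smoothing + quantization
```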