Core Insights
- Huawei's Noah's Ark Lab has developed a new inference acceleration framework, Vision-Aware Speculative Decoding (ViSpec), that achieves up to 3.22x speedup for vision-language models (VLMs) without sacrificing generation quality [3][8][25].

Group 1: Current Challenges in VLMs
- Speculative decoding has become a standard method for accelerating large language model (LLM) inference, but its application to VLMs has been limited: existing methods achieve less than 1.5x speedup [2][4].
- The primary challenge lies in processing visual information: VLMs convert images into large numbers of "visual tokens," which makes draft models inefficient [6][4].

Group 2: ViSpec Framework Innovations
- ViSpec introduces a lightweight visual adapter that compresses image embeddings into compact visual representations, significantly improving the draft model's efficiency [9][11][12].
- A global visual feature injection mechanism maintains the influence of visual context throughout text generation, effectively overcoming the "lost-in-the-middle" problem [13][15][17].
- The team developed a novel data generation method to build high-quality training datasets, enabling the draft model to produce longer and more detailed responses [18][20].

Group 3: Experimental Results
- Extensive experiments on mainstream VLMs, including LLaVA-1.6 and Qwen2.5-VL, show that ViSpec achieves speedups ranging from 1.85x to 3.22x, averaging over 2.5x [22][24].
- Ablation studies confirm that each component contributes significantly to the overall speedup, with image embedding compression alone providing a 30% performance boost [26][27][28].
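The summary refers to speculative decoding without describing the mechanism. As background, here is a minimal, illustrative sketch of the greedy variant: a fast draft model proposes a few tokens, and the slow target model verifies them, keeping the longest agreeing prefix. The `draft_next` and `target_next` callables are hypothetical stand-ins for real models; this is not ViSpec's implementation, which additionally conditions the draft model on compressed visual features.

```python
from typing import Callable, List

def speculative_decode(
    target_next: Callable[[List[int]], int],  # slow target model: greedy next token
    draft_next: Callable[[List[int]], int],   # fast draft model: greedy next token
    prompt: List[int],
    max_new_tokens: int,
    k: int = 4,                               # draft tokens proposed per round
) -> List[int]:
    """Greedy speculative decoding: the draft proposes k tokens, the target
    verifies them; the accepted prefix is kept, and the first mismatch is
    replaced by the target's own token. Output equals pure target decoding."""
    seq = list(prompt)
    while len(seq) - len(prompt) < max_new_tokens:
        # 1. Draft proposes k tokens autoregressively (cheap).
        proposal, ctx = [], list(seq)
        for _ in range(k):
            t = draft_next(ctx)
            proposal.append(t)
            ctx.append(t)
        # 2. Target verifies: accept the longest prefix it agrees with.
        accepted = 0
        for i in range(k):
            expected = target_next(seq + proposal[:i])
            if proposal[i] == expected:
                accepted += 1
            else:
                # 3. Keep accepted prefix, substitute the target's token.
                seq.extend(proposal[:accepted] + [expected])
                break
        else:
            seq.extend(proposal)  # all k draft tokens accepted
        if len(seq) - len(prompt) > max_new_tokens:
            seq = seq[: len(prompt) + max_new_tokens]
    return seq
```

Because every emitted token is either verified or produced by the target model, the output is identical to decoding with the target alone; speedup comes from verifying k draft tokens in one (parallelizable) pass, which is exactly why an inaccurate draft model on visual tokens (the problem ViSpec addresses) caps the achievable acceleration.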
Group 4: Future Outlook
- The introduction of ViSpec marks a significant advance in VLM inference acceleration, paving the way for practical deployment on edge devices such as smartphones and smart-home systems and for more responsive human-machine interaction [29][30].
Up to 3.2x speedup for multimodal inference! Huawei Noah's Ark Lab's new algorithm accepted to NeurIPS 2025