Audio-Visual Separation SOTA Sped Up 6x: Tsinghua Releases the First High-Performance 6M-Parameter Model
36Kr·2026-02-13 07:58

Core Insights
- Tsinghua University's Dolphin model breaks the "high performance must mean high energy consumption" bottleneck: with only about 6 million parameters, it delivers a more than sixfold speedup while maintaining high-quality audio-visual speech separation [1][2][14]

Group 1: Model Innovations
- Dolphin introduces DP-LipCoder, a novel dual-path discrete visual encoder that uses vector quantization to extract high-quality visual semantics while remaining lightweight (see the VQ sketch after this summary) [4][7]
- The model employs a Global-Local Attention (GLA) module that performs global and local feature modeling in a single forward pass, eliminating time-consuming iterative refinement (sketched below) [8][10]
- Dolphin replaces traditional masking strategies with direct feature regression, improving signal fidelity and yielding a significant gain on the SI-SNRi metric (both contrasted in the sketches below) [10]

Group 2: Performance Metrics
- Dolphin outperforms existing state-of-the-art (SOTA) models across multiple benchmark datasets, reaching an SI-SNRi of 16.8 dB on LRS2 and surpassing IIANet and AV-Mossformer2 [11][14]
- Its total parameter count is just 6.22 million, more than 50% smaller than IIANet's 15.01 million, and its GPU inference latency is only 33.24 milliseconds per 1 second of audio, far faster than competing models (a measurement sketch follows) [14]
- In subjective listening tests, Dolphin scored a mean opinion score (MOS) of 3.86, indicating clearer and more natural audio than other models [14]

Group 3: Application Potential
- These advances open a practical path to deploying high-precision speech separation in resource-constrained settings such as smart glasses and mobile devices [13]
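The article does not reproduce DP-LipCoder's architecture, but its core ingredient, vector quantization, can be sketched in a few lines of PyTorch. Everything below (the class name, codebook size, and feature dimension) is an illustrative assumption, not Dolphin's actual implementation:

```python
import torch
import torch.nn as nn


class VectorQuantizer(nn.Module):
    """Minimal VQ layer: snaps each input vector to its nearest codebook entry.

    Illustrative only -- codebook size and dimensions are assumptions,
    not values from the Dolphin/DP-LipCoder paper.
    """

    def __init__(self, num_codes: int = 512, dim: int = 64):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)
        nn.init.uniform_(self.codebook.weight, -1.0 / num_codes, 1.0 / num_codes)

    def forward(self, z: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        # z: (batch, time, dim) continuous visual features
        flat = z.reshape(-1, z.shape[-1])                     # (B*T, dim)
        # Squared L2 distance from each vector to every codebook entry
        dists = (
            flat.pow(2).sum(1, keepdim=True)
            - 2 * flat @ self.codebook.weight.t()
            + self.codebook.weight.pow(2).sum(1)
        )                                                     # (B*T, num_codes)
        indices = dists.argmin(dim=1)                         # discrete codes
        quantized = self.codebook(indices).view_as(z)
        # Straight-through estimator so gradients still reach the encoder
        quantized = z + (quantized - z).detach()
        return quantized, indices.view(z.shape[:-1])


vq = VectorQuantizer()
feats = torch.randn(2, 25, 64)        # e.g. 25 video frames of lip features
quantized, codes = vq(feats)
print(quantized.shape, codes.shape)   # (2, 25, 64) and (2, 25)
```

The discrete codes are what makes such an encoder compact: downstream modules consume a small vocabulary of quantized visual tokens rather than raw continuous features.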
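Likewise, the GLA module's internals are not detailed in the article; the sketch below only illustrates the general pattern of fusing full-sequence (global) attention with windowed (local) attention in a single forward pass. The window size and sum-based fusion are assumptions:

```python
import torch
import torch.nn as nn


class GlobalLocalAttention(nn.Module):
    """Fuses global and windowed local self-attention in one forward pass.

    A generic sketch, not the actual Dolphin GLA module: the fusion
    strategy (a simple sum) and window size are assumptions.
    """

    def __init__(self, dim: int = 64, heads: int = 4, window: int = 16):
        super().__init__()
        self.window = window
        self.global_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.local_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim)
        b, t, d = x.shape
        g, _ = self.global_attn(x, x, x)          # attends over the full sequence

        # Local branch: fold time into non-overlapping windows, attend within each
        pad = (-t) % self.window
        xp = nn.functional.pad(x, (0, 0, 0, pad))             # pad the time axis
        w = xp.view(b * (xp.shape[1] // self.window), self.window, d)
        l, _ = self.local_attn(w, w, w)
        l = l.view(b, -1, d)[:, :t]                           # unfold, drop padding

        return self.norm(x + g + l)               # fuse branches (assumed: sum)


gla = GlobalLocalAttention()
out = gla(torch.randn(2, 100, 64))
print(out.shape)  # (2, 100, 64)
```

Because both branches run once per forward pass, this layout avoids the repeated refinement loops that make iterative separation models slow.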
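The contrast between mask-based separation and direct feature regression can be made concrete. The two heads below are generic stand-ins; the article only states that Dolphin regresses target features directly rather than predicting a mask:

```python
import torch
import torch.nn as nn

dim, t = 64, 100
mix_feats = torch.randn(1, t, dim)        # encoded mixture features

# Traditional masking: predict a bounded mask, multiply it onto the mixture
mask_head = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())
masked = mask_head(mix_feats) * mix_feats

# Direct regression (per the article's description of Dolphin): predict the
# target features outright, with no multiplicative bottleneck on the output
regress_head = nn.Linear(dim, dim)
regressed = regress_head(mix_feats)

print(masked.shape, regressed.shape)      # both (1, 100, 64)
```

A mask can only rescale what is already in the mixture features, whereas a regression head can reconstruct components the mask would attenuate, which is the fidelity argument behind the SI-SNRi gain.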
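SI-SNRi, the metric cited throughout, is the scale-invariant SNR of the separated output minus that of the unprocessed mixture. A self-contained reference implementation, with synthetic signals standing in for real speech:

```python
import torch


def si_snr(est: torch.Tensor, target: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Scale-invariant SNR in dB for 1-D signals (means removed first)."""
    est = est - est.mean()
    target = target - target.mean()
    # Project the estimate onto the target to remove any scale difference
    s_target = (torch.dot(est, target) / (target.pow(2).sum() + eps)) * target
    e_noise = est - s_target
    return 10 * torch.log10(s_target.pow(2).sum() / (e_noise.pow(2).sum() + eps))


# SI-SNRi = SI-SNR of the separated output minus SI-SNR of the raw mixture
target = torch.randn(16000)                   # 1 s of clean speech at 16 kHz
mixture = target + 0.5 * torch.randn(16000)   # interference added
estimate = target + 0.1 * torch.randn(16000)  # hypothetical separator output
improvement = si_snr(estimate, target) - si_snr(mixture, target)
print(f"SI-SNRi: {improvement:.2f} dB")
```

On this scale, Dolphin's reported 16.8 dB on LRS2 means its output is, on average, 16.8 dB closer to the clean reference than the raw mixture was.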
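Finally, a GPU latency figure like 33.24 ms per second of audio (a real-time factor of about 0.03) is typically measured with warm-up runs and explicit synchronization. A sketch assuming a CUDA device, with a trivial stand-in model in place of Dolphin:

```python
import time
import torch

model = torch.nn.Conv1d(1, 1, kernel_size=3, padding=1).cuda().eval()  # stand-in, not Dolphin
audio = torch.randn(1, 1, 16000, device="cuda")   # 1 s of 16 kHz audio

with torch.no_grad():
    for _ in range(10):                # warm-up runs (exclude allocation overhead)
        model(audio)
    torch.cuda.synchronize()           # drain queued kernels before starting the clock
    start = time.perf_counter()
    runs = 100
    for _ in range(runs):
        model(audio)
    torch.cuda.synchronize()           # wait for the last kernel before stopping it
    latency_ms = (time.perf_counter() - start) / runs * 1000

print(f"mean latency: {latency_ms:.2f} ms per 1 s of audio")
print(f"real-time factor: {latency_ms / 1000:.3f}")
```

Without the synchronize calls, CUDA's asynchronous execution would make the timer report only kernel-launch time, badly understating the true latency.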
