Tongyi Releases End-to-End Voice Interaction Model Fun-Audio-Chat
Feng Huang Wang·2025-12-23 11:50

Core Insights
- Tongyi released a new end-to-end voice interaction model called Fun-Audio-Chat, which emphasizes "voice-to-voice" interaction, allowing users to hold multi-turn conversations directly through speech [1]
- The model achieved leading results across a range of speech and multimodal benchmarks, surpassing several other models of similar parameter scale, indicating strong capabilities in speech understanding, speech generation, and dialogue coordination [1][2]

Model Features
- Fun-Audio-Chat-8B is part of the Tongyi Bailing voice model family, which previously included speech-to-text and text-to-speech models. Unlike its predecessors, this model focuses on end-to-end voice interaction for applications such as voice chat, emotional companionship, smart-terminal interaction, and voice customer service [1]
- The model employs a two-stage training strategy called Core-Cocktail, which integrates speech and multimodal capabilities while fine-tuning the existing language model's parameters in a way that mitigates "catastrophic forgetting" [2]
- It also incorporates multi-stage, multi-task preference alignment training to improve the model's ability to accurately capture semantic and emotional cues in real voice conversations, making dialogue feel more natural [2]

Efficiency and Practicality
- Fun-Audio-Chat-8B features a dual-resolution end-to-end architecture that compresses audio before autoregressive modeling and decompresses it afterward, reducing the audio frame rate to approximately 5 Hz. This design cuts GPU compute costs by nearly 50% while maintaining speech quality, which matters given the high computational cost of current speech models [2]
- The open-sourcing of Fun-Audio-Chat-8B signals a move toward practical deployment of large speech models in real-world scenarios, emphasizing low compute requirements and strong dialogue capability [2]
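To make the frame-rate point concrete, here is a minimal back-of-the-envelope sketch (not Fun-Audio-Chat's actual code). It shows how lowering the audio frame rate shrinks the token sequence an autoregressive model must process; the 12.5 Hz baseline is an assumed typical speech-token rate for comparison, not a figure from the article, and the actual GPU saving depends on the baseline and on how per-token cost scales with sequence length.

```python
# Hypothetical illustration: sequence-length reduction from a lower
# audio frame rate. Fewer tokens per second of audio means roughly
# proportionally less per-layer compute in an autoregressive model.

def audio_tokens(duration_s: float, frame_rate_hz: float) -> int:
    """Number of audio frames (tokens) for a clip at a given frame rate."""
    return round(duration_s * frame_rate_hz)

BASELINE_HZ = 12.5  # assumed common speech-token rate (not from the article)
REDUCED_HZ = 5.0    # the ~5 Hz rate reported for Fun-Audio-Chat-8B

for duration in (10, 60, 300):  # seconds of audio
    base = audio_tokens(duration, BASELINE_HZ)
    low = audio_tokens(duration, REDUCED_HZ)
    saving = 1 - low / base
    print(f"{duration:>4}s audio: {base} -> {low} tokens ({saving:.0%} fewer)")
```

Under these assumed numbers a 60-second clip drops from 750 to 300 tokens; with a lower baseline rate the reduction lands nearer the ~50% compute saving the article cites.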
