Tongyi Large Model Releases New-Generation End-to-End Voice Interaction Model
Bei Jing Shang Bao·2025-12-23 13:02

Core Viewpoint
- The official release of the Fun-Audio-Chat model by Tongyi Large Model marks a significant advance in AI voice interaction, with the model able to understand speech, perceive emotion, and carry out tasks effectively [1]

Technical Performance
- The new model uses an end-to-end speech-to-speech (S2S) architecture that generates voice output directly from voice input, removing the need for a pipeline of separate ASR, LLM, and TTS modules and thereby achieving higher efficiency and lower latency [1]
- The shared LLM layer runs at a 5 Hz frame rate for efficient processing, while the SRH generates high-quality speech at a 25 Hz frame rate, cutting GPU compute costs by nearly 50% [1]
- Training content covers audio understanding, spoken Q&A, emotion recognition, and tool invocation, making the model more practical and applicable in real-world scenarios [1]
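The dual frame-rate design above can be illustrated with some rough arithmetic. The sketch below is purely illustrative: the 5 Hz and 25 Hz rates come from the article, but the cost model (counting model steps per second of audio) is a hypothetical simplification, not Fun-Audio-Chat's actual implementation.

```python
# Illustrative only: why running the heavy shared LLM at a low frame rate,
# while a lighter speech head (the article's "SRH") runs at a higher rate,
# can cut compute. The step-counting cost model here is an assumption.

LLM_HZ = 5    # shared LLM frame rate (from the article)
SRH_HZ = 25   # SRH frame rate (from the article)

def frames(seconds: float, hz: int) -> int:
    """Number of model steps needed to cover `seconds` of audio at `hz`."""
    return int(seconds * hz)

duration = 10.0  # a hypothetical 10-second utterance

llm_steps = frames(duration, LLM_HZ)       # 50 steps through the heavy LLM
srh_steps = frames(duration, SRH_HZ)       # 250 steps through the light head

# A single-rate baseline would push the heavy LLM through every 25 Hz frame,
# i.e. 5x as many expensive steps for the same audio:
baseline_llm_steps = frames(duration, SRH_HZ)

print(llm_steps, srh_steps, baseline_llm_steps)  # 50 250 250
```

The point of the split is that the expensive component touches far fewer frames per second of audio, while fine-grained speech quality is preserved by the cheaper high-rate head.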