Qwen3-TTS Full Model Family Open-Sourced and Released

Core Insights
- Qwen3-TTS is an open-source voice generation model developed by Qwen. It comes in two sizes: 1.7B for maximum performance and control, and 0.6B for a balance of performance and efficiency. It supports 10 major languages and various dialects [1]
- The model supports voice cloning, voice creation, and high-quality, human-like voice generation, and it accepts natural-language instructions to flexibly control acoustic attributes such as tone, emotion, and rhythm [1]
- It uses an innovative Dual-Track hybrid streaming generation architecture that supports both streaming and non-streaming generation, with end-to-end synthesis latency as low as 97 ms, which suits real-time interaction [1]

Performance Metrics
- Qwen3-TTS-VoiceDesign surpasses MiniMax-Voice-Design and open-source models in instruction-following capability and expressiveness on InstructTTS-Eval [2]
- Qwen3-TTS-Instruct demonstrates single-speaker multilingual generalization with an average word error rate of 2.34%, maintains a style-control score of 75.4% on InstructTTS-Eval, and excels at long speech generation, with word error rates of 2.36% for Chinese and 2.81% for English on 10-minute speech [2]
- Qwen3-TTS-VoiceClone outperforms MiniMax and ElevenLabs on stability for Chinese and English cloning, on average word error rate across multilingual test sets, and on speaker similarity [2]
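
For readers unfamiliar with the word error rate figures quoted above, the metric is conventionally the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and a recognized transcript, divided by the number of reference words. The sketch below is only an illustration of that standard definition, not the evaluation pipeline behind the reported numbers: the function name `word_error_rate` is hypothetical, whitespace tokenization is an assumption (Chinese text would need a different tokenization), and the source does not describe which speech recognizer or text normalization was used.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: words are separated by whitespace; no text normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of four reference words.
    print(word_error_rate("a b c d", "a x c d"))
```

Under this definition, a reported rate of 2.34% corresponds to roughly two or three word errors per hundred reference words, averaged over the test set.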