通义百聆 (Tongyi Bailing) Gets a Major Upgrade: Fun-CosyVoice3 (0.5B) Is Officially Open-Sourced, Enabling Ultra-Fast Voice Cloning

Core Insights
- The Tongyi team has announced significant upgrades to the Fun-CosyVoice3 model, including a 50% reduction in first-packet latency and an error rate on mixed Chinese-English speech that has been cut by more than half [1][2]
- The Fun-CosyVoice3 model is now open source. It offers zero-shot voice cloning that can synthesize a voice from a 3-second audio reference, and it supports local deployment and customization [1][2]

Group 1
- The Fun-CosyVoice3 upgrade cuts first-packet latency by 50%, enabling real-time applications such as voice assistants and live dubbing [2]
- The word error rate (WER) on mixed Chinese-English speech has fallen by 56.4%, improving accuracy on complex sentences that contain technical terms and mixed-language passages [2]
- The model supports 9 widely used languages and 18 Chinese dialects, and its cross-lingual voice cloning keeps the cloned voice highly consistent across languages [2]

Group 2
- The Fun-ASR model has also been enhanced. It is trained on tens of millions of hours of real speech data and is widely deployed in applications such as DingTalk's "AI Listening" and video conferencing [2]
- Key improvements to Fun-ASR include greater robustness in noisy environments, better handling of speech that mixes languages, and first-word recognition latency reduced to 160 milliseconds [2]
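
To make the 56.4% WER figure for mixed Chinese-English speech concrete, the sketch below shows how a word error rate and a relative reduction are typically computed. It is illustrative only: the article does not describe the team's evaluation setup, so the whitespace tokenization, the function names `word_error_rate` and `relative_reduction`, the sample sentences, and the before/after WER values are all assumptions. Chinese text is often scored per character rather than per word, which this sketch does not attempt.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words.

    Assumption: tokens are separated by whitespace.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(old_wer: float, new_wer: float) -> float:
    """Fraction by which the error rate fell, relative to the old value."""
    if old_wer <= 0:
        raise ValueError("old_wer must be positive")
    return (old_wer - new_wer) / old_wer


# Hypothetical transcript pair: one dropped word and one substituted word.
sample_wer = word_error_rate(
    "please open the 钉钉 meeting notes",
    "please open the meeting note",
)

# Hypothetical before/after values chosen to reproduce a 56.4% relative drop.
drop = relative_reduction(old_wer=0.10, new_wer=0.0436)
print(f"sample WER: {sample_wer:.3f}, relative reduction: {drop:.1%}")
```

A relative reduction of 56.4% means the new error rate is 43.6% of the old one. It does not say what either absolute rate is, and the article gives neither.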