Tongyi Bailing Gets a Major Upgrade: Fun-CosyVoice3 Officially Open-Sourced, Enabling Ultra-Fast Voice Cloning

Core Insights
- The "Tongyi" team has announced significant upgrades to its Fun-CosyVoice3 model, including a 50% reduction in initial latency and a doubling of accuracy on mixed Chinese-English speech [1][2]
- The Fun-CosyVoice3 model is now open source, offering zero-shot voice cloning and supporting local deployment and customization [1]
- The Fun-ASR-Nano model has been introduced and is also open source; its smaller size of 0.8 billion parameters is aimed at lowering inference costs [1]

Group 1
- The Fun-CosyVoice3 model has achieved a 50% reduction in initial latency, enabling real-time applications such as voice assistants and live dubbing [2]
- The word error rate (WER) for mixed Chinese-English speech has decreased by 56.4%, improving accuracy on complex sentences and professional terminology [2]
- The model supports 9 common languages and 18 Chinese dialects, and its cross-lingual voice cloning keeps a cloned voice highly consistent across different languages [2]

Group 2
- The Fun-ASR model has undergone comprehensive upgrades that improve robustness in noisy environments and support multilingual speech recognition [2]
- The first-word latency of the streaming recognition model has been reduced to 160 milliseconds, improving responsiveness in applications [2]
- Fun-ASR has been deployed in various scenarios, including DingTalk's "AI Listening" and video conferencing [2]
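
The 56.4% WER decrease cited above reads most naturally as a relative reduction, not a drop in percentage points. The sketch below shows how WER and a relative reduction are commonly computed, assuming token-level WER based on Levenshtein distance. The function names, the sample sentences, and the 10% baseline WER are illustrative assumptions; they do not come from the announcement or from the Fun-CosyVoice3 evaluation.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(ref_tokens, hyp_tokens):
    """Word error rate: edit operations divided by reference length."""
    return edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


def relative_reduction(old_wer, new_wer):
    """Fraction by which the error rate fell relative to the old value."""
    return (old_wer - new_wer) / old_wer


def apply_relative_reduction(old_wer, reduction):
    """Error rate that remains after a given relative reduction."""
    return old_wer * (1 - reduction)


# Hypothetical mixed Chinese-English transcript pair (not real model output).
reference = "please open the 设置 page now".split()
hypothesis = "please open a 设置 page".split()
sample_wer = wer(reference, hypothesis)

# Hypothetical baseline of 10% WER; only the 56.4% figure is from the article.
baseline_wer = 0.10
upgraded_wer = apply_relative_reduction(baseline_wer, 0.564)
```

Under that hypothetical 10% baseline, a 56.4% relative reduction would leave a WER of about 4.4%, that is, fewer than half as many errors as before.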