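Bilibili builds its own AI dubbing! A pure American-English version of Empresses in the Palace (甄嬛传) has surfaced, so no more learning English on Xiaohongshu (Doge)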
BILIBILI (US:BILI) 量子位 · 2025-07-14 09:08

Core Viewpoint
- The article covers IndexTTS2, a new text-to-speech (TTS) model developed by Bilibili that gives precise control over both the duration and the emotional expression of generated speech [6][11][33].

Group 1: Technology Features
- IndexTTS2 can reproduce the original speaker's tone and emotion while keeping the dubbed audio in sync with lip movements [3][11].
- The model supports two generation modes: one takes an explicit token count for precise duration control, and the other generates speech freely while preserving the rhythmic features of the prompt [12][16].
- It decouples timbre from emotional expression, so separate audio prompts can serve as the timbre reference and the emotion reference [19][20].

Group 2: Performance Evaluation
- IndexTTS2 achieved state-of-the-art (SOTA) results across several tests, with a word error rate (WER) of only 1.883%; its emotional performance metrics also reached SOTA levels [22][24].
- On the AIShell-1 test, IndexTTS2 was only 0.004 behind the ground truth in speaker similarity (SS) and 0.038% better than the previous version [23].
- In duration control, the token count error stayed below 0.02% [25].

Group 3: Model Architecture
- IndexTTS2 consists of three core modules: Text-to-Semantic (T2S), Semantic-to-Mel (S2M), and a vocoder [38].
- The model introduces innovations in duration and emotional control, using a conditioning mechanism to extract emotional features from style prompts [40][41].
- The S2M module improves speech stability by integrating GPT latent representations, which addresses the loss of clarity in emotional speech synthesis [44][46].

Group 4: Industry Implications
- Bilibili is reportedly accelerating its video podcast strategy, which may integrate the capabilities of IndexTTS2 [47][49].
- IndexTTS2 could be part of a broader initiative referred to as "Project H," aimed at enhancing AI-driven content creation [50].
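
The two headline metrics in the evaluation, word error rate and token count error, can be illustrated with a short sketch. This is not the evaluation code used for IndexTTS2: the function names `word_error_rate` and `token_count_error` are hypothetical, WER is computed in the standard way as word-level edit distance divided by the number of reference words, and the token count error is assumed here to be the relative difference between the requested and the generated number of tokens.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Standard WER: word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words to match an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


def token_count_error(requested: int, generated: int) -> float:
    """Assumed definition: relative gap between requested and generated tokens."""
    if requested <= 0:
        raise ValueError("requested token count must be positive")
    return abs(generated - requested) / requested


if __name__ == "__main__":
    # One substituted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
    # One extra token relative to a request of 500.
    print(token_count_error(500, 501))
```

In practice the hypothesis text would come from running a speech recognizer on the synthesized audio, and for Chinese test sets the same computation is usually applied per character rather than per word.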