Core Viewpoint
- The article covers the launch of MOSS-TTSD, a text-to-speech model that markedly improves the quality of dialogue synthesis, overcoming previous limitations in generating natural-sounding conversational audio [3][5].

Group 1: MOSS-TTSD Overview
- MOSS-TTSD was developed jointly by Shanghai Chuangzhi Academy, Fudan University, and MoSi Intelligent, marking a significant advance in AI podcasting technology [3].
- The model is open-source with unrestricted commercial use permitted, and it generates high-quality dialogue audio directly from a complete multi-speaker script [4][5].

Group 2: Technical Innovations
- MOSS-TTSD is built on the Qwen3-1.7B-base model and trained on roughly 1 million hours of single-speaker audio plus 400,000 hours of dialogue audio, enabling bilingual speech synthesis [13].
- The core innovation is the XY-Tokenizer, which compresses the audio bitrate to 1 kbps while modeling both semantic and acoustic information (see the bitrate sketch below) [15][16].

Group 3: Data Processing and Quality Assurance
- The team built an efficient data processing pipeline to filter high-quality audio from large raw corpora, using an in-house speaker diarization model that outperforms existing solutions [24][27].
- That model achieved Diarization Error Rates (DER) of 9.7 and 14.1 on different benchmark datasets, indicating strong performance in speaker separation (see the DER sketch below) [29].

Group 4: Performance Evaluation
- MOSS-TTSD was evaluated on a high-quality test set of roughly 500 bilingual dialogues, showing clear gains in speaker-switching accuracy and voice similarity over baseline models (see the speaker-similarity sketch below) [31][34].
- Its prosody and naturalness were judged far superior to those of competing models, demonstrating its effectiveness at generating realistic dialogue [35].
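To make the 1 kbps figure for the XY-Tokenizer concrete, here is a minimal back-of-the-envelope sketch of how a token-based speech codec's bitrate is computed. The frame rate, number of codebooks, and codebook size below are illustrative assumptions chosen to land near 1 kbps, not the published XY-Tokenizer configuration.

```python
# Back-of-the-envelope bitrate math for a token-based speech codec.
# The frame rate, codebook count, and codebook size are illustrative
# assumptions, NOT the published XY-Tokenizer configuration.
import math


def codec_bitrate_bps(frame_rate_hz: float, num_codebooks: int, codebook_size: int) -> float:
    """Bits per second = frames/s * codebooks per frame * bits per codebook index."""
    bits_per_index = math.log2(codebook_size)
    return frame_rate_hz * num_codebooks * bits_per_index


if __name__ == "__main__":
    # Example: 12.5 token frames/s, 8 codebooks of 1024 entries each
    # -> 12.5 * 8 * 10 = 1000 bps, i.e. ~1 kbps.
    bps = codec_bitrate_bps(frame_rate_hz=12.5, num_codebooks=8, codebook_size=1024)
    print(f"{bps:.0f} bps (~{bps / 1000:.1f} kbps)")
```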
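The DER numbers cited for the diarization model follow the standard definition: the sum of missed speech, false-alarm speech, and speaker-confusion time divided by total reference speech time. The sketch below illustrates this at the frame level with made-up labels; real evaluations use dedicated scoring tools with forgiveness collars and optimal reference/hypothesis speaker mapping.

```python
# Simplified frame-level Diarization Error Rate (DER) illustration.
# The labels below are made up; this is not the article's evaluation pipeline.

def frame_der(reference: list, hypothesis: list) -> float:
    """DER = (missed speech + false alarms + speaker confusion) / reference speech frames.

    Each list holds one label per frame: a speaker id string, or None for silence.
    Assumes hypothesis speaker ids are already mapped to reference ids.
    """
    ref_speech = sum(1 for r in reference if r is not None)
    missed = sum(1 for r, h in zip(reference, hypothesis) if r is not None and h is None)
    false_alarm = sum(1 for r, h in zip(reference, hypothesis) if r is None and h is not None)
    confusion = sum(
        1 for r, h in zip(reference, hypothesis)
        if r is not None and h is not None and r != h
    )
    return (missed + false_alarm + confusion) / ref_speech


if __name__ == "__main__":
    ref = ["A", "A", "A", None, "B", "B", "B", "B", None, "A"]
    hyp = ["A", "A", "B", None, "B", "B", None, "B", "A", "A"]
    # 1 confusion + 1 miss + 1 false alarm over 8 speech frames -> 37.5%
    print(f"DER = {frame_der(ref, hyp):.1%}")
```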
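The "voice similarity" result in Group 4 is the kind of metric commonly computed as cosine similarity between speaker embeddings of a reference voice and a synthesized utterance. The sketch below shows that comparison with a placeholder embedding extractor and random vectors; it is an assumption about the general technique, not necessarily the exact protocol used in the article's evaluation.

```python
# Illustrative speaker-similarity check via cosine similarity of speaker embeddings.
# The embeddings below are random placeholders standing in for the output of a
# speaker encoder (e.g. an x-vector/ECAPA-style model); not the article's exact setup.
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity in [-1, 1]; higher means the two voices are more similar."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    ref_embedding = rng.normal(size=192)                            # reference speaker
    synth_embedding = ref_embedding + 0.3 * rng.normal(size=192)    # a "close" synthesized voice
    print(f"speaker similarity = {cosine_similarity(ref_embedding, synth_embedding):.3f}")
```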
Qiu Xipeng's (邱锡鹏) team open-sources MOSS-TTSD! Trained on a million hours of audio, it breaks through the uncanny valley of AI podcasts
机器之心·2025-07-05 05:53