Xiaohongshu Intelligent Creation Audio Technology Team: SOTA Dialogue Generation Model FireRedTTS-2 Is Here, Making AI Podcasts Easy!

Core Insights
- The article covers the launch of FireRedTTS-2, a new conversational speech synthesis model developed by Xiaohongshu's audio technology team. It targets known problems in dialogue synthesis: poor flexibility, frequent pronunciation errors, unstable speaker switching, and unnatural prosody [2][24].

Group 1: Model Features and Improvements
- FireRedTTS-2 upgrades two core modules of the TTS system, the discrete speech encoder and the text-to-speech model, to improve synthesis quality and flexibility [11][24].
- The discrete speech encoder operates at a low frame rate of 12.5Hz and compresses continuous speech signals into discrete label sequences, which shortens the speech sequences and improves processing speed [14][16].
- The text-to-speech model supports sentence-by-sentence generation, which makes output easier to edit and adapt to different scenarios, and uses a "dual Transformer" architecture to generate more natural and coherent dialogue [17][18].

Group 2: Performance Evaluation
- FireRedTTS-2 outperforms other systems such as MoonCast, ZipVoice-Dialogue, and MOSS-TTSD on both subjective and objective metrics, significantly reducing pronunciation errors and improving prosody [20][24].
- In subjective evaluations, 28% of samples were rated as more natural than real podcast recordings, and 56% of samples reached a naturalness level that meets or exceeds real recordings [22][24].

Group 3: Application and Future Prospects
- The model supports multiple languages, including Chinese, English, Japanese, Korean, and French, making it a versatile tool for generating high-quality audio data for a range of applications [7][24].
- Future development will focus on expanding the number of supported speakers and languages, as well as introducing controllable sound effects [25].
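
To make the 12.5Hz frame rate mentioned under Model Features more concrete, the short sketch below estimates how many discrete frames cover a clip of a given length. Only the 12.5Hz figure comes from the article; the function name `frames_for_duration` and the 50Hz comparison rate are hypothetical and chosen purely for illustration. It is not FireRedTTS-2 code, and if the encoder emits several labels per frame, the actual sequence length would be a multiple of these counts.

```python
import math


def frames_for_duration(seconds: float, frame_rate_hz: float = 12.5) -> int:
    """Return the number of discrete frames needed to cover `seconds` of audio.

    Illustrative helper only: 12.5 Hz is the rate quoted in the article,
    everything else here is an assumption for the sake of the example.
    """
    return math.ceil(seconds * frame_rate_hz)


if __name__ == "__main__":
    one_minute = 60.0
    # 12.5 Hz is from the article; 50 Hz is a hypothetical comparison point.
    for label, rate in [("12.5 Hz (article)", 12.5), ("50 Hz (hypothetical)", 50.0)]:
        print(f"{label}: {frames_for_duration(one_minute, rate)} frames per minute")
```

Under these assumptions, one minute of speech corresponds to 750 frames at 12.5Hz, against 3000 frames at the hypothetical 50Hz rate, which is the kind of sequence shortening the article credits for faster processing.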