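Industrial-grade, stable, zero-shot singing voice synthesis: Soul App, together with the Geely Automobile Research Institute Artificial Intelligence Center (AIC), Tianjin University, and Northwestern Polytechnical University, open-sources SoulX-Singer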
Jin Rong Jie·2026-02-10 03:02

Core Insights
- The article discusses the slow progress in the Singing Voice Synthesis (SVS) field despite advancements in generative AI within the music industry, highlighting the launch of the SoulX-Singer model as a significant development in this area [1][6][17].

Group 1: SoulX-Singer Model Overview
- SoulX-Singer is an open-source, high-quality zero-shot singing voice synthesis model designed for real-world applications, trained on over 42,000 hours of data covering multiple languages, vocal timbres, and singing styles [1][9][17].
- The model aims to achieve stable, natural, and highly controllable singing voice generation without prior exposure to the singer's voice [7][9].

Group 2: Technical Features
- SoulX-Singer employs a Flow Matching-based generative modeling paradigm, treating singing voice synthesis as an audio infilling task, with a focus on the strong coupling of lyrics, melody, and vocalization [7][8].
- The model incorporates a note-level alignment mechanism to accurately model and independently control the start and end times, pitch, and duration of each note, allowing for flexible adjustments during the generation phase [8][9].

Group 3: Control Mechanisms
- The model supports two control methods for voice synthesis: Music Score (MIDI) driven generation for direct lyric and score-based singing, and Melody driven generation for replicating singing techniques from existing songs [10][11].
- This dual control paradigm enhances flexibility in music production, catering to various creative needs from original compositions to re-creations of existing songs [11].

Group 4: Multilingual Support
- SoulX-Singer currently supports singing voice synthesis in Mandarin, English, and Cantonese, demonstrating consistent quality across different languages and musical styles, which broadens its application in content creation and interactive entertainment [12][17].

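
To make the Technical Features above more concrete, here is a minimal Python sketch of how note-level events, an infilling mask, and a flow-matching training pair could fit together. It is an illustration under stated assumptions, not SoulX-Singer's actual code: the `Note` schema, the function names, the 10 frames-per-second rate, the single scalar feature per frame, and the straight-line interpolation path are all hypothetical choices that the article does not specify.

```python
import random
from dataclasses import dataclass


@dataclass
class Note:
    """One note-level event (hypothetical schema, not SoulX-Singer's format)."""
    lyric: str
    midi_pitch: int  # MIDI note number, 60 = middle C
    start: float     # onset in seconds
    end: float       # offset in seconds


def notes_to_frames(notes, frame_rate=50, total_frames=None):
    """Expand note events into a per-frame MIDI pitch track (0 = rest)."""
    if total_frames is None:
        total_frames = round(max(n.end for n in notes) * frame_rate)
    track = [0] * total_frames
    for n in notes:
        lo = round(n.start * frame_rate)
        hi = min(round(n.end * frame_rate), total_frames)
        for i in range(lo, hi):
            track[i] = n.midi_pitch
    return track


def infill_mask(total_frames, prompt_frames):
    """True marks frames to generate; the leading prompt frames stay given."""
    return [i >= prompt_frames for i in range(total_frames)]


def flow_matching_pair(x1, mask, t, rng):
    """Build one training pair for a straight-line flow-matching path.

    x1 is the target feature sequence (one float per frame, for brevity).
    On masked frames, x_t = (1 - t) * x0 + t * x1 with x0 ~ N(0, 1), and the
    regression target is the constant velocity x1 - x0. Unmasked (prompt)
    frames keep their clean value and get a zero target.
    """
    x_t, v = [], []
    for target, generate in zip(x1, mask):
        if generate:
            x0 = rng.gauss(0.0, 1.0)
            x_t.append((1 - t) * x0 + t * target)
            v.append(target - x0)
        else:
            x_t.append(target)
            v.append(0.0)
    return x_t, v


# Toy usage: two half-second notes, the first 3 frames act as the prompt.
notes = [Note("la", 60, 0.0, 0.5), Note("la", 62, 0.5, 1.0)]
pitch = notes_to_frames(notes, frame_rate=10)
mask = infill_mask(len(pitch), prompt_frames=3)
x1 = [p / 127 for p in pitch]  # stand-in for real acoustic features
t = 0.25
x_t, v = flow_matching_pair(x1, mask, t, random.Random(0))
```

In this sketch, editing a note's `start`, `end`, or `midi_pitch` changes only the corresponding frames of the conditioning track, which is one plausible reading of the per-note control described above. A real system would condition a neural network on the pitch track, the lyrics, and the prompt audio, and train it to predict `v` from `x_t` and `t`.
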
Group 5: Performance Evaluation
- The model has been systematically evaluated on tasks such as zero-shot singing voice synthesis and cross-language synthesis, showing superior performance in clarity, singer similarity, pitch consistency, and overall synthesis quality compared to previous models [15][17].
- In subjective listening tests, SoulX-Singer achieved notable advantages over existing solutions, reinforcing its robustness and usability in real-world scenarios [15][17].