Li Mu is back on Bilibili! Teaching you to hand-build a speech large model, with fully open-source code and an online demo
QbitAI (量子位) · 2025-07-23 06:36
Core Insights
- The article discusses the return of Li Mu and his new audio model, Higgs Audio V2, which integrates text and speech processing capabilities [1][2].

Group 1: Model Capabilities
- Higgs Audio V2 handles a range of speech tasks, including multilingual dialogue generation, automatic prosody adjustment, melody humming with cloned voices, and simultaneous generation of speech and background music [3][4].
- The model folds 10 million hours of speech data into the training of a large language model (LLM), enabling it to both understand and generate speech [4][6].

Group 2: Technical Implementation
- The model combines traditional text and speech modeling, letting the LLM communicate in speech by converting speech tasks into a unified processing format [7][8].
- A unified discretized audio tokenizer was developed to preserve audio quality while capturing both semantic and acoustic features, at a rate of 25 frames per second [11][13].
- Training data was sourced from various platforms; roughly 90% of the raw data was filtered out for quality to arrive at the 10-million-hour corpus [14][15].

Group 3: Model Training and Architecture
- To strengthen the model's understanding and generation of sound, a secondary audio model, AudioVerse, was trained to analyze user speech input and supply contextual information to the main model [16].
- The final multimodal model can perform complex tasks, such as writing and singing a song with accompaniment, and can analyze scenes and characters from audio input [17][18].

Group 4: Performance Metrics
- In real-time voice chat, the model achieves low latency and can understand and express emotions, outperforming other models in the emotion and question categories with scores of 75.7% and 55.7%, respectively [19].
- The model also led traditional TTS benchmarks, achieving the best results across multiple evaluations [20].
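The "unified processing format" described in Group 2 can be illustrated with a small sketch: speech is discretized into audio tokens (at the article's stated 25 tokens per second) and interleaved with text tokens in a single LLM sequence. All names below are hypothetical stand-ins, not the project's actual API.

```python
# Sketch of text/audio token interleaving for a unified LLM sequence.
# Assumptions (hypothetical, not Higgs Audio's real interface): a tokenizer
# emitting 25 discrete codes per second and special audio marker tokens.

FRAME_RATE = 25  # audio tokens per second, per the article

def audio_to_tokens(num_seconds: float) -> list[str]:
    """Stand-in for a discretized audio tokenizer: one code per frame."""
    num_frames = int(num_seconds * FRAME_RATE)
    return [f"<a{i % 1024}>" for i in range(num_frames)]  # fake codebook ids

def build_sequence(text_tokens: list[str], audio_seconds: float) -> list[str]:
    """Append audio codes, wrapped in marker tokens, after the text prompt."""
    return (text_tokens
            + ["<audio_start>"]
            + audio_to_tokens(audio_seconds)
            + ["<audio_end>"])

seq = build_sequence(["Say", "hello"], audio_seconds=2.0)
print(len(seq))  # 2 text tokens + 2 markers + 50 audio tokens = 54
```

Because the audio ends up as ordinary discrete tokens in the same sequence, the LLM can be trained on speech with the same next-token objective it uses for text.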
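A back-of-the-envelope calculation shows the scale implied by Group 2's numbers: 10 million hours of speech at 25 tokens per second yields on the order of a trillion audio tokens for LLM training.

```python
# How many discrete audio tokens does 10 million hours of speech yield
# at the article's stated rate of 25 frames (tokens) per second?

FRAMES_PER_SECOND = 25
HOURS = 10_000_000

tokens = HOURS * 3600 * FRAMES_PER_SECOND
print(f"{tokens:,}")  # 900,000,000,000 -> roughly 0.9 trillion audio tokens
```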
Group 5: Accessibility and Community Engagement
- The model's code is publicly available on GitHub, alongside an online demo platform where users can experiment [23][31].
- The article encourages users, especially those interested in creating content such as virtual streamers, to try the model for voice cloning and other applications [25].