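Surpassing OpenAI and ElevenLabs, MiniMax's New-Generation Speech Model Tops the Leaderboards! The Era of Personified Voice Has Arrived
机器之心 (Jiqizhixin) · 2025-05-15 06:04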

Core Viewpoint
- Chinese domestic large models have advanced faster than expected: MiniMax's new text-to-speech (TTS) model, "Speech-02", has taken top rankings in international speech evaluations, outperforming major competitors such as OpenAI and ElevenLabs [1][7][20].

Group 1: Model Performance
- Speech-02 achieves state-of-the-art (SOTA) results on key voice cloning metrics, namely Word Error Rate (WER) and speaker similarity [1][20].
- The model costs only one-fourth as much as ElevenLabs' competing model, giving it a strong cost-performance ratio [4].
- Speech-02 performs especially well in zero-shot voice cloning, needing only a short audio sample to replicate a speaker's voice without additional training [12][14].

Group 2: Technical Innovations
- The model is built on an autoregressive Transformer architecture and introduces two major innovations: zero-shot voice cloning and a new Flow-VAE architecture [12][13].
- The zero-shot capability lets the model generate a highly similar target voice from a reference audio clip alone, without needing a text transcript of that clip, which significantly reduces the time and data needed to clone a new voice [14].
- The Flow-VAE architecture strengthens the representation of voice features, improving the overall quality and similarity of the generated audio [17].

Group 3: Multilingual and Cross-Language Capabilities
- Speech-02 supports 32 languages and is particularly strong in Chinese, English, Cantonese, Portuguese, and French [38].
- The model outperforms ElevenLabs' multilingual_v2 in both WER and speaker similarity across multiple languages, showing that it can handle complex tonal systems and diverse phonetic inventories [25][26].
- In cross-language tests, Speech-02 shows lower WER and higher similarity scores, validating the effectiveness of its speaker encoder architecture [28].
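The two metrics cited above, WER and speaker similarity, can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not MiniMax's evaluation pipeline: the article does not describe how the scores were produced. The sketch assumes that the hypothesis text comes from transcribing the synthesized audio with some speech recognizer, and that speaker similarity is the cosine similarity between two speaker-embedding vectors produced by some speaker-verification model. The function names `word_error_rate` and `cosine_similarity` are hypothetical helpers introduced here.

```python
import math


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of words in the reference."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


def cosine_similarity(a, b) -> float:
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("vectors must be non-zero")
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Assumed inputs: a target script and a recognizer's transcript of the
    # synthesized audio, plus two toy (made-up) speaker embeddings.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
    print(cosine_similarity([0.2, 0.5, 0.1], [0.25, 0.45, 0.05]))
```

In the first call, one word is missing from a six-word reference, so the WER is 1/6. Lower WER means the synthesized speech is more intelligible, while a cosine similarity closer to 1 means the cloned voice is closer to the reference speaker.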
Group 4: User Experience and Personalization
- The model features a "voice reference" function that allows users to clone voices by providing a short audio sample, enhancing personalization [34].
- Users can control the emotional tone of the generated speech, with options for various emotions such as sadness, happiness, and anger [37].
- Speech-02 enables seamless switching between different languages and styles, providing a highly flexible and interactive user experience [41].

Group 5: Market Position and Future Prospects
- MiniMax, established in 2021, focuses on AI products for both consumer and business markets, emphasizing a "model as product" philosophy [44].
- The company is exploring various applications for its voice model, including voice assistants and educational tools, aiming to enhance the efficiency and personalization of smart voice content creation [44].
- The advancements in Speech-02 position MiniMax at the forefront of the voice AI industry, indicating a critical turning point towards large-scale application of voice model technology [44].