Voice AI

July 12: Meta acquires voice AI startup PlayAI.
news flash· 2025-07-11 23:04
Group 1
- Meta has acquired the voice AI startup PlayAI [1]
- This acquisition indicates Meta's continued investment in artificial intelligence technologies [1]
- The move is part of Meta's strategy to enhance its capabilities in voice recognition and AI-driven applications [1]
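First authoritative survey to comprehensively trace the development of speech language models accepted to the ACL 2025 main conference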
机器之心· 2025-06-17 04:50
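Imagine an AI that could hold a spoken conversation as naturally as a person, without the cumbersome traditional pipeline of speech-to-text (ASR), text LLM processing (LLM), and text-to-speech (TTS), and that instead understood and generated speech directly. What would that experience be like? This is the core problem that speech language models (SpeechLMs) set out to solve.

Traditional voice interaction systems have three major pain points: information loss, high latency, and error accumulation. When speech is converted to text, paralinguistic information such as pitch, tone, and emotion is lost entirely; chaining multiple modules produces a noticeable response delay; and the errors of each stage compound, degrading the final result.

SpeechLMs change this picture fundamentally. They process speech end to end, preserving the rich information carried by speech while substantially reducing latency, which paves the way for truly natural human-machine voice interaction.

First author of the paper: Wenqian Cui (崔文谦), a PhD student at The Chinese University of Hong Kong, whose research covers speech language models, multimodal large models, and AI music generation.

The survey paper "Recent Advances in Speech Language Models: A Survey", written by a team at The Chinese University of Hong Kong, has been accepted to the ACL 2025 main conference. It is the first comprehensive, systematic survey of the field and sets out directions for the future development of voice AI.

arXiv link: https: ...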
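The latency and error-accumulation argument against the cascaded ASR, LLM, TTS pipeline can be illustrated with a small back-of-the-envelope sketch. The per-stage latencies and accuracies below are hypothetical placeholders, not figures from the survey, and treating stage errors as independent is a simplifying assumption.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Stage:
    name: str
    latency_ms: float  # hypothetical per-stage latency
    accuracy: float    # hypothetical probability that the stage output is correct


def cascade_latency_ms(stages: Sequence[Stage]) -> float:
    """Stages run one after another, so their latencies add up."""
    return sum(stage.latency_ms for stage in stages)


def cascade_accuracy(stages: Sequence[Stage]) -> float:
    """Assuming independent errors, per-stage accuracies multiply."""
    accuracy = 1.0
    for stage in stages:
        accuracy *= stage.accuracy
    return accuracy


# Illustrative numbers only; they are not measurements of any real system.
CASCADE = [
    Stage("ASR", latency_ms=300.0, accuracy=0.95),
    Stage("LLM", latency_ms=500.0, accuracy=0.90),
    Stage("TTS", latency_ms=200.0, accuracy=0.98),
]

if __name__ == "__main__":
    print(f"end-to-end latency: {cascade_latency_ms(CASCADE):.0f} ms")
    print(f"end-to-end accuracy: {cascade_accuracy(CASCADE):.4f}")
```

Even when every stage is individually accurate, the chained result is less accurate than its weakest stage and slower than its slowest one. The sketch does not model the third pain point, the loss of paralinguistic information once speech is reduced to text.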
Surpassing OpenAI and ElevenLabs, MiniMax's new-generation speech model sweeps the leaderboards! The era of personalized voices has arrived
机器之心· 2025-05-15 06:04
Core Viewpoint - China's domestic large models are advancing faster than expected: MiniMax's new TTS model "Speech-02" has taken the top rankings in international voice evaluations, outperforming major competitors such as OpenAI and ElevenLabs [1][7][20].

Group 1: Model Performance
- Speech-02 has achieved state-of-the-art (SOTA) results on key voice cloning metrics such as Word Error Rate (WER) and speaker similarity [1][20].
- The model costs only one-fourth as much as ElevenLabs' competing model, indicating a strong cost-performance ratio [4].
- Speech-02 demonstrates superior performance in zero-shot voice cloning, requiring only a short audio sample to replicate a speaker's voice without additional training [12][14].

Group 2: Technical Innovations
- The model employs an autoregressive Transformer architecture and introduces two major innovations: zero-shot voice cloning and a new Flow-VAE architecture [12][13].
- The zero-shot capability allows the model to generate highly similar target voices from a reference audio without needing text prompts, significantly reducing the time and data required for training [14].
- The Flow-VAE architecture enhances the representation of voice features, improving the overall quality and similarity of the generated audio [17].

Group 3: Multilingual and Cross-Language Capabilities
- Speech-02 supports 32 languages and excels particularly in Chinese, English, Cantonese, Portuguese, and French [38].
- The model outperforms ElevenLabs' multilingual_v2 in both WER and speaker similarity across multiple languages, demonstrating its ability to handle complex tonal systems and diverse phonetic inventories [25][26].
- In cross-language tests, Speech-02 shows lower WER and higher similarity scores, validating the effectiveness of its speaker encoder architecture [28].

Group 4: User Experience and Personalization
- The model features a "voice reference" function that allows users to clone voices by providing a short audio sample, enhancing personalization [34].
- Users can control the emotional tone of the generated speech, with options for various emotions such as sadness, happiness, and anger [37].
- Speech-02 enables seamless switching between different languages and styles, providing a highly flexible and interactive user experience [41].

Group 5: Market Position and Future Prospects
- MiniMax, established in 2021, focuses on AI products for both consumer and business markets, emphasizing a "model as product" philosophy [44].
- The company is exploring various applications for its voice model, including voice assistants and educational tools, aiming to enhance the efficiency and personalization of smart voice content creation [44].
- The advancements in Speech-02 position MiniMax at the forefront of the voice AI industry, indicating a critical turning point towards large-scale application of voice model technology [44].
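The two metrics named in Group 1 have standard textbook definitions, sketched below. This is a minimal illustration and not MiniMax's evaluation code: in TTS benchmarks the hypothesis text usually comes from running an ASR system on the synthesized audio, and the speaker vectors come from a speaker-verification model, neither of which is shown here. The function names are hypothetical.

```python
import math
from typing import Sequence


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two speaker embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # One substitution ("sat" -> "sit") and one deletion ("the") over 6 words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Lower WER means the synthesized speech is more intelligible to the recognizer, while a cosine similarity closer to 1 means the cloned voice sits closer to the reference speaker in embedding space.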
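Flash: AI speech model built by two undergraduates in three months challenges Google's NotebookLM, generating natural dialogue with 1.6 billion parameters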
Z Potentials· 2025-04-23 03:49
Core Insights - The article discusses the emergence of a new AI speech model called Dia, developed by Nari Labs, which aims to rival Google's NotebookLM in generating podcast-style audio clips [1][2].

Group 1: Market Potential and Investment
- The market for synthetic voice tools is substantial and continues to grow, with ElevenLabs being a major player alongside challengers like PlayAI and Sesame [1].
- According to PitchBook, startups developing voice AI technology raised over $398 million in venture capital last year [2].

Group 2: Technical Aspects of Dia
- Dia has 1.6 billion parameters and can generate dialogue from scripts, allowing users to customize the speaker's tone and insert non-verbal cues like coughs and laughter [2][3].
- The model can be accessed via AI development platforms like Hugging Face and GitHub, and it runs on modern PCs with at least 10GB of VRAM [3].

Group 3: Ethical Concerns and Future Plans
- Dia lacks protective measures against misuse, making it easy to create false information or fraudulent recordings [4].
- Nari Labs has not disclosed the data sources used for training Dia, raising concerns about potential copyright infringement [5].
- The company plans to create a synthetic voice platform with social features and intends to release a technical report on Dia, expanding support to languages beyond English [5].
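To put the 1.6 billion parameter count next to the 10GB VRAM figure, a rough weights-only estimate can be sketched as follows. The article does not state which numeric precision Dia uses, so the dtype table is an assumption, and the estimate leaves out activations, caches, and any audio codec, which is why it falls below the quoted requirement.

```python
# Bytes per parameter for common precisions (assumed; not stated in the article).
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2}

DIA_PARAMS = 1_600_000_000  # 1.6 billion parameters, as reported


def weight_memory_gb(num_params: int, dtype: str) -> float:
    """Memory needed for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


if __name__ == "__main__":
    for dtype in BYTES_PER_PARAM:
        print(f"{dtype}: {weight_memory_gb(DIA_PARAMS, dtype):.1f} GB")
```

Under these assumptions the weights take about 6.4 GB in 32-bit precision and about 3.2 GB in 16-bit precision, leaving the remainder of the quoted 10GB for runtime overhead.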