Audio Foundation Models
A16z Leads $41M Seed Round in Mirelo, a Major Bet on European Audio Foundation Models
深思SenseAI· 2025-12-27 01:11
Seb Johnson: Hi everyone, welcome back to "Scaling Europe." I'm Seb Johnson, and I'm here with CJ Simon-Gabriel, one of the co-founders of Mirelo AI. Mirelo AI just announced a remarkable $41 million seed round, led by A16z and Index Ventures. That's a large raise, backed by some genuinely top-tier VCs. What I find especially interesting is that you're building a "foundation model" in Europe. For those who don't know you yet, could you give a quick introduction to Mirelo AI?

CJ: Thanks for having me. We focus on audio for video content and games, so right now that's mainly music and sound effects. The idea is simple: you give us your video, we tell you which sounds belong where, and we generate the audio for you. You can generate sound effects, and you can also add music.

Seb Johnson: Why did you decide to build this business?

Over the past year, AI video generation has iterated rapidly in both model capability and product form: the marginal cost of producing video keeps falling, while generation speed and controllability have improved markedly. Many AI creators today know the experience: the visuals are done in minutes, and what really causes headaches is everything that comes after, the sound effects, the music, the pacing, the atmo ...
Surprisingly, the Most Thorough Open-Sourcer of Audio Foundation Models Is Xiaohongshu
机器之心· 2025-09-17 09:37
Core Viewpoint
- The article highlights the recent surge in open-source AI models in the audio domain, particularly by domestic companies in China, with a focus on Xiaohongshu's advances in high-quality audio models and its effort to foster an open-source community [1][4][22].

Summary by Sections

Open Source Trends
- In recent months, open source has become a focal point in the AI community, especially among Chinese tech companies, with 33 and 31 models open-sourced in July and August respectively [1].
- Most of these efforts concentrate on text, image, video, reasoning, and world models, while audio generation remains a smaller segment [1][2].

Xiaohongshu's Contributions
- Xiaohongshu has kept a steady rhythm of open-sourcing audio technologies since last year, releasing models such as FireRedTTS for text-to-speech (TTS) and FireRedASR for automatic speech recognition (ASR), achieving state-of-the-art (SOTA) results [3][4].
- Open-sourcing high-quality audio models enhances Xiaohongshu's technical influence and signals a long-term strategic commitment to open-source development [4][22].

Technical Achievements
- Xiaohongshu's FireRedTTS model supports flexible voice synthesis, imitating a variety of speaking styles with minimal training [6][9].
- FireRedASR achieved a character error rate (CER) of 3.05%, outperforming other closed-source models [7][8].
- The new FireRedTTS-2 model addresses long-standing challenges in voice synthesis, providing superior solutions for long-dialogue synthesis and industry-leading performance in audio scene modeling [9][11].

Ecosystem Development
- Xiaohongshu aims to build a comprehensive open-source community around audio models, covering TTS, ASR, and voice dialogue systems, thereby lowering industry entry barriers and fostering innovation [22][23].
- The introduction of FireRedChat, a fully open-source duplex voice dialogue system, represents a significant advance, giving developers a complete solution for building their own voice assistants [17][22].

Future Plans
- Xiaohongshu plans to release additional models, including FireRedMusic and FireRedASR-2, to extend its audio technology stack and support a broader range of applications [22][26].
- The company aims to establish itself as a leader in the open-source audio domain, focusing on industrial-grade, commercially viable models [23][26].

Industry Impact
- The article emphasizes that open-source initiatives are reshaping the AI landscape, making advanced capabilities accessible to a wider audience and fostering a collaborative environment for innovation [25][26].
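The character error rate cited for FireRedASR (3.05%) is the standard ASR metric: edit distance between the recognized text and the reference transcript, divided by the reference length. A minimal sketch of how CER is computed in general, using a plain Levenshtein dynamic program (illustrative only, not FireRedASR's evaluation code):

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Edit distance between two character sequences (classic DP, row by row)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion from ref
                            curr[j - 1] + 1,      # insertion into hyp
                            prev[j - 1] + cost))  # substitution (or match)
        prev = curr
    return prev[-1]

def cer(ref: str, hyp: str) -> float:
    """Character error rate: edits normalized by reference length."""
    return levenshtein(ref, hyp) / max(len(ref), 1)

# One substitution over an 11-character reference:
print(round(cer("hello world", "hallo world"), 3))  # 0.091
```

For Chinese ASR, CER rather than word error rate is the norm, since characters are the natural unit; a 3.05% CER means roughly 3 character-level edits per 100 reference characters.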
AI Collectively "Fails to Understand": MMAR Benchmark Exposes Major Weaknesses in Audio Foundation Models
量子位· 2025-06-09 05:24
Core Viewpoint
- The MMAR benchmark reveals that most AI models struggle significantly with complex audio reasoning tasks, indicating a gap in their practical applicability to real-world scenarios [1][9][18].

Summary by Sections

MMAR Benchmark Overview
- MMAR stands for "A Challenging Benchmark for Deep Reasoning in Speech, Audio, Music, and Their Mix," consisting of 1,000 high-quality audio understanding questions that require multi-step reasoning [2][3].

Difficulty of MMAR
- The benchmark assesses several reasoning levels (signal, perception, semantic, and cultural understanding), with tasks demanding complex reasoning skills and domain-specific knowledge [6][9].

Model Performance
- A total of 30 audio-related models were tested. The best open-source model, Qwen-2.5-Omni, achieved an average accuracy of only 56.7%, while the closed-source Gemini 2.0 Flash led with 65.6% [11][18].
- Most open-source models performed close to random guessing, particularly on music-related tasks, highlighting significant difficulty in recognizing deeper audio information [12][18].

Error Analysis
- The models' errors broke down into perceptual errors (37%), reasoning errors (20%), knowledge gaps (9%), and other errors (34%), indicating that current AI models face both auditory and cognitive challenges [19].

Future Outlook
- The research calls for collaboration on data and algorithmic innovation to improve audio reasoning in AI, with the hope of future models that can truly understand audio content and context [20][21].
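The numbers above combine two simple aggregations: per-question correctness averaged into an accuracy score, and labeled failure cases tallied into a percentage breakdown. A hedged sketch of that bookkeeping (the data here is illustrative toy data, not the actual MMAR results):

```python
from collections import Counter

def average_accuracy(results: list[bool]) -> float:
    """Fraction of questions answered correctly (e.g. 0.567 for Qwen-2.5-Omni)."""
    return sum(results) / len(results)

def error_breakdown(error_labels: list[str]) -> dict[str, float]:
    """Percentage share of each error category among the incorrectly answered questions."""
    counts = Counter(error_labels)
    total = sum(counts.values())
    return {label: round(100 * n / total, 1) for label, n in counts.items()}

# Toy data: 5 questions, 2 wrong, each wrong answer tagged with an error category.
answers = [True, False, True, True, False]
errors = ["perceptual", "reasoning"]

print(average_accuracy(answers))   # 0.6
print(error_breakdown(errors))     # {'perceptual': 50.0, 'reasoning': 50.0}
```

Note that the MMAR error percentages (37% perceptual, 20% reasoning, 9% knowledge, 34% other) are shares of the failures, not of all questions, which is exactly what the breakdown above computes.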