Surprisingly, the Most Thoroughly Open-Sourced Audio Models Come from Xiaohongshu
机器之心· 2025-09-17 09:37
Core Viewpoint
- The article highlights the recent surge of open-source AI models in the audio domain, particularly from Chinese companies, focusing on Xiaohongshu's progress in building high-quality audio models and fostering an open-source community [1][4][22].

Summary by Sections

Open Source Trends
- In recent months, open source has become a focal point in the AI community, especially among Chinese tech companies, with 33 and 31 models open-sourced in July and August respectively [1].
- Most of these open-source efforts concentrate on text, image, video, reasoning, and world models, while audio generation remains a smaller segment [1][2].

Xiaohongshu's Contributions
- Xiaohongshu has kept a steady rhythm of open-sourcing audio technologies since last year, releasing models such as FireRedTTS for text-to-speech (TTS) and FireRedASR for automatic speech recognition (ASR), achieving state-of-the-art (SOTA) results [3][4].
- Open-sourcing high-quality audio models strengthens Xiaohongshu's technical influence and signals a long-term strategic commitment to open-source development [4][22].

Technical Achievements
- Xiaohongshu's FireRedTTS model enables flexible voice synthesis, imitating a variety of speaking styles with minimal training data [6][9].
- FireRedASR achieved a character error rate (CER) of 3.05%, outperforming competing closed-source models [7][8].
- The new FireRedTTS-2 model addresses remaining challenges in voice synthesis, offering stronger solutions for long-dialogue synthesis and industry-leading performance in audio scene modeling [9][11].

Ecosystem Development
- Xiaohongshu aims to build a comprehensive open-source community around audio models, covering TTS, ASR, and voice dialogue systems, thereby lowering industry entry barriers and fostering innovation [22][23].
- The introduction of FireRedChat, a fully open-source duplex voice dialogue system, is a significant advance, giving developers a complete solution for building their own voice assistants [17][22].

Future Plans
- Xiaohongshu plans to release additional models, including FireRedMusic and FireRedASR-2, to extend its audio technology stack and support a broader range of applications [22][26].
- The company is committed to establishing itself as a leader in the open-source audio domain, focusing on industrial-grade, commercially viable models [23][26].

Industry Impact
- The article emphasizes that open-source initiatives are reshaping the AI landscape, making advanced capabilities accessible to a wider audience and fostering a collaborative environment for innovation [25][26].
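The 3.05% CER figure cited for FireRedASR is, by the standard definition, the character-level edit distance between the model's transcript and the reference, divided by the reference length. A minimal sketch of that metric (illustrative only, not the evaluation code used by Xiaohongshu):

```python
def edit_distance(ref: str, hyp: str) -> int:
    # Classic single-row dynamic-programming Levenshtein distance over characters.
    m, n = len(ref), len(hyp)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                              # deletion
                        dp[j - 1] + 1,                          # insertion
                        prev + (ref[i - 1] != hyp[j - 1]))      # substitution
            prev = cur
    return dp[n]

def cer(ref: str, hyp: str) -> float:
    # Character error rate: edit distance normalized by reference length.
    return edit_distance(ref, hyp) / len(ref)

# One substituted character out of nine reference characters.
print(f"{cer('小红书开源音频模型', '小红书开源音频模形'):.2%}")  # prints 11.11%
```

Production ASR evaluation typically also applies text normalization (punctuation, casing, number formats) before scoring, which this sketch omits.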
AI Models Collectively "Can't Understand": MMAR Benchmark Reveals Major Weaknesses in Audio Large Models
量子位· 2025-06-09 05:24
Core Viewpoint
- The MMAR benchmark shows that most AI models struggle significantly with complex audio reasoning tasks, indicating a gap between their capabilities and real-world applicability [1][9][18].

Summary by Sections

MMAR Benchmark Overview
- MMAR stands for "A Challenging Benchmark for Deep Reasoning in Speech, Audio, Music, and Their Mix" and consists of 1,000 high-quality audio understanding questions that require multi-step reasoning [2][3].

Difficulty of MMAR
- The benchmark spans multiple reasoning levels, including signal, perception, semantic, and cultural understanding, with tasks that demand complex reasoning skills and domain-specific knowledge [6][9].

Model Performance
- A total of 30 audio-related models were tested; the best open-source model, Qwen-2.5-Omni, achieved an average accuracy of only 56.7%, while the closed-source Gemini 2.0 Flash led with 65.6% [11][18].
- Most open-source models performed close to random guessing, particularly on music-related tasks, highlighting significant difficulty in recognizing deeper audio information [12][18].

Error Analysis
- The main error types were perceptual errors (37%), reasoning errors (20%), knowledge gaps (9%), and other errors (34%), indicating that current AI models face both auditory and cognitive challenges [19].

Future Outlook
- The researchers call for collaboration on data and algorithmic innovation to improve audio reasoning in AI, hoping for future models that can truly understand audio content and context [20][21].
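To put the headline accuracies next to the "close to random guessing" observation, a small illustrative comparison against a chance baseline. The accuracies come from the article; the four-option multiple-choice format is an assumption, since the summary does not state the answer format:

```python
# Reported MMAR average accuracies (from the article).
scores = {
    "Qwen-2.5-Omni (best open-source)": 0.567,
    "Gemini 2.0 Flash (best closed-source)": 0.656,
}
# Hypothetical random-guess baseline, assuming 4-option multiple choice
# (an assumption; the article summary does not specify the format).
chance = 1 / 4

for model, acc in scores.items():
    print(f"{model}: {acc:.1%} accuracy ({acc - chance:+.1%} vs. chance)")
```

Even the best models sit only 30-40 points above this assumed baseline, which is consistent with the article's claim that deep audio reasoning remains largely unsolved.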