Audio Large Models
A16z Leads $41 Million Investment in Mirelo, a Big Bet on European Audio Large Models
深思SenseAI· 2025-12-27 01:11
Core Insights
- The article discusses the rapid evolution of AI video generation, highlighting the falling marginal cost of video production and significant improvements in generation speed and controllability. It introduces Mirelo AI, a European audio company that recently secured $41 million in seed funding to develop audio models that automatically generate sound effects and music for videos, addressing a major pain point for AI creators [1][2].

Group 1: Company Overview
- Mirelo AI focuses on audio solutions for video content and gaming, offering two main products: Mirelo Studio for creators (B2C) and an API for platforms and enterprises (B2B); a hypothetical API sketch follows this summary [2][5].
- The company was founded by CJ Simon-Gabriel and Florian, both of whom have extensive backgrounds in AI research and music, which informs their approach to audio model development [3][4].

Group 2: Technology and Models
- Mirelo AI has developed two key models: a music model and a video-to-sound-effect model, both of which have performed exceptionally well in evaluations, even against larger competitors [6][12].
- The audio models are significantly smaller than typical large language models and require roughly 50 times less compute, making them more efficient and cost-effective [8][9].

Group 3: Market Position and Strategy
- The company aims to educate the market on the importance of audio in video production, asserting that sound quality can significantly affect viewer engagement and revenue [20][21].
- Mirelo AI plans to expand its team and capabilities across both sound effects and music, while enhancing editing features to serve a broader audience, including professional users [17][19].

Group 4: Funding and Future Outlook
- The $41 million seed round, led by Index Ventures and Andreessen Horowitz, reflects investor confidence in Mirelo AI's technology and team, especially given their ability to achieve leading benchmark results with minimal investment [11][12].
- The company envisions a future in which audio is recognized as a critical component of video content, aiming to integrate its models into various platforms and raise the overall quality of audio in video production [14][16].
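To make the B2B API offering above concrete, here is a minimal sketch of what a video-to-sound-effects request might look like. The article does not describe Mirelo's actual API surface, so the endpoint, parameter names, and response handling below are all hypothetical illustrations, not documented behavior.

```python
import requests

# Hypothetical endpoint and auth; Mirelo's real API is not described
# in the article, so these names are illustrative only.
MIRELO_API_URL = "https://api.example-mirelo.ai/v1/video-to-sfx"
API_KEY = "YOUR_API_KEY"

def generate_sfx(video_path: str) -> bytes:
    """Upload a video and receive a generated sound-effects track (assumed flow)."""
    with open(video_path, "rb") as f:
        resp = requests.post(
            MIRELO_API_URL,
            headers={"Authorization": f"Bearer {API_KEY}"},
            files={"video": f},
            data={"output_format": "wav"},  # assumed parameter
            timeout=120,
        )
    resp.raise_for_status()
    return resp.content  # assumed: raw audio bytes

if __name__ == "__main__":
    audio = generate_sfx("clip.mp4")
    with open("clip_sfx.wav", "wb") as out:
        out.write(audio)
```

The upload-then-download pattern shown here is a common design for media-generation APIs; Mirelo's actual integration could equally be asynchronous (submit a job, poll for the result).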
Unexpectedly, the Most Thorough Open-Sourcer of Audio Large Models Turns Out to Be Xiaohongshu
机器之心· 2025-09-17 09:37
Core Viewpoint
- The article highlights the recent surge of open-source AI models in the audio domain, particularly among domestic Chinese companies, with a focus on Xiaohongshu's advances in building high-quality audio models and fostering an open-source community [1][4][22].

Summary by Sections

Open Source Trends
- In recent months, open source has become a focal point in the AI community, especially among domestic tech companies, with 33 and 31 models open-sourced in July and August respectively [1].
- Most of these open-source efforts concentrate on text, image, video, reasoning, and world models, while audio generation remains a smaller segment [1][2].

Xiaohongshu's Contributions
- Xiaohongshu has kept a steady rhythm of open-sourcing audio technology since last year, releasing models such as FireRedTTS for text-to-speech (TTS) and FireRedASR for automatic speech recognition (ASR), achieving state-of-the-art (SOTA) results [3][4].
- Open-sourcing high-quality audio models enhances Xiaohongshu's technical influence and signals a long-term strategic commitment to open-source development [4][22].

Technical Achievements
- Xiaohongshu's FireRedTTS model enables flexible voice synthesis, imitating a variety of speaking styles with minimal training [6][9].
- FireRedASR has achieved a character error rate (CER) of 3.05%, outperforming closed-source competitors; a sketch of how CER is computed follows this summary [7][8].
- The new FireRedTTS-2 model addresses open challenges in voice synthesis, offering stronger solutions for long-dialogue synthesis and industry-leading performance in audio scene modeling [9][11].

Ecosystem Development
- Xiaohongshu aims to build a comprehensive open-source community around audio models, covering TTS, ASR, and voice dialogue systems, lowering industry entry barriers and fostering innovation [22][23].
- The introduction of FireRedChat, a fully open-source duplex voice dialogue system, is a significant advance, giving developers a complete solution for building their own voice assistants [17][22].

Future Plans
- Xiaohongshu plans to release additional models, including FireRedMusic and FireRedASR-2, to round out its audio technology stack and support a broader range of applications [22][26].
- The company is committed to establishing itself as a leader in the open-source audio domain, focusing on industrial-grade, commercially viable models [23][26].

Industry Impact
- The article emphasizes that open-source initiatives are reshaping the AI landscape, making advanced capabilities accessible to a wider audience and fostering a collaborative environment for innovation [25][26].
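Since the summary cites FireRedASR's 3.05% character error rate, here is a short, self-contained sketch of how CER is conventionally computed: the Levenshtein (edit) distance between the hypothesis and reference character sequences, divided by the reference length. This illustrates the standard metric definition only; it is not FireRedASR project code, and the example strings are invented.

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edit distance / reference length."""
    ref, hyp = list(reference), list(hypothesis)
    # Dynamic-programming table for Levenshtein distance over characters.
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,         # deletion
                dp[i][j - 1] + 1,         # insertion
                dp[i - 1][j - 1] + cost,  # substitution
            )
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# One substitution against a six-character reference -> 1/6 ≈ 0.167
print(cer("语音识别测试", "语音识别侧试"))
```

A reported CER of 3.05% thus means roughly 3 character edits per 100 reference characters, averaged over the test set.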
AI Collectively "Fails to Understand"! MMAR Benchmark Reveals Major Shortcomings of Audio Large Models
量子位· 2025-06-09 05:24
Core Viewpoint
- The MMAR benchmark reveals that most AI models struggle significantly with complex audio reasoning tasks, indicating a gap in their practical applicability to real-world scenarios [1][9][18].

Summary by Sections

MMAR Benchmark Overview
- MMAR, billed as "A Challenging Benchmark for Deep Reasoning in Speech, Audio, Music, and Their Mix," consists of 1,000 high-quality audio understanding questions that require multi-step reasoning [2][3].

Difficulty of MMAR
- The benchmark spans several reasoning levels, including signal, perception, semantic, and cultural understanding, with tasks requiring complex reasoning skills and domain-specific knowledge [6][9].

Model Performance
- Thirty audio-related models were tested; the best open-source model, Qwen-2.5-Omni, achieved an average accuracy of only 56.7%, while the closed-source Gemini 2.0 Flash led with 65.6% (a scoring sketch follows this summary) [11][18].
- Most open-source models performed close to random guessing, particularly on music-related tasks, highlighting how hard it remains to recognize deeper audio information [12][18].

Error Analysis
- The main error types were perceptual errors (37%), reasoning errors (20%), knowledge gaps (9%), and other errors (34%), indicating that current AI models face both auditory and cognitive challenges [19].

Future Outlook
- The researchers call for collaborative innovation in data and algorithms to improve AI's audio reasoning capabilities, hoping for future models that can truly understand audio content and context [20][21].
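To ground the accuracy figures above, here is a minimal sketch of how a multiple-choice benchmark like MMAR is typically scored: each model answer is compared against the reference, and accuracy is aggregated overall and per category (speech, sound, music, mix). The record structure and field names are assumptions for illustration, not MMAR's actual release format.

```python
from collections import defaultdict

# Hypothetical result records; MMAR's real data format may differ.
results = [
    {"category": "speech", "prediction": "B", "answer": "B"},
    {"category": "music",  "prediction": "A", "answer": "C"},
    {"category": "mix",    "prediction": "D", "answer": "D"},
]

def score(records):
    """Return overall accuracy and a per-category breakdown."""
    total, correct = 0, 0
    per_cat = defaultdict(lambda: [0, 0])  # category -> [correct, total]
    for r in records:
        hit = r["prediction"].strip().upper() == r["answer"].strip().upper()
        total += 1
        correct += hit
        per_cat[r["category"]][0] += hit
        per_cat[r["category"]][1] += 1
    overall = correct / total
    breakdown = {cat: ok / n for cat, (ok, n) in per_cat.items()}
    return overall, breakdown

overall, breakdown = score(results)
print(f"overall accuracy: {overall:.1%}")  # e.g. Qwen-2.5-Omni reached 56.7% on the full set
for cat, acc in sorted(breakdown.items()):
    print(f"{cat}: {acc:.1%}")
```

The per-category breakdown is what exposes findings like near-random performance on music tasks, since a strong overall average can mask weakness in one category.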