机器之心
LLaSO arrives: 逻辑智能 releases the world's first fully open-source speech LLM framework, defining a new benchmark for LSLM research
机器之心· 2025-09-14 05:16
Paper title: LLaSO: A Foundational Framework for Reproducible Research in Large Language and Speech Model. Amid the wave of large language models (LLMs), multimodal AI has advanced rapidly, and the vision-language (LVLM) field in particular has settled into a mature research paradigm. In sharp contrast, progress on large speech-language models (LSLMs) has been fragmented and slow. The field has long suffered from fragmented architectures, opaque training data, and missing evaluation standards, making fair comparison between studies difficult and seriously hindering reproducibility and systematic community progress. Many works release model weights, but the keys to their success, the training data and configuration details, are often kept under wraps. To break this deadlock, 北京深度逻辑智能科技有限公司 has released LLaSO, the first fully open, end-to-end research framework for speech-language models. LLaSO aims to give the whole community a unified, transparent, and reproducible infrastructure. Its contribution is a full-stack bundle: a complete set of open-source data, benchmarks, and models, intended to accelerate community-driven innovation in the LSLM field. Paper: https://arxiv.org/abs/2508.1 ...
I'd pay extra just for this Tab key: Cursor optimizes code suggestions with online reinforcement learning. Does it have a moat now?
机器之心· 2025-09-14 03:07
Editor: +0. Cursor Tab is one of Cursor's core features: it analyzes a developer's coding behavior and intelligently predicts the code that comes next, which the developer can accept with a single press of Tab. It also faces a problem common to AI assistants, however: over-eagerness. At times its suggestions are not just useless but actively interrupt the developer's train of thought. The crux is not only making the AI write better code, but teaching it to read the room: offer help at exactly the right moment and stay quiet the rest of the time. To that end, Cursor used online reinforcement learning to train a brand-new Tab model. The model treats every user interaction (accepting or rejecting a suggestion) as a reinforcement signal and feeds it directly into online optimization. Driven by a volume of more than 400 million requests per day, the model keeps learning at high frequency from real-world feedback. Cursor has made the new Tab model the default. Compared with the old model, it offers 21% fewer suggestions, yet the acceptance rate of those suggestions is 28% higher. The change is meant to improve the coding experience, and Cursor plans to keep deepening this line of research. Cursor's strategy is distinctive and efficient: it deploys new models to users several times a day (every 1.5 to 2 hours), using live data for rapid training and optimization. This stands in contrast to the mainstream approach of ...
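Cursor has not published its exact algorithm, but the core idea, treating each accept or reject as a reward for an on-policy update, can be sketched with a toy REINFORCE-style policy that decides whether to show a suggestion at all. The feature encoding, learning rate, and simulated feedback below are illustrative assumptions, not Cursor's actual model.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

class SuggestPolicy:
    """Toy show/don't-show policy trained from accept/reject feedback.

    A minimal REINFORCE sketch of online learning from user interactions;
    the features and reward scheme are assumptions for illustration.
    """

    def __init__(self, dim, lr=0.1):
        self.w = [0.0] * dim
        self.lr = lr

    def prob_show(self, x):
        return sigmoid(sum(wi * xi for wi, xi in zip(self.w, x)))

    def update(self, x, shown, reward):
        # REINFORCE: step along reward * grad log pi(action | x).
        p = self.prob_show(x)
        grad = (1.0 - p) if shown else -p
        for i, xi in enumerate(x):
            self.w[i] += self.lr * reward * grad * xi

random.seed(0)
policy = SuggestPolicy(dim=2)
for _ in range(500):
    mid_thought = random.random() < 0.5        # hypothetical context feature
    x = [1.0, 1.0 if mid_thought else 0.0]     # [bias, user-is-mid-thought]
    if random.random() < policy.prob_show(x):  # sample the "show" action
        reward = -1.0 if mid_thought else 1.0  # reject (-1) vs. accept (+1)
        policy.update(x, shown=True, reward=reward)
```

After training on the simulated stream, the policy suggests less often in the contexts where suggestions get rejected, which is the behavioral shift the article describes (fewer suggestions, higher acceptance rate).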
Large models finally meet genuinely hard problems: of 500 questions tested, o3 Pro passes only 15%
机器之心· 2025-09-14 03:07
Core Insights - The article discusses the development of a new benchmark called UQ (Unsolved Questions) to evaluate the capabilities of large language models, focusing on unsolved problems that reflect real-world challenges [2][3][5] - UQ consists of 500 challenging questions sourced from the Stack Exchange community, designed to assess reasoning, factual accuracy, and browsing capabilities of models [3][8] - The study highlights the limitations of existing benchmarks, which often prioritize difficulty over real-world applicability, and proposes a continuous evaluation method through community validation [1][5] Group 1 - UQ is a test set of 500 unsolved questions covering various topics, including computer science, mathematics, and history, aimed at evaluating model performance in a realistic context [3][8] - The selection process for UQ involved multiple filtering stages, reducing an initial pool of approximately 3 million questions to 500 through rule-based, model-based, and manual reviews [10][11] - The best-performing model in the UQ validation only succeeded in answering 15% of the questions, indicating the high difficulty level of the benchmark [5][7] Group 2 - The UQ validation process employs a composite verification strategy that leverages the strengths of different models to assess candidate answers without requiring standard answers [14][26] - The study found that using a composite validator significantly reduces self-bias and over-optimism in model evaluations, which is a common issue when models assess their own performance [24][25][26] - Results showed that a stronger answer generation model does not necessarily correlate with better answer validation performance, highlighting the complexity of model capabilities [27][28]
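The composite-validator idea, pooling judgments from several models while excluding a model's verdict on its own answer to curb self-bias, can be sketched as follows. The validator interface, stub judges, and majority threshold are illustrative assumptions, not the paper's exact pipeline.

```python
def composite_validate(author, validators, question, answer, threshold=0.5):
    """Toy composite validation for an answer that has no reference solution.

    Each validator is a callable (question, answer) -> bool. The verdict of
    the model that authored the answer is excluded, so no model grades its
    own work, which is the self-bias the UQ study warns about.
    """
    votes = [judge(question, answer)
             for name, judge in validators.items()
             if name != author]
    if not votes:
        return False
    return sum(votes) / len(votes) >= threshold

# Stub judges standing in for real LLM validators.
validators = {
    "model_a": lambda q, a: len(a) > 10,   # shallow length heuristic
    "model_b": lambda q, a: "because" in a,
    "model_c": lambda q, a: True,          # over-optimistic judge
}

verdict = composite_validate(
    "model_c", validators,
    question="Why is the sky blue?",
    answer="Rayleigh scattering, because shorter wavelengths scatter more.",
)
```

Note that when the authoring model is the only validator available, the candidate cannot pass at all, which is the strictest possible guard against self-assessment.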
Xiaohongshu's intelligent-creation audio team: SOTA dialogue generation model FireRedTTS-2 is here, making AI podcasts easy!
机器之心· 2025-09-14 03:07
Core Insights - The article discusses the launch of FireRedTTS-2, a new conversational speech synthesis model developed by Xiaohongshu's audio technology team, which addresses existing issues in dialogue synthesis such as poor flexibility, high pronunciation errors, unstable speaker switching, and unnatural prosody [2][24]. Group 1: Model Features and Improvements - FireRedTTS-2 upgrades two core modules of the TTS system: a discrete speech encoder and a text-to-speech model, enhancing synthesis quality and flexibility [11][24]. - The discrete speech encoder operates at a low frame rate of 12.5Hz, compressing continuous speech signals into discrete label sequences, which reduces the length of speech sequences and improves processing speed [14][16]. - The text-to-speech model supports sentence-by-sentence generation, allowing for easier editing and adaptation to various scenarios, and utilizes a "dual Transformer" architecture to generate more natural and coherent dialogue [17][18]. Group 2: Performance Evaluation - FireRedTTS-2 outperforms other systems like MoonCast, ZipVoice-Dialogue, and MOSS-TTSD in both subjective and objective metrics, significantly reducing pronunciation errors and improving prosody [20][24]. - In subjective evaluations, 28% of samples were rated as more natural than real podcast recordings, with 56% of samples achieving a naturalness level that meets or exceeds real recordings [22][24]. Group 3: Application and Future Prospects - The model supports multiple languages including Chinese, English, Japanese, Korean, and French, making it a versatile tool for generating high-quality audio data for various applications [7][24]. - Future developments will focus on expanding the number of supported speakers and languages, as well as introducing controllable sound effects [25].
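The practical effect of the 12.5 Hz encoder is easy to quantify, since discrete-code sequence length scales linearly with frame rate. A quick sketch (the 50 Hz comparison rate is an assumed figure for illustration, not one from the article):

```python
def speech_tokens(duration_s: float, frame_rate_hz: float) -> int:
    """Length of the discrete token sequence for an utterance."""
    return int(duration_s * frame_rate_hz)

# At FireRedTTS-2's 12.5 Hz, one minute of speech becomes 750 tokens;
# a hypothetical 50 Hz codec would need four times as many.
low = speech_tokens(60, 12.5)    # 12.5 Hz encoder
high = speech_tokens(60, 50.0)   # assumed higher-rate codec
```

Shorter sequences are what make long multi-speaker dialogues tractable for the text-to-speech model downstream.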
The "split" and "merge" of post-training: is a grand unification of SFT & RL the right answer?
机器之心· 2025-09-14 01:30
Group 1 - The article discusses the limitations of the traditional "SFT followed by RL" paradigm in post-training for AI models, suggesting a unified approach that combines both methods [7][9][10] - It highlights the importance of post-training in aligning the model's capabilities with human values and preferences, addressing the challenges of "catastrophic forgetting" and overfitting associated with SFT [8][11][12] - The emerging trend in the industry is to explore a unified framework for post-training that leverages the strengths of both SFT and RL, rather than treating them as separate processes [10][15][17] Group 2 - The article evaluates the competitive landscape of AI hardware among major players like Meta, OpenAI, Apple, and Google, questioning whether AI hardware will become a new essential or merely a passing trend [2] - It raises questions about the user experience with AI hardware, such as whether it will truly replace traditional devices or simply serve as an additional feature [2][3] - The potential for innovative AI hardware forms to integrate seamlessly into daily life is explored, along with the implications for user interaction and technology adoption [2][3] Group 3 - The article examines the role of generative AI in search, debating whether it will serve as a replacement for traditional search engines or act as a growth engine for expanding user queries and intentions [3] - It discusses how multimodal interactions and conversational AI are redefining task completion for users, potentially enhancing the value of advertising and commercial opportunities [3] - Google's strategy of gradually integrating AI capabilities into its products, rather than waiting for full technological maturity, reflects a proactive approach to product development and market positioning [3]
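A unified post-training objective is often formalized as an interpolation between a supervised imitation term and a reward-driven policy-gradient term. The sketch below is one illustrative formulation under assumed inputs (per-token log-probabilities and a scalar reward); it is not a specific method from the article.

```python
def unified_loss(logp_demo, logp_sampled, reward, baseline=0.0, alpha=0.5):
    """Toy unified objective: alpha weights a supervised negative
    log-likelihood on demonstration tokens, (1 - alpha) weights a
    REINFORCE surrogate on a sampled response.
    alpha=1 recovers pure SFT; alpha=0 recovers pure RL."""
    sft_term = -sum(logp_demo) / len(logp_demo)          # imitation (NLL)
    rl_term = -(reward - baseline) * sum(logp_sampled)   # policy gradient
    return alpha * sft_term + (1.0 - alpha) * rl_term

loss = unified_loss(
    logp_demo=[-1.0, -1.0],   # assumed token log-probs of a demonstration
    logp_sampled=[-2.0],      # assumed log-prob of a sampled response
    reward=1.0,
)
```

Tuning alpha during training is one way such a framework could trade imitation against reward-seeking instead of running the two stages back to back.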
Kuaishou's Kling team proposes MIDAS: 64x compression, sub-500 ms latency, a multimodal interactive digital-human framework that breaks new ground in interactive generation
机器之心· 2025-09-13 08:54
Core Viewpoint - The article discusses the rapid development of digital human video generation technology, highlighting the introduction of the MIDAS framework by Kuaishou's Kling Team, which addresses significant challenges in real-time, multimodal control, and long-term consistency in digital human interactions [2][16]. Group 1: MIDAS Framework Overview - MIDAS (Multimodal Interactive Digital-human Synthesis) combines autoregressive video generation with lightweight diffusion denoising heads to achieve real-time, smooth digital human video synthesis under multimodal conditions [2][5]. - The system demonstrates three core advantages: high compression rates, low latency, and efficient denoising, making it suitable for real-time interactive applications [4][14]. Group 2: Technical Innovations - The framework utilizes a 64× compression ratio autoencoder, reducing each frame to a maximum of 60 tokens, significantly lowering computational load [4][8]. - MIDAS supports various input signals, including audio, posture, and text, through a unified multimodal condition projector that encodes different modalities into a shared latent space [5][12]. - The model architecture employs a Qwen2.5-3B autoregressive backbone with a diffusion head based on PixArt-α/mlp structure, ensuring coherence in generated outputs while minimizing computational delays [12][16]. Group 3: Training and Data - A large-scale multimodal dialogue dataset of approximately 20,000 hours was constructed to train the model, encompassing single and dual dialogue scenarios across multiple languages and styles [10][12]. - The training strategy includes controllable noise injection to mitigate exposure bias during inference, enhancing the model's performance [12]. Group 4: Application Scenarios - MIDAS can generate real-time dual-person dialogue, synchronizing lip movements, expressions, and listening postures with audio streams [13]. 
- The model achieves cross-language singing synthesis without explicit language identifiers, maintaining lip-sync across Chinese, Japanese, and English songs for videos up to 4 minutes long [13][14]. - MIDAS demonstrates potential as an interactive world model by responding to directional control signals in environments like Minecraft, showcasing scene consistency and memory capabilities [13][14]. Group 5: Future Directions - The team plans to explore higher resolution and more complex interaction logic in future developments, aiming to deploy the system in real product environments [17].
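The numbers reported for MIDAS imply a tight token budget for real-time generation, since sequence cost per second is simply frames per second times tokens per frame. The 25 fps rate below is an assumed video frame rate for illustration.

```python
def tokens_per_second(fps: int, tokens_per_frame: int) -> int:
    """Autoregressive token throughput needed to stream video frames."""
    return fps * tokens_per_frame

# At MIDAS's cap of 60 tokens per frame, a hypothetical 25 fps stream asks
# the backbone for 1500 tokens/s; every token saved per frame cuts that
# load directly, which is why the 64x-compression autoencoder matters for
# the sub-500 ms latency target.
budget = tokens_per_second(25, 60)
```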
Tsinghua, Shanghai AI Lab, and other top teams release a comprehensive survey of RL for reasoning models, exploring the road to superintelligence
机器之心· 2025-09-13 08:54
Core Insights - The article emphasizes the significant role of Reinforcement Learning (RL) in enhancing the reasoning capabilities of large language models (LLMs), marking a pivotal shift in artificial intelligence development [2][5][16] - It highlights the emergence of Large Reasoning Models (LRMs) that utilize RL to improve reasoning through verifiable rewards, showcasing advancements in complex tasks such as mathematics and programming [3][5][10] Summary by Sections Introduction - The introduction outlines the historical context of RL since its inception in 1998 and its evolution into a crucial method for training intelligent agents to surpass human performance in complex environments [2] Recent Trends - A new trend is emerging where researchers aim to enhance models' reasoning abilities through RL, moving beyond mere compliance to actual reasoning skills [3][5] Overview of RL in LRM - The article reviews recent advancements in RL applied to LLMs, noting significant achievements in complex logical tasks, and identifies RL as a core method for evolving LLMs into LRMs [5][12] Foundational Components - The foundational components of RL for LRMs include reward design, policy optimization, and sampling strategies, which are essential for effective model training [13][14] Foundational Problems - Key challenges in RL for LRMs include the design of appropriate reward signals, efficient scaling under computational and data constraints, and ensuring reliability in practical applications [12][16] Training Resources - The article discusses the necessary training resources, including static corpora, dynamic environments, and RL infrastructure, emphasizing the need for standardization and development [13][15] Applications - RL has been applied across various tasks, including coding, agentic tasks, multimodal tasks, and robotics, showcasing its versatility and potential for broader applications [13][15] Future Directions - Future research directions for RL in LLMs include the development of new algorithms, mechanisms, and functionalities to further enhance reasoning capabilities and address existing challenges [15][16]
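The "verifiable reward" at the heart of this line of work replaces a learned reward model with a programmatic check. A minimal sketch for a math-style task follows; the answer-extraction convention (last whitespace-separated token) is an assumption, as real pipelines parse structured formats such as \boxed{} answers or run unit tests for code.

```python
def verifiable_reward(response: str, reference: str) -> float:
    """Binary verifiable reward: 1.0 iff the response's final token matches
    the reference answer, else 0.0. Cheap, unambiguous checks like this
    (exact answers for math, test suites for code) are what make RL on
    reasoning tasks scale without a learned reward model."""
    tokens = response.strip().split()
    final = tokens[-1] if tokens else ""
    return 1.0 if final == reference.strip() else 0.0
```

Because the signal is exact rather than estimated, it sidesteps reward hacking against a learned preference model, at the cost of only applying to tasks with checkable answers.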
Meta open-sources MobileLLM-R1: under 1B parameters, it surpasses Qwen3 with one-tenth the training
机器之心· 2025-09-13 08:54
Core Viewpoint - Meta AI has officially released the MobileLLM-R1 series, which includes efficient sub-billion parameter language models optimized for on-device use cases, demonstrating significant performance improvements compared to existing open-source models [4][8]. Group 1: Model Performance and Features - The MobileLLM-R1 series includes three base models: MobileLLM-R1-140M, MobileLLM-R1-360M, and MobileLLM-R1-950M, which are not general chat models but are supervised fine-tuned (SFT) for specific tasks such as mathematics, programming (Python, C++), and scientific questions [6][8]. - The largest model, MobileLLM-R1-950M, was pre-trained using approximately 2 trillion high-quality tokens, achieving performance comparable to models trained on 36 trillion tokens, such as Qwen3 0.6B [8]. - MobileLLM-R1-950M outperforms existing models in various benchmarks, achieving five times higher accuracy on the MATH benchmark compared to the Olmo 1.24B model and twice as high as the SmolLM2 1.7B model [10]. Group 2: Model Architecture and Efficiency - The architecture of the MobileLLM-R1 models includes varying layers and parameters, with MobileLLM-R1-950M having 22 layers and 949 million parameters, while the smaller models have 15 layers and 140 million to 360 million parameters [14]. - The models are designed for text input and output, with a context length of 4k for base models and 32k for final models, supporting a vocabulary size of 128k [15]. Group 3: Research and Development Team - The development of the MobileLLM-R1 series was led by a team of researchers, including Zechun Liu, Ernie Chang, and Changsheng Zhao, who have extensive backgrounds in natural language processing and model optimization [18][21][30]. - The project took a year to develop, focusing on efficient deployment and optimization of large language models for resource-constrained environments [18][22].
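One reason sub-billion models need careful architecture work is that the reported 128k vocabulary alone consumes a large share of the parameter budget. A quick sketch (the 512 hidden size and the flat 128,000-entry vocabulary are assumed round numbers, not MobileLLM-R1's actual dimensions):

```python
def embedding_params(vocab_size: int, hidden_dim: int, tied: bool = True) -> int:
    """Parameters in the token-embedding table (doubled if the output
    projection is untied from the input embedding)."""
    table = vocab_size * hidden_dim
    return table if tied else 2 * table

# With an assumed 128,000-entry vocabulary and hidden size 512, the tied
# embedding alone costs 65,536,000 parameters, nearly half of a
# 140M-parameter budget; untied, it would blow past it entirely.
cost = embedding_params(128_000, 512)
```

This arithmetic is why small on-device models typically tie input and output embeddings and spend their remaining budget on depth, as the 15- to 22-layer configurations above suggest.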
AI hardware: will it bring the next "Apple", or prove a flash in the pan?
机器之心· 2025-09-13 01:30
Group 1: Core Insights - The article discusses the potential shift from smartphones to AI hardware, suggesting that the next major leap in consumer technology may come from a revolutionary device that could render smartphones obsolete [5][6]. - Major tech companies like Meta, OpenAI, Apple, and Google are positioning themselves in the AI hardware space, with a focus on devices that integrate AI capabilities as foundational infrastructure [8]. Group 2: AI Hardware Landscape - The global wearable technology market is projected to grow from approximately $120 billion in 2023 to around $158 billion in the coming years, indicating a significant expansion in the AI hardware sector [9]. - Various innovative AI hardware products are emerging, including smart glasses, health-monitoring rings, and AI-enabled earbuds, showcasing diverse interaction forms and functionalities [9]. Group 3: Company Strategies - Meta plans to release multiple tiers of AI glasses within the next five years, emphasizing the importance of AI functionality for future cognitive advantages [5]. - OpenAI is collaborating with former Apple designer Jony Ive to launch a next-generation portable device by 2026 that relies solely on cameras and microphones for interaction [5]. - Google is developing new AI assistants and Android XR glasses, aiming to enhance user experience through real-time interaction and improved language understanding [7].
Diffusion language models get an MoE version! Ant Group and Renmin University train LLaDA-MoE from scratch, full open-sourcing coming soon
机器之心· 2025-09-12 11:31
Core Viewpoint - The article discusses the development of the LLaDA-MoE model, the first native MoE architecture diffusion language model trained from scratch, which demonstrates significant performance and efficiency advantages over traditional autoregressive models [2][15][18]. Group 1: Model Development and Performance - The LLaDA-MoE model was trained from scratch on roughly 20T tokens of data and features 1.4 billion active parameters, achieving performance comparable to dense autoregressive models like Qwen2.5-3B while maintaining faster inference speeds [15][17][29]. - The LLaDA series has rapidly evolved, with LLaDA-MoE being a notable milestone, surpassing previous models like LLaDA1.0/1.5 and Dream-7B in various benchmark tests [13][18][29]. - The model's architecture allows for significant scaling potential, with plans to explore higher sparsity ratios and larger MoE diffusion language models [29][40]. Group 2: Technical Innovations and Advantages - The diffusion model approach allows for parallel decoding, bidirectional modeling, and iterative correction, addressing limitations of autoregressive models such as serial bottlenecks and lack of error correction capabilities [38][40]. - Evidence suggests that diffusion language models can achieve better learning outcomes than autoregressive models, particularly in scenarios with limited data, demonstrating a data utilization efficiency that can exceed three times that of autoregressive models [40][41]. - The training framework and infrastructure developed by Ant Group, including the ATorch framework, supports the efficient training of large-scale MoE models [25][26]. Group 3: Strategic Vision and Future Directions - The development of LLaDA-MoE reflects a strategic choice to explore high-potential areas in AI, moving beyond established paths to enhance the limits of intelligence [44][47]. 
- Ant Group's commitment to innovation is evident in its previous projects and ongoing research in areas like dynamic MoE architectures and hybrid linear architectures, all aimed at achieving general artificial intelligence (AGI) [45][46][47].
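The parallel-decoding advantage cited above comes from the masked-diffusion formulation: generation starts from a fully masked sequence and iteratively fills positions in parallel rather than strictly left to right. A toy sketch with a stand-in predictor follows; the fill schedule and predictor are illustrative, not LLaDA-MoE's actual sampler.

```python
def diffusion_decode(length, predict, steps):
    """Toy masked-diffusion decoding: begin with every position masked
    (None) and, over a fixed number of steps, fill batches of masked
    positions in parallel. Any position can be filled at any step, unlike
    strictly left-to-right autoregressive decoding."""
    seq = [None] * length
    per_step = -(-length // steps)  # ceil division: finish within `steps`
    for _ in range(steps):
        masked = [i for i, tok in enumerate(seq) if tok is None]
        if not masked:
            break
        for i in masked[:per_step]:
            seq[i] = predict(seq, i)  # fill a batch of positions per step
    return seq

# Stand-in predictor: just returns the position index as the "token".
decoded = diffusion_decode(4, lambda seq, i: i, steps=2)
```

Because each step conditions on the whole partially filled sequence, a real sampler can also re-mask and correct earlier choices, which is the iterative-correction capability the summary contrasts with autoregressive models.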