机器之心
Large Models Understand Speech but Get Dumber? CUHK-Shenzhen and Microsoft Jointly Tackle the Intelligence Drop in Speech LLMs
机器之心· 2026-01-17 03:24
Core Insights
- The article discusses the challenge speech large language models (Speech LLMs) face in maintaining logical reasoning capability when moving from text to speech input, a phenomenon termed the "Modality Reasoning Gap" [2][3][10]
- Major players such as OpenAI, Google, and Meta are grappling with this issue, as evidenced by accuracy dropping from 92% on text-to-text tasks to 66% on speech-to-speech tasks for models like GPT-4o [3]
- The article introduces TARS (Trajectory Alignment for Reasoning in Speech), a new framework from the Chinese University of Hong Kong, Shenzhen (CUHK-Shenzhen) and Microsoft that uses reinforcement learning to align the reasoning process under speech input with that under text input, restoring and even surpassing text-level reasoning [7][30]

Group 1: Challenges in Speech LLMs
- Introducing speech input leads to a drastic decline in reasoning ability, with a noted 26-percentage-point accuracy drop when switching from text to speech [3][10]
- Existing approaches to bridging this gap, such as input alignment and output memorization, have proven inadequate because of the inherent differences between speech and text [11][12]
- The article highlights the notion of a "Multimodal Tax": including audio data detracts from the model's pure reasoning capability [3]

Group 2: TARS Framework Innovations
- TARS uses on-policy reinforcement learning to dynamically align the reasoning trajectories of speech and text, rather than relying on static memorization [12][30]
- Key innovations in TARS include:
  - **Representation Alignment**: computing the cosine similarity between the hidden states produced by speech and text inputs at each layer, and using it as a reward for staying aligned (a minimal sketch follows this summary) [15][16]
  - **Behavior Alignment**: instead of requiring exact token matches, TARS scores semantic consistency with external embedding models, allowing more flexible outputs [17][21]
  - **Asymmetric Reward and Modality Normalization**: the reward scheme incentivizes the speech branch to catch up with the text branch while normalizing rewards across modalities to sustain continuous improvement [22][23]

Group 3: Experimental Results and Impact
- TARS demonstrates a 100% restoration of reasoning capability in speech models, with significant gains on challenging benchmarks [24][28]
- The framework shows that a speech model's reasoning can not only match but exceed the text model's, with a reported reasoning recovery rate of 100.45% in experiments [33]
- TARS outperforms existing state-of-the-art methods, establishing itself as a leading approach for speech LLMs [33]
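To make the Representation Alignment reward concrete, here is a minimal sketch assuming per-layer hidden states are available from a speech-input and a text-input forward pass. The mean-pooling over sequence positions and the uniform averaging across layers are illustrative assumptions, not necessarily the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def representation_alignment_reward(speech_hiddens, text_hiddens):
    # Average layer-wise cosine similarity between the hidden states the
    # model produces for the speech input and for the text input.
    rewards = []
    for h_s, h_t in zip(speech_hiddens, text_hiddens):
        # h_s: (batch, speech_len, dim); h_t: (batch, text_len, dim).
        # Mean-pool over the sequence so the modalities are comparable
        # even though their token counts differ (pooling is an assumption).
        sim = F.cosine_similarity(h_s.mean(dim=1), h_t.mean(dim=1), dim=-1)
        rewards.append(sim)
    return torch.stack(rewards).mean(dim=0)  # (batch,) reward per example

# usage with dummy hidden states from a hypothetical 4-layer model
speech_hiddens = [torch.randn(2, 50, 768) for _ in range(4)]
text_hiddens = [torch.randn(2, 12, 768) for _ in range(4)]
reward = representation_alignment_reward(speech_hiddens, text_hiddens)  # values in [-1, 1]
```

A reward of this shape plugs directly into an on-policy RL objective as an auxiliary term alongside the task reward, which matches the dynamic-alignment framing above.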
8,300 Hours of Annotated Data Open-Sourced: Pixel2Play, a New-Generation Real-Time General-Purpose Game AI, Released
机器之心· 2026-01-17 03:24
Core Insights
- The article discusses advances in AI models for gaming, focusing on the Pixel2Play (P2P) model developed by researchers at Player2, which aims to improve AI performance in real-time gaming environments [2][5]

Group 1: Model Development
- The P2P model takes game visuals and text instructions as inputs and generates the corresponding keyboard and mouse control signals, achieving over 20 Hz end-to-end inference on a consumer-grade RTX 5090 graphics card (a minimal sketch of this input-to-action pipeline follows this summary) [2]
- P2P was trained on over 40 games, totaling more than 8,300 hours of gameplay data, and can play multiple Roblox and Steam games zero-shot [2]
- The model is built from scratch on a lightweight framework, pairing a decoder Transformer with a lightweight action decoder that yields a fivefold speedup in inference [10]

Group 2: Training Data and Open Source
- High-quality "visual-action" data is scarce online, prompting the Open-P2P project to open-source all of its training data to fill this gap [5][3]
- The training data comprises game images, text instructions, and precise keyboard and mouse annotations, all crucial for training effective game AI models [8][5]

Group 3: Model Evaluation
- P2P was evaluated at four model sizes, with parameters ranging from 150M to 1.2B, achieving inference speeds of 80 Hz for the 150M model and 40 Hz for the 1.2B model [12]
- In human evaluations, the 1.2B model was preferred over smaller models at rates of 80%, 83%, and 75% across different games, indicating superior performance [13]
- The model's ability to follow text instructions significantly improved its task success rate, demonstrating strong understanding and execution capabilities [15]

Group 4: Causal Reasoning
- The article highlights the challenge of causal confusion in behavior cloning, particularly in high-frequency interaction environments, and notes that increasing model size and training data improves the model's grasp of causal relationships [17]
- As training data and model parameters increase, P2P's performance on causal-inference evaluations trends upward [19]
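Below is a minimal sketch of the pixels-plus-instruction-to-action pipeline described above, assuming a toy Transformer backbone and a two-headed action decoder (key logits plus mouse deltas). The backbone type, the key set, and the choice to act from the final token's state are assumptions for illustration; P2P's actual architecture may differ.

```python
import torch
import torch.nn as nn

class ActionDecoderHead(nn.Module):
    """Lightweight action head: maps a hidden state to per-key press
    probabilities and a continuous (dx, dy) mouse delta."""
    def __init__(self, dim, n_keys=32):
        super().__init__()
        self.key_head = nn.Linear(dim, n_keys)  # one logit per key
        self.mouse_head = nn.Linear(dim, 2)     # mouse movement (dx, dy)

    def forward(self, h):
        return torch.sigmoid(self.key_head(h)), self.mouse_head(h)

# one control step at ~20 Hz: encode frame + instruction tokens with a
# small Transformer backbone (an encoder stack here for brevity; the
# article describes a decoder Transformer), then decode one action
dim = 256
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), num_layers=2)
head = ActionDecoderHead(dim)

frame_tokens = torch.randn(1, 64, dim)  # stand-in for frame patch embeddings
text_tokens = torch.randn(1, 8, dim)    # stand-in for instruction embeddings
h = backbone(torch.cat([frame_tokens, text_tokens], dim=1))
key_probs, mouse_delta = head(h[:, -1])  # act from the final token's state
```

The small, separate action head is what makes a sub-50 ms control loop plausible: the expensive backbone runs once per frame, and action decoding adds only two linear layers.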
Ads Come to ChatGPT, and Overnight Netizens Worldwide Lose Their Cool
机器之心· 2026-01-17 03:24
Editors: Zenan, Yang Wen

The day has finally come. In the early hours of Saturday, an announcement from OpenAI set off an uproar: they plan to put ads in ChatGPT.

Netizens feel hurt by this. Some point out that a major reason people use large models in the first place is to avoid ads and look up information more cleanly, so what is ChatGPT doing putting ads back in? Others read the move as a sign that OpenAI is under serious revenue pressure.

Pedro Domingos, professor emeritus at the University of Washington and a well-known AI scholar, quipped: OpenAI has finally achieved AGI, except this AGI stands for Ad-Generated Income.

OpenAI's announcement says ad testing will launch first in the US over the coming weeks; the users who will see ads include the free tier as well as a new paid tier, ChatGPT Go.

ChatGPT's "lite membership," at $8 per month

Before the ads appeared, OpenAI officially announced that ChatGPT Go is now live worldwide, available in every country where ChatGPT is supported. ChatGPT Go is its low-price subscription plan at $8 per month, offering 10x the message quota of the free tier, file upload and image generation, larger memory, a longer context window, and unlimited use of ...
Jensen Huang's Start-of-Year Conversation: How Did AI in 2025 Shape the Industry's "Five-Layer Cake"?
机器之心· 2026-01-17 02:30
Group 1: Core Views
- Jensen Huang emphasizes that AI is not merely replacing human jobs but reshaping the tasks and purpose within work [1]
- The cost of AI is falling at more than 10x per year, which challenges the "AI bubble" narrative [1]

Group 2: Five-Layer Cake Model
- Huang introduces a "Five-Layer Cake" model that traces a complete value-transformation chain from energy to application [5][9]
- The model starts with energy conversion and chips as the physical foundation, then extends to an infrastructure layer integrating data centers, power, and software orchestration [9]
- The core model layer focuses on understanding diverse kinds of information, not just chatbots, while the top layer covers applications such as autonomous driving and robotics [9][10]

Group 3: Token Economics
- AI's evolution is being driven by the MoE (Mixture of Experts) architecture, which sharply reduces training and inference costs by activating only a fraction of the model per token (a generic routing sketch follows this summary) [7]
- Huang predicts that the cost of token generation will fall a billionfold over the next decade, driven by hardware performance upgrades and continuous optimization of algorithms and models [6]
- High-value token businesses such as OpenEvidence have demonstrated strong profitability, with gross margins reaching 90% [6]

Group 4: Open Source and Innovation
- The open-source ecosystem plays a crucial role in accelerating technological dissemination and innovation by removing barriers to entry [10]
- Open-source models let startups and research institutions build on existing models, significantly shortening R&D timelines [10][12]
- Efforts like DeepSeek validate the synergy between high-performance MoE models and hardware, helping close the gap with closed-source solutions [11]

Group 5: AI and Sustainable Energy
- AI is driving substantial industrial growth by pushing the chip, supercomputing, and smart-factory supply chains from the virtual into the physical [13]
- Huang identifies energy as the core issue for new industrial development, with AI acting as a powerful force in the global transition to sustainable energy [13]
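For readers unfamiliar with why MoE cuts cost, here is a generic top-k routing sketch: each token activates only k of n expert FFNs, so compute per token stays roughly constant as total parameter count grows. This illustrates the sparse-activation principle only, not any vendor-specific implementation; all sizes below are arbitrary.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    """Minimal top-k MoE layer: a learned router sends each token to
    k of n expert FFNs, so only a fraction of parameters is active
    per forward pass."""
    def __init__(self, dim, n_experts=8, k=2):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
             for _ in range(n_experts)])
        self.k = k

    def forward(self, x):  # x: (tokens, dim)
        # top-k routing probabilities and chosen expert indices per token
        weights, idx = F.softmax(self.router(x), dim=-1).topk(self.k, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):  # dispatch each token to its k experts
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

moe = TinyMoE(64)
y = moe(torch.randn(10, 64))  # only 2 of 8 expert FFNs run per token
```

With k = 2 and n = 8, each token touches a quarter of the expert parameters; scaling n while holding k fixed grows capacity without growing per-token compute, which is the mechanism behind the cost argument above.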
After Losing Three Co-Founders, the Crisis at Mira's Company Continues: Two More Set to Leave
机器之心· 2026-01-16 08:13
Editor: Zhang Qian

Following Altman's palace-intrigue saga at OpenAI, his former partner Mira's week has been dramatic enough for a TV series of its own.

Yesterday, we reported major personnel changes at Thinking Machines Lab, the startup founded by former OpenAI CTO Mira Murati: co-founder and CTO Barret Zoph was fired, and another co-founder, Luke Metz, left alongside founding team member Sam Schoenholz, with all three returning to OpenAI. Counting PyTorch luminary Andrew Tulloch, who departed earlier, Thinking Machines Lab has now lost three co-founders; the venture is barely underway and its core team has already scattered.

Today the story kept unfolding: two more technical leads at the lab, infrastructure engineer Ian O'Connell and model-architecture researcher Lia Guy, were reported to be leaving as well, with the latter confirmed to be heading back to OpenAI.

Multiple outlets have described the episode as an OpenAI "talent raid" on Thinking Machines Lab. According to Wired, the poaching campaign had been in preparation inside OpenAI for weeks.

There are also entanglements in this story whose truth is hard to verify. According to people familiar with the matter ...
Beyond Quantization: A New Survey Deconstructs System-Level KV Cache Optimization Through a "Time-Space-Structure" Lens
机器之心· 2026-01-16 08:13
As LLMs push toward 1M-token contexts, the KV cache (key-value cache) has become the core bottleneck for inference-serving efficiency. Autoregressive generation forces the model to store the key-value states of past tokens (the KV cache) to avoid recomputation, but the cache's memory footprint balloons with context length, creating a severe memory bottleneck (a back-of-the-envelope calculation follows this summary).

Over the past two years, work on KV cache optimization has exploded, with scheduling, migration, compression, and other strategies appearing in quick succession. Existing surveys, however, mostly focus on the overall efficiency of LLM inference or serving and treat the KV cache only briefly as one submodule.

Recently, researchers from the University of Melbourne and Huazhong University of Science and Technology released an in-depth survey that, starting from an MLSys mindset, systematically organizes and analyzes KV cache optimization methods through a novel "time-space-structure" view of system behavior, and curates the related resources into a continuously maintained Awesome repository so researchers and practitioners can quickly locate and apply them.

What is "sKis"? To provide a more focused scope, the authors first define the boundary of sKis in the survey: in the inference-serving stage, with the KV cache as the core optimization target and without relying on model retraining or structural modification, improving throughput, latency ...
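The memory-bottleneck claim is easy to verify with arithmetic. Here is a back-of-the-envelope calculation assuming an illustrative Llama-2-7B-like configuration (32 layers, 32 KV heads, head dimension 128, fp16); the config numbers are assumptions, not tied to any model in the survey.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, dtype_bytes=2):
    """KV cache footprint: 2 tensors (K and V) per layer, each of shape
    (batch, n_kv_heads, seq_len, head_dim), at dtype_bytes per element."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * dtype_bytes

# 32 layers, 32 KV heads, head_dim 128, fp16:
per_token = kv_cache_bytes(32, 32, 128, 1, 1)       # 524,288 B = 512 KiB/token
at_1m = kv_cache_bytes(32, 32, 128, 1_000_000, 1)   # ~0.48 TiB at 1M tokens
print(per_token / 1024, at_1m / 1024**4)
```

At 512 KiB per token, a single 1M-token request needs roughly 488 GiB of KV cache, dwarfing the ~13 GB of fp16 weights themselves, which is exactly why serving-side scheduling, migration, and compression have become their own research area.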
Meituan Ships Another New Model: Eight Thinkers Working at Once — Together Worth a Zhuge Liang?
机器之心· 2026-01-16 08:13
Core Insights
- The article discusses Meituan's latest model, LongCat-Flash-Thinking-2601, which features 560 billion parameters and is built on an innovative MoE architecture [1][41][62]
- The model introduces a Heavy Thinking Mode that runs multiple reasoning paths simultaneously, improving the reliability and comprehensiveness of its conclusions (a minimal parallel-sampling sketch follows this summary) [4][48][62]
- LongCat-Flash-Thinking-2601 demonstrates significant gains in agent capabilities, achieving top performance on various benchmarks and generalizing better in out-of-distribution (OOD) scenarios [6][62]

Model Features
- The Heavy Thinking Mode activates eight independent "thinkers" to explore different reasoning paths, reducing errors and improving answer quality [4][48][50]
- The architecture supports parallel thinking and iterative summarization, enabling broader and deeper exploration of complex problems [41][50]
- A new method for evaluating agent-model generalization generates complex tasks from given keywords, testing the model's adaptability to unseen scenarios [8][10][11]

Performance Testing
- Hands-on testing showed the model's strength in logical reasoning, where the Heavy Thinking Mode produced reliable answers through collaborative reasoning [12][15][16]
- Its programming ability was tested by generating games such as Flappy Bird and Conway's Game of Life, showcasing versatility despite the high computational cost of running multiple thinkers [26][32]
- In a comparison with Claude Opus 4.5, LongCat-Flash-Thinking-2601 achieved 100% coverage of the task rubric, outperforming its competitor on complex tool dependencies [38][62]

Technological Innovations
- The model incorporates techniques such as environment scaling and multi-environment reinforcement learning, enhancing training and performance across diverse scenarios [41][51][53]
- Training injects noise to improve robustness, so the model performs well under the imperfect conditions common in the real world [60][62]
- The upcoming LongCat ZigZag Attention mechanism aims to support contexts of up to 1 million tokens, further extending the model's capabilities [63]

Development Timeline
- Meituan's AI model development has moved quickly, with consistent updates since the initial launch in September 2025 focused on response speed, logical reasoning, and multimodal capabilities [65][67]
- The company aims to build a model that effectively solves real-world problems, working toward a future in which "model as a service" becomes reality [68]
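Here is a minimal sketch of the parallel-thinking idea behind Heavy Thinking Mode: sample several independent reasoning paths and aggregate. Majority voting stands in for LongCat's iterative summarization, which the article says happens inside the model rather than via a vote; the stub generator and answer set are purely illustrative.

```python
import collections
import random

def heavy_thinking(generate, question, n_thinkers=8):
    # Run n independent reasoning paths and aggregate their final answers.
    # A simple majority vote replaces the model-internal summarization step.
    answers = [generate(question, seed=i) for i in range(n_thinkers)]
    votes = collections.Counter(answers)
    best_answer, _ = votes.most_common(1)[0]
    return best_answer, votes

# usage with a stub generator; a real setup would sample an LLM with
# temperature > 0 so that the eight paths actually differ
def stub_generate(question, seed):
    rng = random.Random(seed)
    return rng.choice(["42", "42", "41"])  # noisy but mostly correct paths

answer, votes = heavy_thinking(stub_generate, "6 * 7 = ?")
print(answer, votes)
```

This also makes the cost trade-off in the summary concrete: eight thinkers means roughly eight times the generation compute per query in exchange for more reliable answers.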
Just In: Geoffrey Hinton Becomes the Second Scientist to Surpass One Million Citations
机器之心· 2026-01-16 01:55
Core Viewpoint
- Geoffrey Hinton has officially become the second computer scientist in history to surpass 1 million citations on Google Scholar, a significant milestone in his academic career and in his contributions to artificial intelligence [1][3]

Group 1: Academic Achievements
- Hinton's citation count currently stands at 1,000,083, with an h-index of 192, reflecting his substantial impact on computer science and artificial intelligence [2]
- He is renowned for his work on backpropagation, which addressed the difficulty of training multilayer neural networks and laid the groundwork for the deep learning revolution [10]
- Together with Yoshua Bengio and Yann LeCun, Hinton received the 2018 Turing Award in recognition of their pivotal contributions to deep learning [13]

Group 2: Key Contributions
- Hinton's notable innovations include the Boltzmann Machine, the Restricted Boltzmann Machine, Deep Belief Networks, the Dropout technique, t-SNE for data visualization, Capsule Networks, and Knowledge Distillation, among others [14]
- His collaboration on AlexNet, which won the 2012 ImageNet competition, is considered a landmark demonstration of the power of deep learning [16]
- The paper "Deep Learning," co-authored by Hinton, has garnered over 100,000 citations and summarizes the evolution and principles of the field [16]

Group 3: Personal Background and Career
- Born into an academic family, Hinton's early life was marked by high expectations, which shaped his relentless pursuit of knowledge [5][8]
- He moved to Canada in the 1980s, building a long-term academic career at the University of Toronto and contributing significantly to the development of AI in Canada [9]
- In later years, Hinton has voiced concerns about the potential risks of AI, emphasizing the need for caution in its development [20]

Group 4: Legacy and Impact
- Hinton's citation milestone reflects not only his individual achievements but also the collaborative work of his students Alex Krizhevsky and Ilya Sutskever, who have made significant contributions to AI in their own right [29]
- The historical arc of Hinton's work mirrors humanity's broader quest to understand intelligence, underscoring the transformative impact of his research on modern AI [31]
Tencent Upgrades AngelSlim: The First Speculative-Sampling Training Framework to Unify LLM, VLM, and Speech Multimodality, Boosting Inference Speed by 1.8x
机器之心· 2026-01-16 01:55
Core Insights
- The article discusses the challenge of high inference cost and latency in large-model applications, underscoring the industry's need for cost reduction and efficiency gains [2]
- Speculative sampling is introduced as a novel inference-acceleration paradigm that delivers near-lossless speedups and has been gaining popularity in the industry (a minimal draft-and-verify sketch follows this summary) [2]
- Tencent's upgraded AngelSlim training framework builds on speculative sampling to boost performance across modalities, achieving significant inference speedups [2]

Group 1: AngelSlim and Speculative Sampling
- Speculative sampling uses a lightweight draft model to propose multiple candidate tokens, which the larger model then verifies, effectively parallelizing decoding and reducing latency [4]
- AngelSlim integrates a range of compression algorithms, including quantization and speculative sampling, and supports multimodal model training with speedups of 1.4x to 1.9x [4][6]
- The framework emphasizes deployment readiness: models trained with AngelSlim integrate seamlessly into existing serving frameworks such as vLLM and SGLang [7]

Group 2: Key Features of AngelSlim
- AngelSlim supports speculative-sampling training across all modalities, sharing core algorithms and engineering capabilities between them [6]
- The data-processing module provides a stable, reusable data foundation for multimodal training, including data resampling and preprocessing [12][13]
- The model module exposes a unified TargetModel interface, making it easy to integrate new model architectures without modifying core algorithms [18]

Group 3: Training Components and Performance
- The training module supports both online and offline training modes, accommodating different model sizes and memory constraints [20]
- The training process includes training-time testing, letting the model learn from its own predictions during training [21]
- Models trained with AngelSlim have demonstrated acceleration of 1.4x to 1.9x across a range of tasks under the reported conditions [25]

Group 4: Future Plans
- Future work will strengthen speculative-sampling capabilities through tooling and algorithmic advances, including offline hidden-state generation and deeper integration of multimodal features [30]
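For intuition, here is a minimal greedy draft-and-verify step for speculative sampling. The two models are stand-in callables mapping token ids to logits, not AngelSlim APIs; production systems use a probabilistic acceptance rule that exactly preserves the target distribution, which this greedy version omits for brevity.

```python
import torch

@torch.no_grad()
def speculative_decode_step(draft_model, target_model, prefix, k=4):
    # draft_model / target_model: callables mapping token ids of shape
    # (batch, len) to logits of shape (batch, len, vocab).
    tokens = prefix.clone()
    proposal = []
    for _ in range(k):  # k cheap autoregressive steps with the draft model
        next_tok = draft_model(tokens)[:, -1].argmax(-1, keepdim=True)
        proposal.append(next_tok)
        tokens = torch.cat([tokens, next_tok], dim=1)
    # one expensive target pass scores all k proposed positions in parallel
    verified = target_model(tokens)[:, prefix.shape[1] - 1:-1].argmax(-1)
    accepted = prefix
    for i, tok in enumerate(proposal):
        if verified[:, i:i + 1].equal(tok):
            accepted = torch.cat([accepted, tok], dim=1)  # target agrees
        else:
            # replace the first mismatch with the target's own token, stop
            accepted = torch.cat([accepted, verified[:, i:i + 1]], dim=1)
            break
    return accepted

# usage with random-logit stand-ins for the two models
vocab = 100
fake = lambda ids: torch.randn(ids.shape[0], ids.shape[1], vocab)
out = speculative_decode_step(fake, fake, torch.randint(0, vocab, (1, 5)))
```

The speedup comes from the single parallel verification pass: when the draft agrees with the target often (the goal of AngelSlim's draft training), several tokens are emitted per expensive target forward instead of one.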
Behind DeepSeek's Two Back-to-Back Papers Lies an Academic Relay Race
机器之心· 2026-01-16 00:42
Editors: Zhang Qian, Chen Chen

Halfway through January 2026, DeepSeek V4 has still not arrived, but its outline is coming into focus.

DeepSeek recently released two papers in quick succession: one tackles how information flows stably through the network, the other how knowledge is retrieved efficiently.

When the first paper (mHC) came out, readers opening it were baffled, declaring it incomprehensible and asking AI assistants to explain it to them every which way. Browsing the online discussion, we found that the most thorough way to understand it is to return to the research lineage and trace how researchers have passed the baton over the years. The same goes for understanding the second paper (Conditional Memory).

So we dug into analyses from researchers across the field and noticed something interesting: much of the work from DeepSeek and ByteDance's Seed team forms a "relay" — mHC makes major improvements on top of the Seed team's HC (Hyper-Connections), while Conditional Memory cites several Seed works, including OverEncoding and UltraMem (a minimal sketch of the hyper-connection idea follows).

Sorting out the relationships between these works should not only deepen our understanding of the DeepSeek papers but also reveal the directions in which large-model architecture innovation is breaking through.

In this article, drawing on our own ...
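For readers who want a feel for what mHC builds on, here is a heavily simplified sketch of the hyper-connection idea from the Seed team's HC work: the single residual stream of a standard Transformer block is expanded into several parallel streams with learnable read, write, and mixing weights, instead of the fixed x + f(x). The stream count, initialization, and exact parameterization below are illustrative assumptions, not the HC or mHC papers' precise design.

```python
import torch
import torch.nn as nn

class HyperConnection(nn.Module):
    """Minimal sketch: n parallel residual streams with learnable weights
    for reading the layer input, mixing the streams, and writing the
    layer output back, generalizing the standard residual connection."""
    def __init__(self, dim, n_streams=4):
        super().__init__()
        self.n = n_streams
        # read weights: how the streams combine into the layer's input
        self.w_in = nn.Parameter(torch.ones(n_streams) / n_streams)
        # write weights: how the layer output feeds back into each stream
        self.w_out = nn.Parameter(torch.ones(n_streams))
        # stream-to-stream mixing (identity init keeps each stream intact)
        self.w_res = nn.Parameter(torch.eye(n_streams))

    def forward(self, streams, layer):
        # streams: (n_streams, batch, seq, dim)
        h = torch.einsum('n,nbsd->bsd', self.w_in, streams)   # read
        out = layer(h)                                        # e.g. attention/FFN
        mixed = torch.einsum('mn,nbsd->mbsd', self.w_res, streams)
        return mixed + self.w_out.view(-1, 1, 1, 1) * out     # write back

# usage: expand the input once, run blocks, collapse streams at the end
ffn = nn.Sequential(nn.Linear(64, 256), nn.GELU(), nn.Linear(256, 64))
hc = HyperConnection(64, n_streams=4)
x = torch.randn(2, 16, 64)                   # (batch, seq, dim)
streams = x.unsqueeze(0).repeat(4, 1, 1, 1)  # replicate into 4 streams
streams = hc(streams, ffn)
y = streams.mean(0)                          # back to a single stream
```

Setting n_streams to 1 with unit weights recovers the ordinary residual connection, which is why this family of work is often framed as a generalization of the residual stream that mHC then refines further.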