机器之心
LeCun and Hassabis Are "Fighting": Does "General Intelligence" Even Exist?
机器之心· 2025-12-23 07:06
Editor: +0. The biggest commotion in the AI world today is LeCun and Hassabis "arguing" on X. It started with Yann LeCun's recent broadside: a while ago, a blogger posted a clip of a LeCun interview in which he said on a podcast that "general intelligence" does not exist and is complete nonsense. Podcast: https://www.youtube.com/watch?v=7u-DXVADyhc His argument is pointed: our illusion of "generality" comes from survivorship bias. We are only aware of the problems we can imagine, and overlook the vast space of tasks beyond our cognitive blind spots that we cannot even conceive of. Humans perform poorly at tasks such as chess, and animals outperform us in many other domains. In LeCun's own words: There is no such thing as general intelligence. The concept makes no sense, because it is really designed to refer to human-level intelligence. But human intelligence is in fact highly specialized. Yes, we deal well with the real world, with navigation and things of that kind, and we deal well with other people, because that is what we evolved for. But at chess we are quite bad. So there are actually many tasks we do poorly, and in these areas many …
Only 15% of Full Attention Needed! "RTPurbo", Alibaba's 5x Compression Scheme for Qwen3 Long-Context Inference, Is Here
机器之心· 2025-12-23 04:15
Why do large-model vendors offer 128K context windows yet price long inputs markedly higher? Why can Claude "swallow a whole book" while the official examples usually show documents of only a few thousand words? Why is every vendor racing toward "longer context" while the product managers actually shipping applications spend their days figuring out how to make user inputs shorter? These seemingly contradictory phenomena share one answer, long obscured by the technology's halo: long sequences are becoming the most expensive luxury in large-model applications. Under the prevailing Full Attention mechanism, compute cost grows quadratically with input length; once sequences get long, processing becomes "expensive and slow" (see Figure 1). Targeting this core problem, Alibaba's RTP-LLM team proposes a new post-training compression scheme: RTPurbo. Without degrading model quality, it compresses Attention computation by 5x (see Figure 2). Left, Figure 1: the compute-cost bottleneck of long-sequence Attention; right, Figure 2: RTPurbo dramatically reduces Attention compute. In short, RTPurbo takes a non-invasive compression approach: it identifies the long-range Attention Heads inside the LLM, retains global information only for those key heads, and for the remaining redundant heads simply drops remote tokens …
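The head-wise idea described above can be illustrated with a toy sketch. This is not RTPurbo's actual method (which is not detailed here): the scoring rule, the fixed local window, and the keep ratio are all assumptions made for illustration. Heads whose attention mass falls mostly on distant tokens are treated as "long-range" and kept intact; all other heads are restricted to a local window.

```python
# Hypothetical sketch of head-wise attention compression in the spirit of
# RTPurbo. The long-range scoring rule, local window size, and keep ratio
# are illustrative assumptions, not the published method.
import numpy as np

def long_range_score(attn, local_window=4):
    """Fraction of attention mass a head places outside a local window."""
    n = attn.shape[-1]
    q_idx = np.arange(n)[:, None]
    k_idx = np.arange(n)[None, :]
    remote = np.abs(q_idx - k_idx) > local_window
    return (attn * remote).sum() / attn.sum()

def compress_heads(attn_per_head, keep_ratio=0.15, local_window=4):
    """Keep full attention for the top `keep_ratio` long-range heads;
    restrict the rest to the local window (remote entries zeroed)."""
    scores = [long_range_score(a, local_window) for a in attn_per_head]
    n_keep = max(1, int(len(attn_per_head) * keep_ratio))
    keep = set(np.argsort(scores)[-n_keep:])
    out = []
    for h, a in enumerate(attn_per_head):
        if h in keep:
            out.append(a)
            continue
        n = a.shape[-1]
        mask = np.abs(np.arange(n)[:, None] - np.arange(n)[None, :]) <= local_window
        local = a * mask
        local /= local.sum(axis=-1, keepdims=True)  # renormalize each row
        out.append(local)
    return out, keep
```

With a 15% keep ratio, most heads never touch remote tokens, which is where the quadratic cost lives; only the few retained heads still see the whole sequence.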
VideoCoF: Bringing "Temporal Reasoning" into Video Editing, Achieving High-Precision Edits and Long-Video Extrapolation Without Masks!
机器之心· 2025-12-23 04:15
Core Insights
- The article discusses the innovative video editing framework VideoCoF, which addresses the dilemma of achieving high precision without relying on masks, a common limitation in existing models [2][4][28]
- VideoCoF utilizes a "See-Reason-Edit" approach inspired by large language models (LLMs), allowing for effective video editing with only 50k training samples, achieving state-of-the-art (SOTA) results [5][14][28]

Group 1: Pain Points and Innovations
- Existing video editing models face a trade-off between high precision and general applicability, with expert models requiring masks and general models lacking accuracy [3][7]
- VideoCoF introduces the Chain of Frames (CoF) mechanism, restructuring the video editing process into three stages: Seeing, Reasoning, and Editing, which enhances the model's ability to establish relationships between editing instructions and video regions [6][8]

Group 2: Technical Mechanisms
- The framework incorporates a unique RoPE (Rotary Position Encoding) alignment strategy, enabling the model to handle longer videos during inference while maintaining smooth motion and avoiding artifacts [11][16]
- VideoCoF demonstrates remarkable data efficiency, achieving superior performance with only 50k video pairs compared to baseline models that require significantly larger datasets [12][17]

Group 3: Experimental Validation
- In experiments, VideoCoF achieved an instruction-following score of 8.97, outperforming other models like ICVE (7.79) and VACE (7.47), indicating its superior understanding of user instructions [14][19]
- The success ratio of VideoCoF reached 76.36%, significantly higher than commercial models like Lucy Edit (29.64%) and ICVE (57.76%) [18][19]

Group 4: Reasoning Frame Design
- The design of the reasoning frame is crucial; experiments showed that a progressive gray mask significantly improved instruction-following scores compared to static masks [21][26]
- The introduction of the CoF mechanism and RoPE design led to notable improvements in both fidelity and temporal consistency for long video extrapolation [24]

Group 5: Practical Applications
- VideoCoF showcases versatile editing capabilities, including multi-instance removal, object addition, instance replacement, and localized style transfer, demonstrating its potential for various video editing tasks [29]
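The summary credits VideoCoF's long-video extrapolation to a RoPE alignment strategy. The exact scheme is not described here, so the sketch below only shows the standard rotary position embedding such strategies build on: each query/key feature pair is rotated by a position-dependent angle, which makes dot products depend only on relative position — the property that alignment tricks exploit to extend beyond training length.

```python
# Minimal standard RoPE (rotary position embedding) sketch. This is the
# well-known base mechanism, not VideoCoF's specific alignment strategy.
import numpy as np

def rope(x, positions, base=10000.0):
    """Rotate feature pairs of x (seq, dim) by position-dependent angles."""
    seq, dim = x.shape
    half = dim // 2
    freqs = base ** (-np.arange(half) / half)      # per-pair frequencies
    angles = positions[:, None] * freqs[None, :]   # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    # 2D rotation applied to each (x1_i, x2_i) feature pair
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)
```

The key property: `rope(q, p) · rope(k, r)` depends only on the offset `r - p`, so shifting all positions together leaves attention scores unchanged.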
What Technical Solutions Did the AI Talent Who Took Home Over 2 Million Yuan in Prize Money Deliver?
机器之心· 2025-12-23 04:15
Core Viewpoint
- The article emphasizes the significant opportunities for young individuals proficient in AI technology in China, particularly highlighted by the recent Tencent Advertising Algorithm Competition, which showcased innovative solutions to complex advertising challenges [2][5]

Group 1: Competition Overview
- The Tencent Advertising Algorithm Competition revealed that all top 10 teams received job offers from Tencent, with the champion team awarded a prize of 2 million yuan [2]
- The competition focused on a real-world problem in advertising that lacks a definitive solution, pushing participants to explore practical and innovative approaches [4][5]

Group 2: Advertising Challenges
- Advertising is often viewed negatively, but it is essential for the sustainability of many services and content, leading platforms to seek smarter, less intrusive advertising methods [7]
- The competition addressed how to make advertising more targeted and relevant, reducing unnecessary exposure to users [7][16]

Group 3: Methodologies in Advertising
- Two primary methodologies in advertising recommendation systems were discussed: traditional discriminative methods and emerging generative methods [8]
- Discriminative methods focus on matching user profiles with ads based on predefined features, while generative methods analyze user behavior over time to predict future interactions [9][14]

Group 4: Competition Challenges
- Participants faced challenges related to the scale of data, involving millions of ads and users, while having limited computational resources [21]
- The complexity of the data structure, including multimodal historical behavior data, added to the difficulty of modeling user interactions effectively [21][22]

Group 5: Champion Team Solutions
- The champion team, Echoch, introduced a three-tier session system, periodic encoding, and time difference bucketing to enhance the model's understanding of user behavior over time [28][29]
- They developed a unified model capable of switching strategies between predicting clicks and conversions, addressing the differing objectives of these actions [34][36]
- The team also incorporated randomness in ad encoding to improve exposure for less popular ads, significantly increasing their training focus [37]

Group 6: Runner-Up Team Solutions
- The runner-up team, leejt, tackled the challenge of handling large-scale data by compressing the vocabulary size and using shared embeddings for low-frequency ads [42]
- They implemented session segmentation and heterogeneous temporal graphs to manage the complexity of user behavior data effectively [44]
- The team optimized engineering processes to maximize GPU utilization, achieving significant performance improvements in model training [48]

Group 7: Industry Implications
- The competition highlighted the transition from discriminative to generative models in advertising, with Tencent already implementing generative models in its internal systems, yielding positive results reflected in financial data [51]
- Tencent plans to open-source the competition data to foster community development and explore the potential of real-time personalized advertising generation in the future [52]
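The "time difference bucketing" attributed to the champion team can be sketched in a few lines. The champion's actual bucket edges are not given, so the log-spaced bins below are an assumption: gaps between consecutive events in a user's behavior sequence are mapped to a small set of discrete IDs that a model can embed as a categorical "how long ago" feature.

```python
# Hedged sketch of time-difference bucketing for behavior sequences.
# Log-base-2 bucket edges and the bucket count are illustrative assumptions.
import math

def bucket_time_deltas(timestamps, num_buckets=10, base=2.0):
    """Map gaps between consecutive events (in seconds) to log-scale bucket IDs."""
    ids = [0]  # the first event has no predecessor; give it bucket 0
    for prev, cur in zip(timestamps, timestamps[1:]):
        gap = max(cur - prev, 1)                       # clamp zero/negative gaps
        b = min(int(math.log(gap, base)), num_buckets - 1)
        ids.append(b)
    return ids
```

Log-spaced bins give fine resolution for recent, rapid activity (seconds apart) while collapsing day- or week-long gaps into a few coarse buckets, which keeps the embedding table small.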
Technological Innovation + Ecosystem Empowerment: Duocai New Media Builds a New Benchmark for Smart Broadcasting
机器之心· 2025-12-23 04:15
Core Viewpoint
- Guizhou's IPTV has rapidly evolved from a late starter to a leading example of value creation in the industry, with a household coverage rate of nearly 89% and over 11.295 million users [1][2]

Group 1: Growth and Innovation
- Guizhou's Duocai New Media Co., Ltd. has transformed from a "follower" to a "benchmark for value creation" in less than ten years, driven by deep technological innovations [2]
- The growth of Duocai New Media is rooted in various technological advancements, including terminal architecture innovation, ultra-high-definition technology implementation, and upgrades to its operational systems [2][3]

Group 2: User Experience Enhancement
- Duocai New Media has shifted the broadcasting service focus from merely meeting functional needs to reconstructing user experience [3]
- The company has addressed long-standing user experience issues in IPTV, particularly in content search and recommendation efficiency, by integrating advanced algorithms and optimizing content retrieval technology [5][8]
- The time taken for content retrieval has been reduced from approximately 15 seconds to under 2 seconds, significantly improving user satisfaction [9]

Group 3: Operational Efficiency
- In 2024, Duocai New Media restructured its broadcasting control operations, integrating AI capabilities and automation technologies to create a comprehensive intelligent operation system [12]
- The automation of core operational processes has led to a 56% reduction in manual steps and nearly doubled the efficiency of basic business processing [14]
- The company has developed two related patents, further solidifying its technological barriers [16]

Group 4: Terminal and Ultra-High-Definition Experience
- Duocai New Media has restructured its terminal technology architecture to enhance user interaction and cross-device experience, addressing issues of single interaction and device compatibility [18]
- The introduction of a new product, "Mobile Super TV," has improved user engagement by 28% in its first month, allowing seamless content transition across devices [21]
- The ultra-high-definition service has been upgraded with a quality mode switch feature, enabling users to select between 4K and 1080P based on their network conditions [22]

Group 5: Ecosystem and Security Foundations
- Duocai New Media has established a dual middle-platform architecture to address fragmented business systems and data silos, reducing infrastructure costs by 25% while supporting real-time data processing for millions of users [26]
- The company has implemented a visual intelligent operation platform that monitors nearly 2,000 virtual servers and 1,000 live broadcast channels, achieving a 92% accuracy rate in preemptive risk identification [27]
- A comprehensive content quality monitoring system has been developed to ensure zero-fault operation of current products [28]

Group 6: Industry Recognition and Value Creation
- Duocai New Media has received multiple industry accolades, with a total of 6 patents and 107 software copyrights, contributing to national and provincial technology projects and standards [30]
- The company's technological innovations have created significant industry and social value, including targeted marketing and the promotion of cultural resources through 4K technology [32]
- The exploration of these innovations demonstrates that the digital transformation of the broadcasting industry is not just about technology but also about a user-centered, comprehensive reconstruction of services [32][33]
The Hottest, Most Comprehensive Survey of Agent Memory, Jointly Produced by NUS, Renmin University, Fudan, Peking University, and Others
机器之心· 2025-12-22 09:55
Core Insights
- The article discusses the evolution of memory systems in AI agents, emphasizing the transition from optional modules to essential infrastructure for various applications such as conversational assistants and code engineering [2]
- A comprehensive survey titled "Memory in the Age of AI Agents: A Survey" has been published by leading academic institutions to provide a unified perspective on the rapidly expanding yet fragmented concept of "Agent Memory" [2]

Forms of Memory
- The survey categorizes agent memory into three main forms: token-level, parametric, and latent memory, focusing on how information is represented, stored, and accessed [16][24]
- Token-level memory is defined as persistent, discrete units that are externally accessible and modifiable, making it the most explicit form of memory [18]
- Parametric memory involves storing information within model parameters, which can lead to challenges in retrieval and updating due to its flat structure [22]
- Latent memory exists in the model's internal states and can be continuously updated during inference, providing a compact representation of memory [24][26]

Functions of Memory
- The article identifies three core functions of agent memory: factual memory, experiential memory, and working memory [29]
- Factual memory aims to provide a stable reference for knowledge acquired from user interactions and environmental facts, ensuring consistency across sessions [31]
- Experiential memory focuses on accumulating knowledge from past interactions to enhance problem-solving capabilities, allowing agents to learn from experiences [32]
- Working memory manages information within single task instances, addressing the challenge of processing large and complex inputs [35]

Dynamics of Memory
- The dynamics of memory encompass three stages: formation, evolution, and retrieval, which form a feedback loop [38]
- The formation stage encodes raw context into more compact knowledge representations, addressing computational and memory constraints [40]
- The evolution stage integrates new memories with existing ones, ensuring coherence and efficiency through mechanisms like pruning and conflict resolution [43]
- The retrieval stage determines how memory can assist in decision-making, emphasizing the importance of effective querying strategies [41]

Future Directions
- The article suggests that future memory systems should be viewed as a core capability of agents rather than mere retrieval plugins, integrating memory management into decision-making processes [49][56]
- There is a growing emphasis on automating memory management, allowing agents to self-manage their memory operations, which could lead to more robust and adaptable systems [54][62]
- The integration of reinforcement learning into memory control is highlighted as a potential pathway for developing more sophisticated memory systems that can learn and optimize over time [58][60]
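The formation-evolution-retrieval loop the survey describes can be made concrete with a toy token-level memory store. Everything here is illustrative: word-truncation stands in for real summarization, duplicate-pruning for real conflict resolution, and bag-of-words overlap for an embedding retriever.

```python
# Toy sketch of the formation -> evolution -> retrieval loop for
# token-level agent memory. All names and mechanisms are illustrative
# stand-ins, not the survey's proposed system.
from collections import Counter

class MemoryStore:
    def __init__(self):
        self.entries = []  # token-level memory: discrete, inspectable units

    def form(self, raw_text, max_words=12):
        """Formation: compress raw context into a compact entry."""
        entry = " ".join(raw_text.split()[:max_words])
        self.evolve(entry)

    def evolve(self, entry):
        """Evolution: integrate a new entry, pruning exact duplicates."""
        if entry not in self.entries:
            self.entries.append(entry)

    def retrieve(self, query, k=2):
        """Retrieval: rank entries by word overlap with the query."""
        q = Counter(query.lower().split())
        scored = sorted(
            self.entries,
            key=lambda e: -sum((q & Counter(e.lower().split())).values()),
        )
        return scored[:k]
```

Even in this toy form, the three stages close into the feedback loop the survey highlights: what is formed and how it evolves determines what retrieval can later surface for decision-making.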
San Francisco Blackout Paralyzes Waymo Robotaxis; Tesla Comes Out the Big Winner
机器之心· 2025-12-22 08:17
Editor: 冷猫. Last Saturday, San Francisco suffered a large-scale power outage. The blackout appears to have been triggered by a fire at a Pacific Gas & Electric (PG&E) substation in the city; SFGate reported that around 120,000 customers were affected. What caught our attention, though, was not just the outage itself but a very strange phenomenon it produced: Waymo's autonomous vehicles were paralyzed en masse during the blackout. Waymo said on Saturday that it had temporarily suspended service in the city because of the outage, and a company spokesperson said service had been restored by late Sunday afternoon. In the meantime, large numbers of photos and videos posted to social media showed Waymo robotaxis stopped in the middle of roads and at intersections, indicating the paralysis was widespread rather than an isolated incident. Human drivers were either stuck behind them or forced to swerve around them, causing severe traffic jams. Nic Cruz Patane (@niccruzpatane): "This is insane, Waymo vehicles all over SF are bricked due to a power outage." …
The Era of RL-Powered 3D Generation Is Here! AR3D-R1, the First "R1-Style" Text-to-3D Reasoning Model, Debuts
机器之心· 2025-12-22 08:17
After its success in large language models and 2D image generation, reinforcement learning (RL) has for the first time been systematically extended to text-to-3D generation. Facing the dual challenges of 3D objects' higher spatial complexity (global geometric consistency) and fine-grained local textures, researchers have conducted the first systematic study of RL in 3D autoregressive generation. The challenges of applying RL to 3D generation: researchers from the Shanghai AI Laboratory, Northwestern Polytechnical University, the Chinese University of Hong Kong, Peking University, the Hong Kong University of Science and Technology, and other institutions propose AR3D-R1, the first RL-enhanced text-to-3D autoregressive model. The work systematically studies reward design, RL algorithms, and evaluation benchmarks, and proposes Hi-GRPO, a hierarchical reinforcement learning paradigm that optimizes 3D generation by separating global structural reasoning from local texture refinement. It also introduces a new benchmark, MME-3DR, for evaluating the implicit reasoning ability of 3D generation models. Experiments show AR3D-R1 achieves significant gains in Kernel Distance and CLIP Score, reaching 0.156 and 29.3 respectively. Paper title: Are We Ready for RL in Text-to-3D Generation? A Progressive Investigation. Code link: https://github. …
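Hi-GRPO's hierarchical split between global structure and local texture is not detailed in this excerpt, but the GRPO family it extends starts from a well-known primitive: group-relative advantages, where each sampled generation is scored against the statistics of its own sampling group rather than a learned value baseline. A minimal version of that primitive:

```python
# Standard GRPO-style group-relative advantages: normalize each sample's
# reward by its group's mean and standard deviation. This is the generic
# building block, not Hi-GRPO's hierarchical scheme.
def group_relative_advantages(rewards, eps=1e-8):
    """Return (r_i - mean) / (std + eps) for each reward in the group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]
```

A hierarchical variant could plausibly compute such advantages separately for a global-geometry reward and a local-texture reward, but that split is an assumption about Hi-GRPO, not something stated in the excerpt.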
How Do Agents Learn to "Imagine"? A Deep Dive into Three Technical Paradigms for Embedding World Models in Embodied Systems
机器之心· 2025-12-22 04:23
Core Insights
- The article discusses the integration of world models into embodied intelligent systems, emphasizing the shift from reactive loops to predictive capabilities [2][10]
- It highlights the importance of world models in enhancing sample efficiency, long-term reasoning, safety, and proactive planning in embodied agents [11][12]

Summary by Sections

Introduction to World Models
- Embodied intelligent systems traditionally relied on a "perception-action" loop, lacking the ability to predict future states [2]
- The introduction of world models allows agents to "imagine" future scenarios, enhancing their operational capabilities [10]

Research Overview
- A comprehensive survey from a research team involving multiple universities presents a framework for integrating world models into embodied systems [5][7]
- The paper categorizes existing research into three paradigms based on architectural integration [5][14]

Paradigm Classification
- The relationship between world models (WM) and policy models (PM) is described as a "coupling strength spectrum," ranging from weak to strong dependencies [15]
- Three categories are identified: Modular, Sequential, and Unified architectures, each with distinct characteristics [15][16]

Modular Architecture
- In this architecture, WM and PM operate as independent modules with weak coupling, focusing on causal relationships between actions and states [20]
- The world model acts as an internal simulator, allowing agents to predict outcomes based on potential actions [20]

Sequential Architecture
- This architecture involves a two-stage process where WM predicts future states, and PM executes actions based on those predictions [21]
- The world model generates a valuable goal, simplifying complex long-term tasks into manageable sub-problems [22][23]

Unified Architecture
- The unified architecture integrates WM and PM into a single end-to-end network, allowing for joint training and optimization [24][25]
- This configuration enables the agent to anticipate future states and produce appropriate actions without explicitly separating simulation and decision-making [25]

Future Directions
- The article outlines potential research directions, including the representation space of world models, structured intent generation, and the balance between interpretability and optimality [27][28][29]
- It emphasizes the need for effective alignment mechanisms to ensure performance while exploring unified world-policy model paradigms [29]
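The Modular paradigm described above, where the world model serves as an internal simulator the policy queries before acting, reduces to a simple pattern. The toy below uses made-up one-dimensional dynamics, a hypothetical goal, and an exhaustive candidate set purely to illustrate the control flow; real systems learn the dynamics and search far larger action spaces.

```python
# Toy illustration of the Modular world-model paradigm: the policy
# "imagines" each candidate action's outcome via the world model and
# picks the best one. Dynamics, goal, and actions are all made up.
def world_model(state, action):
    """Imagined dynamics: predict the next state for a candidate action."""
    return state + action

def plan(state, candidate_actions, goal):
    """Pick the action whose imagined outcome lands closest to the goal."""
    return min(candidate_actions,
               key=lambda a: abs(goal - world_model(state, a)))
```

The point of the weak coupling is visible even here: `world_model` and `plan` are separate functions, so either can be swapped out or trained independently.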
Chen Tianqiao's Shanda AI Research Tokyo Debuts at SIGGRAPH Asia, Unveiling Digital Human and World Model Results
机器之心· 2025-12-22 04:23
Core Insights
- Shanda Group's Shanda AI Research Tokyo made its debut at SIGGRAPH Asia 2025, focusing on "Interactive Intelligence" and "Spatiotemporal Intelligence" in digital human research, reflecting the long-term vision of founder Chen Tianqiao [1][10]
- The article discusses the systemic challenges leading to the "soul" deficiency in current digital human interactions, which is a significant barrier to user engagement despite substantial investments in visual effects [2][3]

Systemic Challenges
- **Long-term Memory and Personality Consistency**: Current large language models (LLMs) struggle with maintaining a stable personality over extended conversations, leading to "persona drift" and inconsistent narrative logic [3]
- **Lack of Multimodal Emotional Expression**: Digital humans often exhibit "zombie-face" phenomena, lacking natural micro-expressions and emotional responses, which diminishes immersive experiences [3]
- **Absence of Self-evolution Capability**: Most digital humans operate as passive systems, unable to learn from interactions or adapt to user preferences, hindering their evolution into truly intelligent entities [3]

Industry Consensus
- Experts at the SIGGRAPH Asia conference reached a consensus that the bottleneck in digital human development has shifted from visual fidelity to cognitive and interaction logic, emphasizing the need for long-term memory, multimodal emotional expression, and self-evolution as core competencies [13][10]

Introduction of Mio
- Shanda AI Research Tokyo introduced Mio (Multimodal Interactive Omni-Avatar), a framework designed to transform digital humans from passive entities into intelligent partners capable of autonomous thought and interaction [16][22]
- Mio's architecture includes five core modules: Thinker (cognitive core), Talker (voice engine), Facial Animator, Body Animator, and Renderer, which work together to create a seamless interaction loop [20][21]

Performance Metrics
- Mio achieved an overall Interactive Intelligence Score (IIS) of 76.0, representing an 8.4 point improvement over previous technologies, setting a new performance benchmark in the industry [25][22]

Future Outlook
- The development of Mio signifies a paradigm shift in digital human technology, moving focus from static visual realism to dynamic, meaningful interactive intelligence, with potential applications in virtual companionship, interactive storytelling, and immersive gaming [22][25]
- Shanda AI Research Tokyo has made the complete technical report, pre-trained models, and evaluation benchmarks of the Mio project publicly available to foster collaboration in advancing this field [28]