机器之心
Sentence-level sourcing plus generative attribution: C²-Cite reshapes LLM trustworthiness
机器之心· 2025-12-03 00:06
In an era of rapid AI progress, large language models have worked their way into every corner of our work and lives. Yet making AI-generated content trustworthy and traceable has remained a central concern for both academia and industry. Imagine asking ChatGPT a question and getting not only an answer but also, as in an academic paper, a source citation for every sentence — this is the core problem that "attributed large language models" set out to solve. The Baijia AI team at Beijing University of Posts and Telecommunications (BUPT), together with Xiaomi's LLM team, proposes C²-Cite, an attributed large language model that pioneers context-aware citation generation. It not only lets the model automatically attach precise source citations while generating content, but also ensures that the generated text is closely semantically aligned with the cited external knowledge, so that every statement has an attribution basis and is deeply coordinated with its references, tackling the trustworthiness of LLM output at its root. The work has been accepted by WSDM 2026, a top international conference.

Addressing key shortcomings of existing attribution models, C²-Cite introduces a "context-aware" mechanism that turns citation markers from passive placeholders into special tokens carrying contextual semantics, significantly improving both citation quality and answer accuracy.

Paper title: C²-Cite: Contextual-Aware Citation Generation for Attributed Large Language ...
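The article only sketches the mechanism, but the core idea of a citation marker that carries contextual semantics (rather than a bare placeholder such as "[1]") can be illustrated in a few lines. The Python sketch below is purely illustrative and is not the C²-Cite implementation: the gpt2 model id, the `<cite:N>` marker format, and the mean-pooled embedding initialization are all assumptions made for the example.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gpt2"  # small stand-in model; any causal LM would do
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Hypothetical cited passages, keyed by the citation markers we introduce
passages = {
    "<cite:1>": "The Eiffel Tower is 330 metres tall.",
    "<cite:2>": "It was completed in 1889.",
}

# Register each citation marker as its own special token
tok.add_special_tokens({"additional_special_tokens": list(passages)})
model.resize_token_embeddings(len(tok))

# Seed each marker's embedding with the mean input embedding of its passage,
# so the marker carries contextual semantics instead of being a blank placeholder
emb = model.get_input_embeddings()
with torch.no_grad():
    for marker, text in passages.items():
        ids = tok(text, return_tensors="pt").input_ids[0]
        emb.weight[tok.convert_tokens_to_ids(marker)] = emb.weight[ids].mean(dim=0)
```

Seeding each marker's embedding from its passage gives the decoder a token whose representation already encodes the cited context, which is the spirit of "context-aware" citation tokens; the actual method in the paper is more involved.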
Now Altman is anxious: OpenAI urgently triggers a "Code Red"
机器之心· 2025-12-02 09:18
Machine Heart report. Editors: Chen Chen, +0

ChatGPT's third anniversary has just passed, yet Sam Altman looks unusually anxious. Recall the end of 2022: when ChatGPT went viral around the world, Google CEO Sundar Pichai issued an internal Code Red to counter the threat from OpenAI. Ironically, only three years later, attacker and defender have swapped places. Google, so flustered back then, is now regrouping at an astonishing pace: Gemini keeps iterating, its multimodal capabilities have improved dramatically, and its image-generation model Nano Banana Pro has gone viral across the board, while companies such as Anthropic and xAI are catching up fast along different technical directions. On OpenAI's side, the picture looks more delicate. Over the past year OpenAI's research directions have kept spreading: reasoning models, video, multimodality, agents, browsers... its footprint can be found on nearly every hot technical path. Yet the pattern is often "peak at launch": momentum fades quickly afterwards and sustained product traction fails to build. Sora, for example, stunned the internet at launch, but a publicly usable product was slow to follow; the GPT Store was once seen as the breakout point for an ecosystem, but after launch app quality was uneven and interest fell far short of expectations... This strong-debut, weak-follow-through rhythm has left many once highly anticipated directions stuck at the concept or demo stage, never growing into ...
Welcoming the era of "everything can be RAG": a new survey maps a vast unexplored space of 50+ multimodal combinations
机器之心· 2025-12-02 09:18
Core Insights
- The article discusses the emergence of Multimodal Retrieval-Augmented Generation (MM-RAG) as a new field, highlighting its potential applications and the current state of research, which is still in its infancy [2][5][17]
- A comprehensive survey published by researchers from Huazhong University of Science and Technology, Fudan University, China Telecom, and the University of Illinois at Chicago covers nearly all possible combinations of input and output modalities in MM-RAG [4][17]

Summary by Sections

Overview of MM-RAG
- MM-RAG is an evolution of traditional Retrieval-Augmented Generation (RAG) that incorporates multiple modalities such as text, images, audio, video, code, tables, knowledge graphs, and 3D objects [2][4]
- Current research primarily focuses on a limited set of modality combinations, leaving many potential applications unexplored [2][5]

Potential Combinations
- The authors map out a vast space of possible input-output modality combinations, finding that of the 54 combinations they enumerate, only 18 have existing research [5][6]
- Notably, combinations such as "text + video as input, generating video as output" remain largely untapped [5]

Classification Framework
- A new classification framework for MM-RAG systematically organizes existing research and clearly presents the core technical components of different MM-RAG systems [6][15]
- This framework serves as a reference for future research and development in the field [6][15]

MM-RAG Workflow
- The MM-RAG workflow is divided into four key stages (a minimal sketch follows at the end of this entry):
  1. Pre-retrieval: organizing data and preparing queries [11]
  2. Retrieval: efficiently finding relevant information in a multimodal knowledge base [12]
  3. Augmentation: integrating the retrieved multimodal information into the large model [13]
  4. Generation: producing high-quality multimodal outputs based on the input and the augmented information [14][15]

Practical Guidance
- The survey provides a one-stop guide for building MM-RAG systems, covering training, evaluation, and application strategies [17][18]
- It discusses training methods to maximize retrieval and generation capabilities, summarizes existing evaluation metrics, and explores potential applications across various fields [18]
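To make the four stages concrete, here is a minimal, self-contained Python sketch of the workflow. Everything in it (the toy knowledge base, the keyword "retriever", the stubbed generator) is an assumption for illustration; the survey itself prescribes no specific API.

```python
from dataclasses import dataclass

@dataclass
class Document:
    modality: str   # e.g. "text", "image", "table"
    content: str    # payload or a reference (path / URI) to it

KNOWLEDGE_BASE = [
    Document("text", "The Eiffel Tower is 330 m tall."),
    Document("image", "eiffel_tower.jpg"),
]

def pre_retrieve(query: str) -> str:
    """Stage 1: organise data and prepare the query (here: trivial normalisation)."""
    return query.strip().lower()

def retrieve(query: str, top_k: int = 2) -> list[Document]:
    """Stage 2: find relevant items; a real system would use multimodal embeddings."""
    hits = [d for d in KNOWLEDGE_BASE
            if any(word in d.content.lower() for word in query.split())]
    return hits[:top_k]

def augment(query: str, docs: list[Document]) -> str:
    """Stage 3: fold the retrieved multimodal evidence into the model prompt."""
    evidence = "\n".join(f"[{d.modality}] {d.content}" for d in docs)
    return f"Question: {query}\nEvidence:\n{evidence}\nAnswer:"

def generate(prompt: str) -> str:
    """Stage 4: call the multimodal LLM; stubbed out here."""
    return f"(model output conditioned on)\n{prompt}"

query = pre_retrieve("How tall is the Eiffel Tower?")
print(generate(augment(query, retrieve(query))))
```

A production system would replace the keyword match with multimodal embedding retrieval and the stub with a real multimodal LLM, but the stage boundaries stay the same.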
AAAI 2026 Oral: Mininglamp Technology pioneers "information-bottleneck dynamic compression" for sparse data, SOTA in both accuracy and speed
机器之心· 2025-12-02 06:47
Core Insights
- The article discusses the challenges of "Efficient AI," particularly as transformer models grow larger and more general while remaining computationally heavy for edge devices such as robots [1][2]
- A paper titled "CompTrack," accepted for oral presentation at AAAI 2026, asks whether models need to process all of their input and shows how compression can significantly reduce computational cost while maintaining or even improving model performance [2][14]

Redundancy Challenges
- Current AI models face "Dual-Redundancy" challenges:
  1. Spatial Redundancy: irrelevant background points and blank areas are processed, wasting computation and degrading accuracy [3][5]
  2. Informational Redundancy: even within the relevant foreground target, much of the information is redundant and low-value, leading to inefficiency [5][7]

CompTrack Framework
- CompTrack proposes an end-to-end framework that tackles both types of redundancy at once [7]
- The framework includes:
  1. A Spatial Foreground Predictor (SFP) that filters out low-information background noise using information-entropy theory [8]
  2. An Information Bottleneck-guided Dynamic Token Compression (IB-DTC) module that dynamically compresses informational redundancy within the foreground [10][11]

Efficiency and Performance
- The IB-DTC module matters for Efficient AI because it (a toy sketch follows at the end of this entry):
  1. Builds on the Information Bottleneck principle, retaining only the information valuable for prediction [11]
  2. Uses online Singular Value Decomposition (SVD) to set the compression rate dynamically from the intrinsic rank of the input [12]
  3. Remains end-to-end trainable by using the SVD result as a guide toward the optimal compression rate [12]

Application and Results
- CompTrack is applied to challenging 3D point-cloud tracking tasks, demonstrating that systematically compressing informational redundancy is highly effective [14]
- The framework not only improves efficiency but also sets a precedent for tackling informational redundancy in other areas, including sensor fusion in robotics and multimodal processing in vision-language models [14][15]
- CompTrack runs in real time at 80 FPS on an RTX 3090, surpassing state-of-the-art methods while cutting the computational load to 0.94G FLOPs [15]
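As a rough illustration of the two modules, the sketch below filters tokens by a hypothetical foreground score and then compresses the survivors into a number of summary tokens chosen from the SVD spectrum. It is a toy stand-in under stated assumptions, not the CompTrack code: the 0.95 energy threshold, the score-based filter, and the random data are inventions for the example.

```python
import numpy as np

def foreground_filter(tokens: np.ndarray, scores: np.ndarray, keep_ratio: float = 0.5) -> np.ndarray:
    """Spatial step: keep the tokens whose (hypothetical) foreground scores are highest."""
    order = np.argsort(-scores)
    keep = order[: max(1, int(keep_ratio * len(order)))]
    return tokens[keep]

def svd_guided_compress(tokens: np.ndarray, energy: float = 0.95) -> np.ndarray:
    """Informational step: pick the smallest rank r that captures `energy` of the
    spectral energy (a proxy for the intrinsic rank) and return r summary tokens."""
    u, s, vt = np.linalg.svd(tokens, full_matrices=False)
    cum = np.cumsum(s ** 2) / np.sum(s ** 2)
    r = int(np.searchsorted(cum, energy)) + 1
    return np.diag(s[:r]) @ vt[:r, :]          # r x D summary of the N x D foreground tokens

rng = np.random.default_rng(0)
tokens = rng.normal(size=(256, 64))            # e.g. 256 point-cloud tokens of dimension 64
foreground = foreground_filter(tokens, scores=rng.random(256))
compressed = svd_guided_compress(foreground)
print(tokens.shape, "->", foreground.shape, "->", compressed.shape)
```

In the real tracker the keep decision comes from an entropy-based predictor and the compression is learned end-to-end; the sketch only mirrors the data flow.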
Five years on, Transformers v5 has finally arrived
机器之心· 2025-12-02 06:47
Core Insights
- The article discusses the release of v5.0.0rc0, the first release candidate of the Transformers library, marking the transition from version 4 to version 5 after a five-year technical cycle [2]
- Usage has grown dramatically: daily downloads have risen from 20,000 at the time of the v4 release to over 3 million today, and total installations have surpassed 1.2 billion [2]
- The core focus of the v5 update is simplicity, pre-training, interoperability with high-performance inference engines, and making quantization a core feature [2][3]

Evolution and Features
- v5 establishes PyTorch as the sole core backend and emphasizes four dimensions of evolution: extreme simplicity, the shift from fine-tuning to pre-training, interoperability with high-performance inference engines, and enhanced quantization capabilities [2]
- The team aims for a clean, clear approach to model integration, promoting broader standardization and stronger generality [4]
- Over the past five years an average of 1-3 new models has been added every week, with the goal of becoming the single trusted source for model definitions [4]

Modular Design and Tools
- Hugging Face has advanced a modular design approach that simplifies maintenance, speeds up integration, and fosters community collaboration [6]
- The AttentionInterface provides a centralized abstraction layer for attention mechanisms, streamlining the management of common auxiliary functions [8]
- Tools are being developed to detect similarities between new models and existing architectures, aiming to automate conversion of models into the Transformers format [9][10]

Training Enhancements
- v5 expands support for pre-training, with redesigned model initialization and support for optimized forward- and backward-pass operators [15][16]
- Hugging Face continues to work closely with fine-tuning tools in the Python ecosystem and maintains compatibility with tools in the JAX ecosystem [17]

Inference Improvements
- Inference is a key focus of the v5 update, which introduces dedicated kernels, cleaner defaults, new APIs, and optimized support for inference engines [18][19]
- v5 aims to complement specialized inference engines rather than replace them, ensuring compatibility with engines such as vLLM, SGLang, and TensorRT-LLM [21]

Local Deployment and Quantization
- The team collaborates with popular inference engines so that Transformers can be used as a backend, increasing the value of models added to Transformers [23]
- Quantization is positioned as a core capability of Transformers, kept compatible with its major features and providing a reliable framework for training and inference (a hedged usage sketch follows at the end of this entry) [27]
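As a concrete, hedged example of quantization as a first-class feature, here is what a 4-bit model load looks like with today's v4-style API; v5 may rename or re-default parts of this, the model id is only illustrative, and the snippet assumes the bitsandbytes package and a CUDA GPU are available.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 4-bit weight quantization handled inside Transformers via bitsandbytes
quant_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model_id = "Qwen/Qwen2.5-7B-Instruct"   # illustrative; any causal LM on the Hub works
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_cfg,
    device_map="auto",
)

inputs = tokenizer("Transformers v5 focuses on", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```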
Huawei's new open-source release: a diffusion language model breaks through 32K context and unlocks "slow thinking"
机器之心· 2025-12-02 06:47
Core Insights
- The article discusses the paradigm shift in text generation from auto-regressive models to diffusion language models, the limitations of long-sequence training, and Huawei's recent advance with the openPangu-R-7B-Diffusion model [1][14]

Model Performance
- openPangu-R-7B-Diffusion sets new state-of-the-art (SOTA) records across benchmarks, showing superior performance in general capability, mathematical reasoning, and code generation compared with peer models [2][3]
- On the MMLU benchmark, openPangu-R-7B-Diffusion scores 81.66, surpassing LLaDA 2.0-mini-preview by 9.17 points [2]
- On mathematical reasoning (MATH) it reaches 84.26, leading similar models by a wide margin [3]

Architectural Innovations
- The model adopts an innovative causal attention-mask design that enables a seamless migration from auto-regressive to BlockDiffusion models, addressing the architecture-adaptation challenge (a hedged sketch of such a mask follows at the end of this entry) [5][7]
- By retaining causal-attention characteristics, the model lowers adaptation cost and maximizes compatibility with the pre-trained knowledge of auto-regressive models [8][10]

Training and Inference Efficiency
- The training strategy of openPangu-R-7B-Diffusion optimizes the BlockDiffusion approach, improving the efficiency of the model [10]
- The model offers dual-mode decoding, letting users trade off generation quality against speed through different sampling settings [15]

Conclusion
- The release of openPangu-R-7B-Diffusion marks a significant step forward in diffusion models' ability to handle complex long texts, showing that they can deliver both speed and depth [14]
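The article does not publish the exact mask, but a common way to reconcile block diffusion with a causal backbone is a block-wise causal attention mask: full attention inside each block, causal attention across blocks. The sketch below is that generic construction, offered as an assumption about what "retaining causal attention" could look like rather than as Huawei's implementation.

```python
import torch

def block_causal_mask(seq_len: int, block_size: int) -> torch.Tensor:
    """Boolean mask that is True where attention is allowed: bidirectional
    within a block, causal across blocks."""
    block_id = torch.arange(seq_len) // block_size           # block index of each position
    return block_id.unsqueeze(1) >= block_id.unsqueeze(0)    # i attends to j iff block(j) <= block(i)

mask = block_causal_mask(seq_len=8, block_size=4)
print(mask.int())
# Positions 0-3 attend only within block 0; positions 4-7 attend to blocks 0 and 1,
# so no position ever sees a later block and the cross-block structure stays causal.
```

With such a mask, weights pre-trained under a strictly causal mask remain usable, because no position ever attends to a later block.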
NeurIPS 2025 | CAKE: an LLM-driven recipe for Bayesian optimization that makes black-box optimization smarter and more efficient
机器之心· 2025-12-02 06:47
This article is reposted from the School of Artificial Intelligence (SAI), The Chinese University of Hong Kong, Shenzhen.

In scientific and engineering practice, one often faces optimization problems over functions that are expensive to compute and time-consuming to evaluate, such as hyperparameter tuning for complex machine learning models or the design of new materials. Bayesian Optimization (BO) has proven effective for such "black-box" problems. However, its performance depends heavily on the choice of the internal surrogate model; in particular, when a Gaussian Process (GP) is used as the surrogate, the choice of kernel function is critical. If the kernel does not match the characteristics of the problem, the optimization may converge slowly or even fail to reach a satisfactory result. The work has been accepted to NeurIPS 2025 (Conference on Neural Information Processing Systems) under the title "Adaptive Kernel Design for Bayesian Optimization Is a Piece of CAKE with LLMs". It proposes a groundbreaking framework that uses the reasoning and generative abilities of large language models (LLMs) to automatically and dynamically design the optimal GP kernel during the optimization process. This research, toward building smarter and more efficient ...
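To make the role of kernel design concrete, the sketch below runs a small Bayesian-optimization loop in which the GP kernel is re-selected at every step by marginal likelihood over a fixed candidate set; in CAKE that design step is instead performed by an LLM. The toy objective, the candidate kernels, and the UCB acquisition are illustrative assumptions, not the paper's setup.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, Matern, RationalQuadratic

def objective(x):                        # expensive black box (toy stand-in)
    return np.sin(3 * x) + 0.5 * x

candidates = [RBF(), Matern(nu=2.5), RationalQuadratic()]
X = np.array([[0.2], [1.5], [2.8]])      # initial design
y = objective(X).ravel()

for step in range(5):
    # "kernel design" step: pick the candidate that best fits the data so far
    gps = [GaussianProcessRegressor(kernel=k, normalize_y=True).fit(X, y) for k in candidates]
    gp = max(gps, key=lambda g: g.log_marginal_likelihood_value_)
    # acquisition: simple upper confidence bound evaluated on a dense grid
    grid = np.linspace(0, 3, 200).reshape(-1, 1)
    mu, sd = gp.predict(grid, return_std=True)
    x_next = grid[np.argmax(mu + 1.5 * sd)]
    X = np.vstack([X, [x_next]])
    y = np.append(y, objective(x_next[0]))

print("best observed x:", X[np.argmax(y)].item(), "value:", y.max())
```

Swapping the marginal-likelihood selection for an LLM that reasons about the problem description and proposes (possibly composite) kernels is, in spirit, what the CAKE framework adds on top of this loop.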
Is architectural decoupling necessary for unified multimodal models? The new AIA loss says no
机器之心· 2025-12-02 05:07
Over the past year, unified understanding-and-generation models have advanced rapidly. The central challenge of the task is that visual understanding and generation conflict with each other within the network layers. Early fully unified models (such as Emu3) lagged far behind single-task methods; Janus-Pro and BAGEL narrowed that gap dramatically by decoupling the model architecture step by step, and later methods went as far as directly stitching existing understanding and generation models together to squeeze out the best possible performance. Researchers from MMLab at The Chinese University of Hong Kong and Meituan believe that unified models will reach single-task performance in the near future, but this also prompted them to ask: is trading architectural decoupling for performance gains really the right approach, does it betray the original purpose of unified models, what is the underlying reason it improves performance, and is this approach really necessary for unified models?

"The original purpose of unified models" and "the drawbacks of architectural decoupling"

The original purpose of unified understanding-and-generation models is to improve single-task performance through a transparent, well-grounded interleaved image-text reasoning process. For example, when solving a maze a unified model can generate an image for every step; when working on a geometry problem it can draw auxiliary lines on the figure; and when generating an image it can think while drawing about whether anything looks implausible and correct it automatically. These abilities are exactly what current unified-model benchmarks such as Uni-MMMU focus on, and they are the reason the area became a field in its own right. The researchers first studied how unified models with different architectures ...
The Qianwen App has just packed Google's and OpenAI's paid signature features into your phone, and for free?
机器之心· 2025-12-02 05:07
Core Insights
- The article discusses the major update to the Qianwen App, which integrates two advanced visual models, Qwen-Image and Wan 2.5, and makes them accessible to ordinary users without technical expertise [1][4][36]

Qwen-Image Model
- Qwen-Image is recognized for strong visual-logic understanding, accurately interpreting complex spatial relationships and geometric structures and outperforming many existing models [8][9][65]
- The model excels at maintaining identity consistency during image editing, which is crucial for users who need reliable results in complex scenarios [18][32]
- Qwen-Image performs impressively on multi-image fusion tasks, seamlessly integrating different visual elements while preserving their distinctive characteristics [29][32]

Wan 2.5 Model
- Wan 2.5 represents a breakthrough in AI video generation, offering native audio-visual synchronization and removing the need for separate audio processing [34][68]
- The model can generate videos with original music and dialogue, showing its ability to understand and integrate multiple modalities [43][70]
- Wan 2.5's architecture processes text, image, video, and audio signals simultaneously, enabling complex creative tasks that were previously difficult [68][70]

User Accessibility and Integration
- Integrating these models into the Qianwen App removes barriers for users, who can create high-quality visual and audio content without coding skills or expensive hardware [4][75]
- The app serves as a comprehensive multi-modal generation platform, letting users move smoothly from image creation to video production within a single interface [45][47]
- This development reflects Alibaba's long-term investment in a robust ecosystem of multi-modal generative models, positioning it as a leader in the market for AI creative tools [72][74]
The identity of the mysterious chart-topping video model has just been revealed: it was "David" all along
机器之心· 2025-12-02 00:17
Core Insights
- Runway's Gen-4.5 has emerged as the leading state-of-the-art (SOTA) video generation model, setting new industry standards in motion quality, prompt adherence, and visual realism [1][3][8]

Model Performance
- Gen-4.5 achieves an ELO score of 1247, surpassing competitors such as Veo 3/3.1, Kling 2.5, and Sora 2 Pro, with unprecedented visual realism and creative-control capabilities [3][6][8]
- The model maintains speed and efficiency while delivering significant quality improvements, making advanced video generation accessible to creators of all scales [8][20]

Key Features
- Precise prompt adherence: Gen-4.5 demonstrates exceptional physical accuracy and visual detail, accurately portraying object motion, fluid dynamics, and intricate surface details [11][12]
- Expressive characters: the model can depict nuanced emotions and lifelike facial detail, enhancing character representation [14]
- Stylized control and visual consistency: it supports a wide range of aesthetic styles, from photorealism to stylized animation, while maintaining a coherent visual language [16][18]

Deployment and Limitations
- Gen-4.5 is built on NVIDIA architecture, with training efficiency and inference speed optimized through collaboration with NVIDIA [20]
- Despite its advances, Gen-4.5 still exhibits limitations common to video generation models, such as weaknesses in causal reasoning and object permanence [21][22]