No more "idle waiting" for search Agents: an RUC team uses a diffusion model to "think while searching", reasoning during search round-trips for a 15% speedup with no drop in performance
量子位· 2026-03-02 09:09
Contributed by the GSAI IIR & GSAI-ML teams to QbitAI | WeChat account QbitAI

Traditional search Agents share a problem: they finish thinking before they search, sit idle while the search runs, and only resume thinking once the results are back.

It's like going to a restaurant and insisting on studying the entire menu before calling the waiter, staring blankly while the waiter goes off to place the order, and only starting to think about the next dish after the food arrives. That's not how people actually eat: you call the waiter while still reading the menu, and while the order is being placed you keep deciding what to order next.

In the paper DLLM-Searcher, a team from Renmin University of China is the first to teach a diffusion large language model (dLLM) this kind of "multitasking".

First, where the problem lies

Today's mainstream search Agents, whether Search-R1 or R1Searcher, are built on the ReAct framework, whose execution flow is strictly serial: think → call a tool → wait for results → think again → call a tool again → wait again… Within each round, the "thinking" and "tool calling" are emitted by the model token by token, left to right, and while the search engine is returning results the model sits completely idle. Over multiple rounds the latencies pile up, and the user experience suffers badly.

The team ran the numbers: in multi-hop QA tasks, this serial waiting consumes a large share of the end-to-end time.

So can the model keep thinking about its next step while it waits for search results?

An autoregressive model can't. Because its attention is causal, it must …
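The overlap described above can be illustrated with a minimal sketch (not the paper's implementation): a search call is launched asynchronously, and stand-in "thinking" steps continue while it is in flight, so the search round-trip no longer adds to end-to-end time. The `search` and `think` functions and their timings are invented placeholders.

```python
from concurrent.futures import ThreadPoolExecutor
import time

def search(query):
    """Stand-in for a search-engine call with network latency."""
    time.sleep(0.5)          # simulated round-trip delay
    return f"results for {query!r}"

def think(step):
    """Stand-in for one chunk of model reasoning."""
    time.sleep(0.1)          # simulated decode time
    return f"thought-{step}"

# Serial ReAct style: think, search, wait idle, then think again.
t0 = time.perf_counter()
think(0)
serial_result = search("multi-hop question")
think(1)
serial_time = time.perf_counter() - t0

# Overlapped style: fire the search, keep thinking while it is in flight.
t0 = time.perf_counter()
with ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(search, "multi-hop question")
    think(0)
    think(1)                 # reasoning continues during the round trip
    overlap_result = future.result()
overlap_time = time.perf_counter() - t0
```

With these toy timings the serial path costs roughly 0.7 s while the overlapped path costs roughly 0.5 s, since the two thinking steps hide inside the search round-trip.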
7B diffusion language model hits 1000+ tokens/s on a single sample! SJTU and Huawei release LoPA
机器之心· 2025-12-31 08:11
Core Insights
- The article discusses a breakthrough in the field of diffusion large language models (dLLMs) through a new decoding algorithm called LoPA (Lookahead Parallel Decoding), which significantly enhances inference speed and parallelism [2][3][36].

Group 1: LoPA Algorithm Features
- LoPA achieves a high degree of parallelism, increasing the tokens generated per step (TPF) from 3.1 to 10.1, thus surpassing traditional methods [3][7].
- The algorithm is plug-and-play, requiring no retraining or fine-tuning of the model [8].
- It introduces a lookahead parallel decoding mechanism that actively explores different token filling orders to avoid local optima [9].
- The accompanying LoPA-Dist system maximizes hardware utilization by supporting both CUDA and Ascend platforms [10].

Group 2: Performance Metrics
- LoPA has demonstrated a single-sample throughput of 1073.9 tokens/s on the Huawei Ascend 910C platform, significantly outperforming baseline models [3][33].
- In experiments, LoPA integrated with D2F-Dream achieved a TPF of 10.1 on the GSM8K benchmark, drastically reducing the total number of inference steps [28][31].
- The system's performance indicates that it can effectively convert algorithmic parallelism into substantial real-time acceleration, achieving over 1000 tokens/s on dedicated engines [34].

Group 3: System Design and Optimization
- The LoPA-Dist distributed inference system employs a new branch parallelism strategy, which can be combined with existing tensor parallelism methods [25].
- It is optimized for different hardware platforms, with LoPA-Dist-NV designed for low-latency scenarios and LoPA-Dist-Ascend aimed at high-throughput service environments [26].

Group 4: Future Directions
- The team plans to explore the application of LoPA in other dLLM architectures, such as SDAR, to further advance efficient generative models [36].
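The TPF metric above can be made concrete with a toy sketch (this is not LoPA itself, which additionally explores alternative filling orders): a decoder that, in each step, commits every masked position whose confidence clears a threshold, compared against a strictly one-token-per-step baseline. The confidence model and threshold are made up for the illustration.

```python
import random

random.seed(0)

def toy_confidences(masked_positions):
    """Stand-in for one dLLM forward pass: a confidence per masked slot."""
    return {p: random.random() for p in masked_positions}

def decode(seq_len, parallel, tau=0.5):
    """Fill all positions; return tokens generated per step (TPF)."""
    masked = set(range(seq_len))
    steps = 0
    while masked:
        conf = toy_confidences(masked)
        if parallel:
            # Commit every position clearing the threshold in one step.
            accept = {p for p, c in conf.items() if c >= tau}
            if not accept:                       # fall back to the best slot
                accept = {max(conf, key=conf.get)}
        else:
            accept = {max(conf, key=conf.get)}   # one token per step
        masked -= accept
        steps += 1
    return seq_len / steps

tpf_serial = decode(64, parallel=False)    # always 1.0 by construction
tpf_parallel = decode(64, parallel=True)   # several tokens per step
```

Raising TPF directly cuts the number of forward passes, which is where the reported step reduction on GSM8K comes from.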
Speed Always Wins: Shanghai AI Lab's 82-page survey showcases the appeal of efficient LLM architectures
机器之心· 2025-08-25 09:10
Core Insights
- The article discusses the advancements and challenges of large language models (LLMs), emphasizing their transformative impact on human-computer interaction and the need for efficient architectures to overcome high training and inference costs [2][3][8].

Group 1: LLM Architecture and Efficiency
- The success of LLMs is primarily attributed to the Transformer architecture, which, despite its breakthroughs, faces challenges from its O(N^2) complexity on long-sequence tasks [3][4].
- Recent innovations in Transformer architecture have emerged, but a comprehensive review summarizing these advancements has been lacking [4][5].
- A collaborative effort by Shanghai AI Lab and several institutions has produced a survey of over 440 papers covering the latest progress in efficient LLM architectures [5][6].

Group 2: Categories of Efficient Architectures
- The survey categorizes efficient LLM architectures into seven types: linear sequence modeling, sparse sequence modeling, efficient full attention, sparse expert models, mixed model architectures, diffusion language models, and applications to other modalities [6][8].
- Linear sequence modeling aims to reduce attention training and inference complexity without incurring KV cache overhead [6][8].
- Sparse sequence modeling leverages the inherent sparsity of attention maps to accelerate computation [21][22].

Group 3: Innovations in Attention Mechanisms
- Efficient full attention methods optimize memory access and KV storage while preserving complete attention [22][23].
- Sparse expert models enhance model capacity without proportionally increasing computational cost through conditional activation of experts [27][28].
- Mixed architectures strike a balance between linear/sparse attention and full attention, optimizing both efficiency and performance [35][36].

Group 4: Applications and Future Directions
- Diffusion language models represent a novel approach that brings diffusion models from visual tasks to language generation, significantly improving generation speed [38][39].
- Efficient architectures are being applied across modalities, including vision and audio, demonstrating their versatility and effectiveness [44][45].
- The overarching goal is substantial acceleration of AI development, in the spirit of "Speed Always Wins", with a focus on efficiently training and deploying powerful models [45].
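The linear sequence modeling idea above can be sketched under one common kernel-trick formulation (an assumption for illustration, not a method taken from the survey): replacing softmax attention with a positive feature map φ lets the matrix products be reordered so the N×N attention matrix is never materialized, taking the cost from O(N²) to O(N). Causal masking is ignored here for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 6, 4                               # sequence length, head dimension
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))

# Simple positive feature map (an assumption; real methods vary).
phi = lambda x: np.maximum(x, 0) + 1e-3

# Quadratic form: materializes the full N x N attention matrix.
A = phi(Q) @ phi(K).T                     # (N, N)
out_quadratic = (A / A.sum(axis=1, keepdims=True)) @ V

# Linear form: reorder the matmuls so only d x d state is kept.
S = phi(K).T @ V                          # (d, d) key-value summary
z = phi(K).sum(axis=0)                    # (d,) normalizer
out_linear = (phi(Q) @ S) / (phi(Q) @ z)[:, None]

assert np.allclose(out_quadratic, out_linear)
```

Both forms compute the same output; the linear form is what allows recurrent, KV-cache-free inference, which is the property the survey's "linear sequence modeling" category is built around.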