Diffusion Large Language Models
Stop making search agents "wait idly": a Renmin University of China team uses diffusion models to "multitask," thinking while search results load, for a 15% speedup with no performance loss
量子位· 2026-03-02 09:09
Core Viewpoint
- The article discusses the limitations of traditional search agents and introduces diffusion large language models (dLLMs) as a way to improve search efficiency by letting reasoning and action proceed in parallel during the search process [1][8][28].

Group 1: Limitations of Traditional Search Agents
- Traditional search agents operate strictly serially: the model idles while waiting for search results before it can continue reasoning [8].
- Under current frameworks such as ReAct, this serial waiting accounts for a significant share of end-to-end latency [9].
- Autoregressive models cannot reason in parallel with tool execution, which limits their efficiency in search tasks [10][16].

Group 2: Introduction of Diffusion Large Language Models (dLLMs)
- A dLLM can "dual-task," thinking about the next steps while waiting for search results [5][11].
- Unlike autoregressive models, dLLMs generate tokens non-sequentially, enabling them to produce the most important parts of the output first [12][13].
- Initial tests of dLLMs as search agents showed poor performance, indicating that they have the potential but require further training to be effective [14][16].

Group 3: Training Methodology for the dLLM
- Training consists of two phases: Agentic SFT (Supervised Fine-Tuning) and Agentic VRPO (Variance-Reduced Preference Optimization) [18][20].
- The first phase generates high-quality search trajectories and trains the model to produce thoughts and tool calls without seeing the search results [19].
- The second phase refines the model's reasoning paths through preference learning, improving accuracy across various datasets [20].

Group 4: P-ReAct for Enhanced Efficiency
- P-ReAct is introduced as a method to accelerate reasoning and tool calling without additional training [21][22].
- It pre-fills boundary markers and raises the confidence scores of the tool-calling region, so the model decodes tool calls first [23][24].
- P-ReAct yields significant improvements in response time and accuracy, demonstrating the effectiveness of the dLLM in search tasks [25][26].

Group 5: Performance and Implications
- dLLM-Searcher achieved an average accuracy of 57.0% across multiple benchmark datasets, surpassing traditional methods and showing strong generalization [25][27].
- The results indicate that dLLMs can match or exceed the reasoning capabilities of autoregressive models while leveraging their unique structural advantages [28].
- This advancement opens new avenues for optimizing search-agent efficiency, suggesting a shift in how search tasks may be approached in the future [29].
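The "dual-tasking" idea above, continuing to generate thought tokens while a search call is still in flight, can be sketched with plain asyncio. This is a toy illustration, not the paper's implementation: `search` and `think` are hypothetical stand-ins for a real tool call and a real decoding step.

```python
import asyncio

async def search(query: str) -> str:
    # Stand-in for a real search tool call with network latency.
    await asyncio.sleep(0.2)
    return f"results for {query!r}"

async def think(step: int) -> str:
    # Stand-in for decoding one block of reasoning tokens.
    await asyncio.sleep(0.05)
    return f"thought-{step}"

async def serial_agent(query: str) -> list[str]:
    # ReAct-style baseline: issue the search, idle until it returns.
    trace = [await search(query)]
    trace.append(await think(0))
    return trace

async def parallel_agent(query: str) -> list[str]:
    # dLLM-style "dual-tasking": keep thinking while the search runs.
    pending = asyncio.create_task(search(query))
    trace, step = [], 0
    while not pending.done():
        trace.append(await think(step))
        step += 1
    trace.append(pending.result())
    return trace

print(asyncio.run(parallel_agent("diffusion LLM agents")))
```

`serial_agent` is included only for contrast: it spends the full search latency doing nothing, while `parallel_agent` overlaps that latency with useful reasoning.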
A 7B diffusion language model at 1000+ tokens/s on a single sample! Shanghai Jiao Tong University and Huawei release LoPA
机器之心· 2025-12-31 08:11
Core Insights
- The article discusses a breakthrough for diffusion large language models (dLLMs): a new decoding algorithm called LoPA (Lookahead Parallel Decoding) that significantly enhances inference speed and parallelism [2][3][36].

Group 1: LoPA Algorithm Features
- LoPA achieves a high degree of parallelism, raising tokens generated per forward pass (TPF) from 3.1 to 10.1 and surpassing traditional methods [3][7].
- The algorithm is plug-and-play, requiring no retraining or fine-tuning of the model [8].
- Its lookahead parallel decoding mechanism actively explores different token-filling orders to avoid local optima [9].
- The accompanying LoPA-Dist system maximizes hardware utilization by supporting both CUDA and Ascend platforms [10].

Group 2: Performance Metrics
- LoPA reached a single-sample throughput of 1073.9 tokens/s on the Huawei Ascend 910C platform, significantly outperforming baseline models [3][33].
- In experiments, LoPA integrated with D2F-Dream achieved a TPF of 10.1 on the GSM8K benchmark, drastically reducing the total number of inference steps [28][31].
- These results show that algorithmic parallelism can be converted into substantial real-time acceleration, exceeding 1000 tokens/s on dedicated engines [34].

Group 3: System Design and Optimization
- The LoPA-Dist distributed inference system employs a new branch-parallelism strategy that can be combined with existing tensor-parallelism methods [25].
- It is optimized per hardware platform: LoPA-Dist-NV targets low-latency scenarios, while LoPA-Dist-Ascend targets high-throughput serving environments [26].

Group 4: Future Directions
- The team plans to explore applying LoPA to other dLLM architectures, such as SDAR, to further advance efficient generative models [36].
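LoPA's core move, branching over several candidate token-filling orders and keeping the branch that commits the most tokens, can be caricatured in a few lines. Everything here is a toy assumption: `score` stands in for the model's per-position confidence (real LoPA scores branches with the dLLM itself), and the thresholds are arbitrary.

```python
import random

TARGET_LEN = 12

def score(seq, pos, rng):
    # Toy stand-in for a dLLM confidence at position `pos`; models tend
    # to grow more confident as more context tokens are filled in.
    filled = sum(t is not None for t in seq)
    return rng.random() * 0.5 + 0.05 * filled

def lookahead_decode(branches=4, per_step=4, threshold=0.4, seed=0):
    rng = random.Random(seed)
    seq = [None] * TARGET_LEN
    steps = 0
    while any(t is None for t in seq):
        masked = [i for i, t in enumerate(seq) if t is None]
        best = []
        # Branch several candidate filling orders; keep the branch that
        # commits the most tokens above the confidence threshold.
        for _ in range(branches):
            order = rng.sample(masked, len(masked))
            commit = [p for p in order[:per_step]
                      if score(seq, p, rng) > threshold]
            if len(commit) > len(best):
                best = commit
        if not best:
            best = [masked[0]]  # always make progress
        for p in best:
            seq[p] = f"tok{p}"
        steps += 1
    return seq, steps

seq, steps = lookahead_decode()
```

The point of the branching is exactly the one the article makes: a single greedy filling order can get stuck committing few tokens per step, while exploring alternatives lifts the tokens committed per forward pass.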
Speed Always Wins: Shanghai AI Lab's 82-page survey takes you through the appeal of efficient LLM architectures
机器之心· 2025-08-25 09:10
Core Insights
- The article discusses the advancements and challenges of large language models (LLMs), emphasizing their transformative impact on human-computer interaction and the need for efficient architectures to curb high training and inference costs [2][3][8].

Group 1: LLM Architecture and Efficiency
- The success of LLMs rests primarily on the Transformer architecture, which, despite its breakthroughs, suffers from O(N^2) attention complexity on long-sequence tasks [3][4].
- Many innovations on the Transformer architecture have emerged recently, but a comprehensive review summarizing these advancements has been lacking [4][5].
- A collaboration between Shanghai AI Lab and several institutions has produced a survey of over 440 papers focusing on the latest progress in efficient LLM architectures [5][6].

Group 2: Categories of Efficient Architectures
- The survey categorizes efficient LLM architectures into seven types: linear sequence modeling, sparse sequence modeling, efficient full attention, sparse expert models, hybrid model architectures, diffusion language models, and applications to other modalities [6][8].
- Linear sequence modeling reduces attention training and inference complexity to linear without incurring KV-cache overhead [6][8].
- Sparse sequence modeling leverages the inherent sparsity of attention maps to accelerate computation [21][22].

Group 3: Innovations in Attention Mechanisms
- Efficient full-attention methods maintain complete attention while optimizing memory access and KV storage [22][23].
- Sparse expert models enhance model capacity without proportionally increasing computational cost through conditional activation of experts [27][28].
- Hybrid architectures strike a balance between linear/sparse attention and full attention, optimizing both efficiency and performance [35][36].
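The linear sequence modeling idea above can be made concrete with the classic kernel trick: replace softmax with a positive feature map phi so attention factors as phi(Q)(phi(K)^T V), costing O(N*d^2) instead of O(N^2*d). A minimal NumPy sketch; the phi(x) = elu(x) + 1 map is one common choice from the linear-attention literature, not necessarily what any specific surveyed method uses.

```python
import numpy as np

def elu_plus_one(x):
    # Positive feature map phi(x) = elu(x) + 1, keeping all weights > 0.
    return np.where(x > 0, x + 1.0, np.exp(x))

def softmax_attention(Q, K, V):
    # Standard attention: the N x N score matrix makes this O(N^2 * d).
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

def linear_attention(Q, K, V):
    # Kernel trick: build the d x d matrix phi(K)^T V once, then apply
    # it to every query row, so cost is O(N * d^2), linear in N.
    Qp, Kp = elu_plus_one(Q), elu_plus_one(K)
    kv = Kp.T @ V                 # (d, d), shared across all positions
    z = Qp @ Kp.sum(axis=0)       # (N,) per-row normalizer
    return (Qp @ kv) / z[:, None]

rng = np.random.default_rng(0)
N, d = 64, 16
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))
out = linear_attention(Q, K, V)
print(out.shape)  # (64, 16)
```

`softmax_attention` is included only to show what the factorization replaces; the two do not compute identical outputs, which is exactly the approximation trade-off the survey's linear-modeling category studies.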
Group 4: Applications and Future Directions
- Diffusion language models represent a novel approach, transferring diffusion models from visual tasks to language generation and significantly improving generation speed [38][39].
- Efficient architectures are being applied across various modalities, including vision and audio, demonstrating their versatility and effectiveness [44][45].
- The overarching goal, captured by the phrase "Speed Always Wins," is substantial acceleration of AI development through efficiency in training and deploying powerful models [45].
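The speed advantage of diffusion language models comes from committing many tokens per denoising step rather than one token per autoregressive step. A toy sketch of that loop, with random numbers standing in for a real model's per-token confidences (the threshold and seed are arbitrary assumptions):

```python
import random

def parallel_unmask(length=16, threshold=0.7, seed=1):
    # Toy "denoising" loop: each step scores all masked positions and
    # commits every token whose confidence clears the threshold, so
    # several tokens land per step, whereas an autoregressive decoder
    # commits exactly one token per step.
    rng = random.Random(seed)
    seq = [None] * length
    steps = 0
    while any(t is None for t in seq):
        steps += 1
        masked = [i for i, t in enumerate(seq) if t is None]
        conf = {i: rng.random() for i in masked}
        commit = [i for i in masked if conf[i] >= threshold]
        if not commit:  # guarantee progress with the single best token
            commit = [max(masked, key=conf.get)]
        for i in commit:
            seq[i] = f"tok{i}"
    return steps

steps = parallel_unmask()  # typically far fewer than 16 steps for 16 tokens
```

Lowering the threshold trades quality for speed: at threshold 0 everything is committed in a single step, which mirrors why real dLLM decoders need confidence-aware schedules rather than maximal parallelism.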