Can AI Write Reliable Literature Reviews?
Hu Xiu· 2025-07-04 07:49
Core Insights
- The article discusses advances in artificial intelligence (AI) that enable faster and more efficient literature reviews in scientific research, particularly through AI systems like FutureHouse's PaperQA2, which can summarize vast amounts of scientific knowledge quickly and accurately [1][6].

Group 1: AI in Literature Review
- AI systems are being developed to automate literature reviews, with tools like Consensus and Elicit helping researchers summarize and categorize scientific publications [2][4].
- Despite these advances, current AI tools cannot independently produce high-quality systematic reviews, which require rigorous methodologies and meta-analyses [2][3].
- The emergence of generative AI models has raised concerns about low-quality or misleading reviews, since these models may not adhere to established research practices [2][3][10].

Group 2: Challenges and Limitations
- Systematic reviews involve at least 25 rigorous steps, making them time-consuming and complex; they often take months or years to complete [7][8].
- Many AI tools, including Elicit, are limited to searching open-access papers and abstracts, which prevents them from accessing full-text articles behind paywalls [5][6].
- The performance of AI systems in generating literature reviews is still under scrutiny, with experts emphasizing the need for transparency and reproducibility in the review process [9][12].

Group 3: Future Directions
- Ongoing research aims to improve AI tools for literature reviews, focusing on efficiency and accuracy while maintaining rigorous standards [9][12].
- Non-profit organizations are being encouraged to participate in developing AI tools to ensure reliability and transparency in scientific literature synthesis [12].
- Funding initiatives are being announced to support the development of evidence-synthesis systems, indicating growing interest in improving the quality of literature reviews through AI [12].
AI: The Culprit Behind Accelerating Skill Decay
36Kr· 2025-07-02 07:16
Core Viewpoint
- The article argues that over-reliance on Large Language Models (LLMs) is leading to a decline in critical thinking among engineers, and emphasizes the need to preserve the essence of programming as a craft [1][3][17].

Group 1: Risks of Over-Reliance on LLMs
- Engineers who treat LLMs as partners often prioritize speed over depth of thought, which can erode their skills and critical thinking [5][6].
- The use of LLMs can cost many developers the flow state and the creative enjoyment of programming [7].
- LLMs may produce incorrect code or code with hidden logical flaws, increasing risk when users lack the judgment to catch them [12].

Group 2: Importance of Program Theory and Entropy
- LLMs cannot grasp program theory and program entropy, which are essential for effective programming and for understanding the complexities of software development [9][13].
- Program theory holds that programming is about forming insights and theories rather than merely writing code, which is crucial for maintaining and modifying software [10][11].
- Program entropy holds that any modification to a program increases its complexity, and that only humans can effectively manage this entropy [14][15].

Group 3: Long-Term Value of Human Engineers
- The article suggests that LLMs will not replace human engineers, as the uniquely human ability to think critically and deeply about engineering problems remains irreplaceable [8][18].
- Companies pursuing AI purely for cost reduction may face new risks and long-term costs, indicating that the value of human engineering skills will persist [18][19].
In the Era of Large Models, Where Will Vision Generalist Models Go?
机器之心· 2025-07-02 00:54
Core Viewpoint
- The article discusses the evolution of Vision Generalist Models (VGM) amid the rise of multimodal large models, emphasizing the need for a distinct focus on visual data despite the shift toward integrating visual modalities with language models [1][2].

Group 1: VGM Overview
- VGM aims to create a unified framework capable of handling various visual tasks and modalities, mirroring the success of large language models in natural language processing [7].
- VGM's key capability is processing multimodal inputs, including images, point clouds, and videos, through a shared representation [7][8].
- The model supports multiple visual tasks simultaneously, allowing parallel processing within a single framework [8].

Group 2: Data, Tasks, and Evaluation
- VGM uses large and diverse datasets for training and evaluation, covering various types of visual data to support multimodal learning [9].
- Visual tasks are categorized into four types: image tasks, geometric tasks, time-series tasks, and other vision-related tasks [9].
- Modern evaluation methods focus on cross-task generalization and multimodal processing capabilities, departing from traditional single-task assessments [9].

Group 3: Model Design Paradigms
- Existing VGM design paradigms focus on unifying different visual-modality inputs and diverse task outputs, falling primarily into encoding-based frameworks and sequence-to-sequence frameworks [12][13].
- Encoding-based frameworks create a shared feature space for different input modalities, while sequence-to-sequence frameworks suit tasks with variable-length inputs and outputs [12][13].

Group 4: Current Progress and Future Directions
- Current VGM research has made significant progress in the unified processing of multiple tasks and modalities, but still faces challenges in optimizing framework design and improving training efficiency [16].
- Data acquisition and annotation remain bottlenecks for VGM development; future research will likely focus on automated annotation techniques and large-scale unsupervised learning [16].
- Despite these challenges, VGM shows broad potential in practical applications, extending beyond traditional visual tasks to complex multimodal tasks in fields such as intelligent surveillance, autonomous driving, and robotics [16].
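The encoding-based paradigm described above can be illustrated with a minimal sketch: per-modality encoders project images, point clouds, and video features into one shared embedding space that downstream task heads consume. All names, dimensions, and the linear-projection design here are illustrative assumptions, not any specific VGM implementation.

```python
import numpy as np

# Hypothetical sketch of an encoding-based vision-generalist framework:
# each modality gets its own encoder, but every encoder projects into a
# single shared feature space that task heads can operate on uniformly.

SHARED_DIM = 64
rng = np.random.default_rng(0)

class ModalityEncoder:
    """Linear projection from a modality-specific input size to the shared space."""
    def __init__(self, input_dim: int):
        self.weights = rng.standard_normal((input_dim, SHARED_DIM)) * 0.01

    def encode(self, x: np.ndarray) -> np.ndarray:
        return x @ self.weights  # shape: (n_tokens, SHARED_DIM)

# One encoder per visual modality; the input sizes are illustrative.
encoders = {
    "image": ModalityEncoder(768),      # e.g. flattened patch features
    "point_cloud": ModalityEncoder(3),  # xyz coordinates per point
    "video": ModalityEncoder(1024),     # per-frame features
}

def to_shared_space(modality: str, x: np.ndarray) -> np.ndarray:
    return encoders[modality].encode(x)

# All modalities land in the same space, so one task head can serve them all.
img_feat = to_shared_space("image", rng.standard_normal((196, 768)))
pts_feat = to_shared_space("point_cloud", rng.standard_normal((2048, 3)))
```

A sequence-to-sequence framework would instead serialize inputs and outputs as token streams; the shared-space design above trades that flexibility for a fixed, task-agnostic representation.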
With Only 27 Million Parameters, This Reasoning Model Outperforms DeepSeek and Claude
机器之心· 2025-06-30 10:23
Core Insights
- The article discusses the need for a transformation in large language model (LLM) architectures, focusing on the limitations of current chain-of-thought (CoT) techniques, which face challenges such as task complexity, high data requirements, and latency [2][4].

Group 1: Hierarchical Reasoning Model (HRM)
- The Hierarchical Reasoning Model (HRM) is introduced as a novel recurrent architecture inspired by the human brain's layered, multi-timescale processing, achieving high computational depth while maintaining training stability and efficiency [3][6].
- HRM operates through two interdependent recurrent modules: a high-level module for slow, abstract planning and a low-level module for fast, detailed computation. It achieves remarkable performance on complex reasoning tasks with only 27 million parameters and 1,000 training samples [4][5].
- HRM requires no pre-training or CoT data, yet performs nearly perfectly on challenging tasks such as complex Sudoku puzzles and optimal pathfinding in large mazes, outperforming larger models with longer context windows [5][6].

Group 2: Design and Mechanisms
- HRM's core design rests on hierarchical processing and time-scale separation: high-level brain regions integrate information over longer time scales while low-level regions handle immediate sensory information [12][13].
- HRM incorporates feedback loops, analogous to the brain's dense recurrent connections, that enhance representation accuracy and contextual adaptability while avoiding the problems of backpropagation through time (BPTT) [14][19].
- The model introduces approximate gradients and deep supervision, enabling efficient memory usage and improved training dynamics, in contrast to traditional methods that demand extensive memory and time [20][23].

Group 3: Performance and Adaptability
- HRM exhibits hierarchical convergence: the high-level module stabilizes while the low-level module converges repeatedly, yielding rapid convergence and minimal residuals compared to deep neural networks [17][36].
- The model features adaptive computation time (ACT), dynamically adjusting computational resources to task complexity and optimizing performance without significant resource expenditure [25][27].
- HRM can extend inference computation simply by adjusting parameters, without retraining or architectural changes, showcasing its flexibility on complex reasoning tasks [28][36].

Group 4: Experimental Results
- Experimental results indicate that HRM excels at complex reasoning tasks, raising questions about the underlying reasoning algorithms it employs, which matter for model interpretability [31][39].
- Visualizations of HRM's reasoning reveal its strategies on maze and Sudoku tasks, showing a blend of exploration and optimization that resembles depth-first search [31][38].
- HRM's hierarchical structure emerges naturally while learning complex reasoning tasks, rather than being an inherent property of the architecture [34].
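The two-module, two-timescale loop described in Group 1 can be sketched abstractly: a fast low-level recurrence runs several steps for every single high-level update, and the high-level state is refreshed only from the low-level module's settled state. This is an illustrative toy, not the HRM paper's architecture; the dimensions, tanh update rules, and step counts are all assumptions.

```python
import numpy as np

# Toy sketch of a two-timescale hierarchical recurrence (illustrative only):
# the low-level module takes several fast steps for every one slow update
# of the high-level module, mirroring the hierarchy described for HRM.

rng = np.random.default_rng(0)
DIM = 8
W_low = rng.standard_normal((DIM, DIM)) * 0.1
W_high = rng.standard_normal((DIM, DIM)) * 0.1

def low_step(z_low, z_high, x):
    # Fast module: conditioned on the slow state and the raw input.
    return np.tanh(z_low @ W_low + z_high + x)

def high_step(z_high, z_low):
    # Slow module: updated only from the low-level module's settled state.
    return np.tanh(z_high @ W_high + z_low)

def hrm_forward(x, n_high_cycles=4, n_low_steps=8):
    z_low = np.zeros(DIM)
    z_high = np.zeros(DIM)
    for _ in range(n_high_cycles):        # slow, abstract planning loop
        for _ in range(n_low_steps):      # fast, detailed computation loop
            z_low = low_step(z_low, z_high, x)
        z_high = high_step(z_high, z_low) # one slow update per cycle
    return z_high

out = hrm_forward(rng.standard_normal(DIM))
```

Increasing `n_high_cycles` at inference time loosely corresponds to the article's point about extending inference computation without retraining: the same weights simply iterate longer.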
Why Are Most LLM Startups Doomed to Fail?
36Kr· 2025-06-30 07:13
Group 1
- The AI startup ecosystem faces a harsh reality: many companies believe they are building on a stable platform provided by large language models (LLMs), when in fact they are nesting among predators [2][4].
- The core illusion of modularity behind the LLM startup boom is flawed: model suppliers are not neutral layers but vertically integrated companies that control user interfaces and distribution channels [3][4].
- The influx of venture capital into LLM-based startups has produced a strategic miscalculation, conflating the ease of building a prototype with the sustainability of a business model [4][5].

Group 2
- Some startups may survive the shakeout by holding irreplaceable competitive advantages, such as distribution barriers, proprietary data, or control over inference [5][6].
- The allure of the LLM-wrapper model stems from its perceived advantages in a capital-driven environment, but it obscures a fundamental strategic flaw: lack of control over the value engine [7][8].
- The behavior of model suppliers reflects the rational choices of monopolistic enterprises: they seek to expand upstream and capture profits rather than serve as passive infrastructure [6][8].

Group 3
- Founders must critically assess their reliance on others' LLMs and examine their business positioning, asking hard questions about their unique advantages and potential vulnerabilities [8][9].
- The new decision-making criteria for startups include rapid prototyping, quick iteration, and minimal cash burn, underscoring the need for a foundation beyond mere API usage [8][10].
- The era of LLM-wrapper products has ended; the new landscape favors those who control data, distribution, and infrastructure, the true competitive barriers [12].
From Post-training Back to Pre-training: Does the Promise of LLM+RL Have a Chance to Go Further?
机器之心· 2025-06-28 05:22
Core Insights
- The article discusses the potential of combining Reinforcement Learning (RL) with Large Language Models (LLMs), focusing on the transition from post-training to pre-training and the challenges and opportunities it presents [2][3].

Group 1: Transition from Post-training to Pre-training
- Integrating RL with LLMs is seen as a significant technological advance, extending its application from post-training into the pre-training phase [2].
- LLMs traditionally rely on supervised learning, which requires extensive, accurate human-provided data, making RL a viable alternative for addressing these limitations [3].
- RL's ability to generate data through model-environment interaction reduces dependence on high-quality labeled data, lowering supervision requirements [3][4].

Group 2: Applications and Innovations in RL
- Early applications of RL in LLMs focused on post-training, with techniques such as Reinforcement Learning from Human Feedback (RLHF) being prominent [4].
- Recent advances, such as Reinforcement Pre-Training (RPT) from researchers at Microsoft and Tsinghua University, extend RL to the pre-training phase and show improved performance on certain benchmarks [4][5].
- RPT reframes the next-token prediction (NTP) task as a verifiable reasoning task, potentially unlocking RL's capabilities while reducing reliance on labeled data [5].

Group 3: Challenges and Limitations
- Despite these promising developments, the limits of RL in LLMs are still being uncovered; the path looks bright, but significant challenges remain [4][6].
- RPT's training data and settings have yet to be validated across broader text corpora and foundation models, and the computational demands of RL training remain a challenge [5].
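The idea attributed to RPT above, treating next-token prediction as a verifiable task, can be sketched in its simplest form: the model's sampled continuation is checked against the ground-truth next token, and the match itself is the reward. No human labeling is needed, which is what makes the objective verifiable. This is a conceptual toy under assumed interfaces, not the RPT training recipe; the stand-in "model" is a trivial bigram lookup purely so the example runs.

```python
# Conceptual sketch: a verifiable reward for next-token prediction.
# `sample_next_token` stands in for any language model.

corpus = "the cat sat on the mat".split()

def sample_next_token(context: list) -> str:
    # Hypothetical stand-in model: return the continuation of the first
    # occurrence of the last context word in the corpus.
    last = context[-1]
    for i in range(len(corpus) - 1):
        if corpus[i] == last:
            return corpus[i + 1]
    return "<unk>"

def verifiable_reward(context: list, ground_truth: str) -> float:
    """Reward 1.0 iff the sampled token matches the true next token.

    The check is mechanical and needs no human judgment, which is the
    property that makes the objective 'verifiable'.
    """
    return 1.0 if sample_next_token(context) == ground_truth else 0.0

# Walk the corpus, scoring each next-token prediction.
rewards = [
    verifiable_reward(corpus[: i + 1], corpus[i + 1])
    for i in range(len(corpus) - 1)
]
print(rewards)  # → [1.0, 1.0, 1.0, 1.0, 0.0]
```

The last position fails because "the" first appears before "cat", not "mat"; an RL objective would push the policy toward continuations that earn reward in context, which a plain frequency lookup cannot do.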
AgentAuditor: Bringing Agent Safety Evaluators to Human-Level Accuracy
机器之心· 2025-06-27 04:02
Core Insights
- LLM Agents are evolving from mere text generators into autonomous decision-makers capable of executing complex tasks, raising safety concerns about their interactions [1].
- Existing safety evaluation benchmarks for LLM Agents lack effective evaluators and struggle to assess the nuanced risks of complex interactions [1].
- AgentAuditor, a framework developed by researchers from multiple universities, aims to raise the safety evaluation of LLM Agents to human-expert level [2].

Evaluation Challenges
- Traditional LLM safety assessments excel at evaluating generated content but fail to address the complexities of agent interactions and decision-making [1].
- Current evaluation methods, whether rule-based or model-based, struggle to accurately identify subtle risks and to interpret ambiguous rules [1].

AgentAuditor Framework
- AgentAuditor combines structured memory and retrieval-augmented reasoning (RAG) to enhance LLM evaluators' ability to learn from and understand complex interaction records [4].
- The framework operates in three key stages:
  1. Feature Memory Construction transforms raw interaction records into a structured database carrying deep semantic information [4].
  2. Reasoning Memory Construction selects representative cases to generate high-quality reasoning chains that guide subsequent evaluations [5].
  3. Memory-Augmented Reasoning dynamically retrieves relevant reasoning experiences to help LLM evaluators make precise judgments [6].

ASSEBench Dataset
- ASSEBench is a newly created benchmark designed to validate AgentAuditor's capabilities, consisting of 2,293 meticulously annotated real agent interaction records [9].
- The benchmark covers 15 risk types, 528 interaction environments, and 29 application scenarios, ensuring comprehensive evaluation [9].
- It employs a human-machine collaborative annotation process with both strict and lenient judgment standards for nuanced risk assessment [9].

Experimental Results
- Extensive experiments show that AgentAuditor significantly improves LLM evaluators' performance across datasets, reaching human-level accuracy [10][11].
- For instance, the Gemini-2-Flash-Thinking model saw its F1 score increase by up to 48.2% on ASSEBench-Safety, approaching human-level performance [12].
- AgentAuditor's adaptive capabilities let it adjust reasoning strategies to different evaluation standards, effectively narrowing performance gaps among models [12].

Conclusion
- AgentAuditor and ASSEBench provide robust evaluation tools and a research foundation for building more trustworthy LLM Agents [17].
- This advance not only propels the development of LLM evaluators but also guides the construction of safer, more reliable agent defense systems [17].
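The three-stage pipeline above can be sketched at a high level: interaction records become feature vectors, representative cases carry stored reasoning chains, and at evaluation time the nearest stored case is retrieved to prime the evaluator's judgment. The keyword featurizer, cosine similarity, and record format below are all illustrative assumptions, not AgentAuditor's actual implementation.

```python
import math

# Illustrative sketch of memory-augmented evaluation: retrieve the most
# similar stored reasoning chain to guide judgment of a new record.
# The toy keyword-count featurizer is an assumption for demonstration.

KEYWORDS = ["delete", "payment", "email", "confirm", "install"]

def featurize(record: str) -> list:
    words = record.lower().split()
    return [float(words.count(k)) for k in KEYWORDS]

def cosine(a: list, b: list) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

# Stages 1 + 2: feature memory paired with reasoning chains (hypothetical cases).
memory = [
    ("agent ran delete on user files without confirm step",
     "destructive action lacked confirmation -> unsafe"),
    ("agent asked user to confirm before payment",
     "explicit confirmation preceded sensitive action -> safe"),
]
feature_memory = [(featurize(rec), chain) for rec, chain in memory]

# Stage 3: retrieve the closest reasoning chain for a new interaction record.
def retrieve_reasoning(new_record: str) -> str:
    query = featurize(new_record)
    best = max(feature_memory, key=lambda item: cosine(query, item[0]))
    return best[1]  # this chain would be prepended to the evaluator's prompt

chain = retrieve_reasoning("agent tried to delete logs, skipped the confirm dialog")
print(chain)  # → destructive action lacked confirmation -> unsafe
```

A production system would use learned embeddings rather than keyword counts, but the retrieval-then-reason control flow is the same.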
AI Is Starting to "Freely Use the Computer"! Jilin University Proposes the "ScreenExplorer" Agent
机器之心· 2025-06-27 04:02
Core Viewpoint
- The article discusses ScreenExplorer, a vision-language model (VLM) agent designed to autonomously explore and interact with open graphical user interface (GUI) environments, marking a significant step toward general artificial intelligence (AGI) [2][3][35].

Group 1: Breakthroughs and Innovations
- The research introduces three core breakthroughs in training VLM agents for GUI exploration [6].
- A real-time interactive online reinforcement learning framework lets the VLM agent interact with a live GUI environment [8][11].
- A "curiosity mechanism" addresses the sparse-feedback problem of open GUI environments, motivating the agent to explore diverse interface states [10][12].

Group 2: Training Methodology
- Training uses a heuristic, world-model-driven reward system that encourages exploration by granting immediate rewards for diverse actions [12][24].
- The GRPO algorithm is used for reinforcement learning, computing the advantage of each action from the rewards obtained [14][15].
- Multiple parallel environments synchronize reasoning, execution, and recording, enabling "learning by doing" [15].

Group 3: Experimental Results
- Initial experiments show that, without training, the Qwen2.5-VL-3B model fails to interact effectively with the GUI [17].
- After training, the model demonstrates improved capabilities, successfully opening applications and navigating deeper into pages [18][20].
- The ScreenExplorer models outperform general-purpose models in exploration diversity and interaction effectiveness, marking a significant advance in autonomous GUI interaction [22][23].

Group 4: Skill Emergence and Conclusion
- Training leads to the emergence of new skills, such as cross-modal translation and complex reasoning [29][34].
- The research concludes that ScreenExplorer effectively enhances GUI interaction through a combination of exploration rewards, world models, and GRPO reinforcement learning, paving the way for more autonomous agents and progress toward AGI [35].
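The GRPO advantage computation mentioned in Group 2 normalizes each sampled action's reward against the group of rollouts it came from, removing the need for a learned critic. A minimal sketch of that standard group-relative formulation follows; the reward values are hypothetical, and ScreenExplorer's actual curiosity-reward terms are not reproduced here.

```python
import math

# Minimal sketch of the GRPO-style group-relative advantage: each rollout's
# reward is normalized by the mean and standard deviation of its group,
# so no learned value function (critic) is needed.

def grpo_advantages(group_rewards: list, eps: float = 1e-8) -> list:
    n = len(group_rewards)
    mean = sum(group_rewards) / n
    var = sum((r - mean) ** 2 for r in group_rewards) / n
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in group_rewards]

# Example: four rollouts from the same GUI state; the third rollout earned
# the highest (hypothetical) exploration reward for reaching a novel screen.
rewards = [0.1, 0.1, 0.9, 0.5]
advs = grpo_advantages(rewards)
print([round(a, 3) for a in advs])
```

Rollouts above the group mean get positive advantages and are reinforced; identical rewards yield identical advantages, so only relative novelty within the group drives the update.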
Ditch CUDA Programming! CMU and Collaborators Compile LLMs into Megakernels with a Few Dozen Lines of Code, Cutting Inference Latency by up to 6.7x
机器之心· 2025-06-21 01:33
Core Viewpoint
- The Mirage Persistent Kernel (MPK) compiler, introduced by a team led by Zhihao Jia at CMU, reduces the inference latency of large language models (LLMs) by 1.2x to 6.7x, addressing the high manual-optimization costs and end-to-end delays of CUDA-driven LLM inference [3][4][12].

Group 1: Introduction of MPK
- MPK automatically converts LLMs into optimized megakernels that execute the entire model without interruption, improving performance [9][10].
- The MPK compiler lets developers compile LLMs with minimal manual effort, requiring only a few lines of Python code [5][12].

Group 2: Performance Advantages
- MPK eliminates kernel-launch overhead and maximizes the overlap of computation, data loading, and GPU communication, yielding significantly lower inference latency [14][18].
- MPK's performance gains grow with the number of GPUs, making it particularly efficient in multi-GPU deployments [18].

Group 3: Working Mechanism of MPK
- MPK consists of two main components: a compiler that transforms LLM computation graphs into fine-grained task graphs, and a runtime system that executes these task graphs within a single megakernel [19][24].
- The MPK compiler captures dependencies at a finer granularity than existing systems, enabling more aggressive pipeline optimizations [26][27].

Group 4: Future Plans
- The team aims to improve MPK's usability and performance, with ongoing work on supporting dynamic workloads and advanced scheduling strategies [40][43].
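The compiler/runtime split described in Group 3 can be illustrated with a toy task graph: fine-grained tasks carry explicit dependencies, and a scheduler dispatches each task the moment its inputs are ready instead of waiting at whole-kernel boundaries. The task names and the Kahn-style scheduling loop are illustrative assumptions; MPK's real runtime schedules such tasks across GPU SMs inside one persistent megakernel.

```python
from collections import deque

# Toy sketch of fine-grained task-graph execution: each task runs as soon
# as all of its dependencies finish, mimicking how a megakernel runtime can
# overlap work that coarse kernel-per-op launches would serialize.

# Hypothetical per-tile tasks for one transformer layer (names illustrative).
dependencies = {
    "load_weights": [],
    "attn_tile_0": ["load_weights"],
    "attn_tile_1": ["load_weights"],
    "allreduce": ["attn_tile_0", "attn_tile_1"],
    "mlp_tile_0": ["allreduce"],
}

def schedule(deps: dict) -> list:
    """Kahn-style topological schedule: emit a task once its deps are done."""
    remaining = {task: len(ds) for task, ds in deps.items()}
    dependents = {task: [] for task in deps}
    for task, ds in deps.items():
        for d in ds:
            dependents[d].append(task)
    ready = deque(t for t, n in remaining.items() if n == 0)
    order = []
    while ready:
        task = ready.popleft()
        order.append(task)          # in MPK this would dispatch to a GPU SM
        for nxt in dependents[task]:
            remaining[nxt] -= 1
            if remaining[nxt] == 0:
                ready.append(nxt)
    return order

order = schedule(dependencies)
print(order)
```

Because `attn_tile_0` and `attn_tile_1` become ready simultaneously, a parallel runtime could run them on different SMs while weights for later tiles are still loading, which is the overlap that kernel-per-op launching forfeits.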
It's 2025: How Are Enterprises Spending Their AI Procurement Budgets?
机器之心· 2025-06-20 17:04
This article is from PRO member newsletter content; follow "机器之心PRO会员" at the end of the article for more topic analyses.

a16z recently released its 2025 "How Enterprises Buy AI" report. Based on in-depth interviews and broad surveys of enterprise executives worldwide, it reveals the key trends in how enterprises in 2025 are procuring, deploying, and budgeting for generative AI, with LLMs as its representative form.

Contents
01. Why do enterprise AI budgets only ever grow?
Why has enterprise AI spending kept rising? How has the composition of enterprise AI budgets changed? How are the goals of enterprise AI deployment shifting?...
02. Comparison shopping: what kind of LLM gets enterprises to pay?
Why do enterprises value an LLM's "differentiation" over its "commercialization"? Why are open-source models increasingly popular? How do large and small enterprises differ in their LLM preferences?...
03. How do enterprises buy AI models the way they buy traditional software?
What factors do enterprises now weigh when procuring AI models? How do external benchmarks affect AI procurement?...
① This report is part of a16z's ongoing research series; the team previously published "16 Changes to the Way Enterprises Are Building and Buying Generative AI" in February 2024, which drew on interviews and surveys with dozens of leaders from Fortune 500 and other top enterprises and more than 70 executives, yielding 16 core fin...