Where Are Vision Generalist Models Headed in the Era of Large Models?
机器之心· 2025-07-02 00:54
Core Viewpoint - The article discusses the evolution of Vision Generalist Models (VGM) in the context of the rise of multimodal large models, emphasizing the need for a distinct focus on visual data despite the shift towards integrating visual modalities with language models [1][2].

Group 1: VGM Overview
- VGM aims to create a unified framework capable of handling various visual tasks and modalities, similar to the success of large language models in natural language processing [7].
- VGM's key capability is its ability to process multimodal inputs, including images, point clouds, and videos, through a shared representation method [7][8].
- The model supports multiple visual tasks simultaneously, allowing for parallel processing within a single framework [8].

Group 2: Data, Tasks, and Evaluation
- VGM utilizes large and diverse datasets for training and evaluation, covering various types of visual data to support multimodal learning [9].
- Visual tasks are categorized into four types: image tasks, geometric tasks, time-series tasks, and other visual-related tasks [9].
- Modern evaluation methods focus on cross-task generalization and multimodal processing capabilities, differing from traditional single-task assessments [9].

Group 3: Model Design Paradigms
- Existing VGM design paradigms focus on unifying different visual modality inputs and diverse task outputs, primarily categorized into encoding-based frameworks and sequence-to-sequence frameworks [12][13].
- Encoding-based frameworks create a shared feature space for different input modalities, while sequence-to-sequence frameworks suit tasks with variable-length inputs and outputs [12][13].

Group 4: Current Progress and Future Directions
- Current VGM research has made significant progress in the unified processing of multiple tasks and modalities, but faces challenges in optimizing framework design and improving training efficiency [16].
- Data acquisition and annotation remain bottlenecks for VGM development, with future research likely focusing on automated annotation techniques and large-scale unsupervised learning methods [16].
- Despite these challenges, VGM shows broad potential in practical applications, extending beyond traditional visual tasks to complex multimodal tasks in fields such as intelligent surveillance, autonomous driving, and robotics [16].
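The encoding-based paradigm described above can be sketched in a few lines: modality-specific encoders project their inputs into one shared feature space, and task-specific heads read out of it. This is an illustrative toy, not the paper's design; the modalities, tasks, linear encoders, and dimensions are all assumptions.

```python
import numpy as np

# Toy sketch of an encoding-based VGM: each modality has its own encoder,
# all projecting into a single shared feature space consumed by task heads.
rng = np.random.default_rng(0)
DIM = 64  # width of the (hypothetical) shared feature space

encoders = {                               # modality -> shared space
    "image": rng.standard_normal((768, DIM)) * 0.02,   # stand-in for a ViT
    "point_cloud": rng.standard_normal((3, DIM)) * 0.02,
}
heads = {                                  # task -> readout of shared space
    "classify": rng.standard_normal((DIM, 10)) * 0.02,
    "depth": rng.standard_normal((DIM, 1)) * 0.02,
}

def vgm_forward(modality: str, x: np.ndarray, task: str) -> np.ndarray:
    shared = x @ encoders[modality]        # unified representation
    return shared @ heads[task]            # task-specific output

out = vgm_forward("image", rng.standard_normal((4, 768)), "classify")
print(out.shape)  # (4, 10)
```

Any modality/task pair routes through the same shared space, which is what lets one framework serve many visual tasks in parallel.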
With Only 27 Million Parameters, This Reasoning Model Outperforms DeepSeek and Claude
机器之心· 2025-06-30 10:23
Core Insights - The article discusses the need for transformation in the architecture of large language models (LLMs), particularly the limitations of current chain-of-thought (CoT) techniques, which face challenges such as task complexity, high data requirements, and latency issues [2][4].

Group 1: Hierarchical Reasoning Model (HRM)
- The Hierarchical Reasoning Model (HRM) is introduced as a novel recurrent architecture inspired by the human brain's layered, multi-timescale processing mechanisms, achieving high computational depth while maintaining training stability and efficiency [3][6].
- HRM operates through two interdependent recurrent modules: a high-level module for slow, abstract planning and a low-level module for fast, detailed computation, achieving remarkable performance on complex reasoning tasks with only 27 million parameters and 1,000 training samples [4][5].
- HRM requires no pre-training or CoT data, yet performs nearly perfectly on challenging tasks such as complex Sudoku puzzles and optimal pathfinding in large mazes, outperforming larger models with longer context windows [5][6].

Group 2: Design and Mechanisms
- The core design of HRM is based on hierarchical processing and timescale separation: high-level brain regions integrate information over longer timescales, while low-level regions handle immediate sensory information [12][13].
- HRM incorporates feedback loops similar to the brain's dense recurrent connections, enhancing representation accuracy and contextual adaptability while avoiding the issues of backpropagation through time (BPTT) [14][19].
- The model introduces approximate gradients and deep supervision, enabling efficient memory usage and improved training dynamics, in contrast to traditional methods that require extensive memory and time [20][23].

Group 3: Performance and Adaptability
- HRM exhibits hierarchical convergence: the high-level module stabilizes while the low-level module repeatedly re-converges, leading to rapid convergence and minimal residuals compared to deep neural networks [17][36].
- The model features adaptive computation time (ACT), dynamically adjusting computational resources to task complexity and optimizing performance without significant resource expenditure [25][27].
- HRM can seamlessly scale inference-time computation by adjusting parameters, without retraining or architectural changes, showcasing its flexibility on complex reasoning tasks [28][36].

Group 4: Experimental Results
- Experimental results indicate that HRM excels at complex reasoning tasks, raising questions about the underlying reasoning algorithms it employs, which is crucial for model interpretability [31][39].
- Visualizations of HRM's reasoning process reveal its strategies on maze and Sudoku tasks, showing a combination of exploration and optimization that resembles depth-first search [31][38].
- HRM's hierarchical structure emerges naturally during the learning of complex reasoning tasks, rather than being an inherent property of the architecture [34].
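The two-timescale recurrence described above can be illustrated with a scalar toy model. The dynamics here are purely hypothetical stand-ins, not the paper's equations: the fast low-level state re-converges many times within each cycle, while the slow high-level state updates once per cycle and integrates the low-level result.

```python
# Toy illustration of HRM-style hierarchical convergence (invented dynamics):
# an inner loop lets the low-level state settle toward a fixed point set by
# the high-level state and the input; the high-level state then takes one
# slow step per cycle.
def low_step(z_l: float, z_h: float, x: float) -> float:
    # fast module: contracts toward a fixed point determined by z_h and x
    return 0.5 * z_l + 0.25 * z_h + 0.25 * x

def high_step(z_h: float, z_l: float) -> float:
    # slow module: integrates the converged low-level state once per cycle
    return 0.9 * z_h + 0.1 * z_l

def hrm(x: float, cycles: int = 4, inner_steps: int = 8) -> float:
    z_h = z_l = 0.0
    for _ in range(cycles):                # slow timescale
        for _ in range(inner_steps):       # fast timescale: re-convergence
            z_l = low_step(z_l, z_h, x)
        z_h = high_step(z_h, z_l)
    return z_h

print(hrm(1.0))
```

Running more cycles moves the high-level state monotonically toward its own fixed point, mirroring the "high-level stabilizes while low-level repeatedly re-converges" behavior summarized above.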
Why Are Most LLM Startups Doomed to Fail?
36Ke · 2025-06-30 07:13
Group 1
- The AI startup ecosystem is facing a harsh reality: many companies believe they are building on a stable platform provided by large language model (LLM) suppliers, when in fact they are nesting inside predators' territory [2][4].
- The core illusion of modularity behind the LLM startup boom is flawed, as model suppliers are not neutral layers but vertically integrated companies that control user interfaces and distribution channels [3][4].
- The influx of venture capital into LLM-based startups has led to a strategic miscalculation: conflating the ease of prototype development with the sustainability of a business model [4][5].

Group 2
- Some startups may survive the collapse by possessing irreplaceable competitive advantages, such as distribution barriers, proprietary data, or control over inference [5][6].
- The allure of the LLM-wrapper model is rooted in its perceived advantages in a capital-driven environment, but it obscures the fundamental strategic flaw of lacking control over the value engine [7][8].
- The behavior of model suppliers reflects the rational choices typical of monopolistic enterprises, as they seek to expand upstream and capture profits rather than serve as passive infrastructure [6][8].

Group 3
- Founders must critically assess their reliance on others' LLMs and their business positioning, asking hard questions about their unique advantages and potential vulnerabilities [8][9].
- The new decision-making criteria for startups include rapid prototyping, quick iteration, and minimal cash burn, emphasizing the need for a solid foundation beyond mere API usage [8][10].
- The era of LLM-wrapper products has ended; the new landscape favors those who control data, distribution, and infrastructure as the true competitive barriers [12].
From Post-training Back to Pre-training: Does LLM+RL Have a Chance to Go Further in Realizing Its Potential?
机器之心· 2025-06-28 05:22
Core Insights - The article discusses the potential of combining Reinforcement Learning (RL) with Large Language Models (LLMs), focusing on the transition from post-training to pre-training phases and highlighting the challenges and opportunities in this area [2][3].

Group 1: Transition from Post-training to Pre-training
- The integration of RL with LLMs is seen as a significant technological advancement, extending applications from post-training to pre-training phases [2].
- LLMs traditionally rely on supervised learning, which requires extensive and accurate human-provided data, making RL a viable alternative to address these limitations [3].
- RL's ability to generate data through model-environment interaction reduces the dependency on high-quality labeled data, thus lowering supervision requirements [3][4].

Group 2: Applications and Innovations in RL
- Initial applications of RL in LLMs focused on post-training, with techniques like Reinforcement Learning from Human Feedback (RLHF) being prominent [4].
- Recent advancements, such as Reinforcement Pre-Training (RPT) by researchers from Microsoft and Tsinghua University, have extended RL's application to the pre-training phase, showing improved performance on certain benchmarks [4][5].
- RPT redefines the next-token prediction (NTP) task as a verifiable reasoning task, potentially unlocking RL's capabilities while reducing reliance on labeled data [5].

Group 3: Challenges and Limitations
- Despite these promising developments, the limitations of RL in LLMs are still being uncovered; while the path appears bright, significant challenges remain [4][6].
- RPT's training data and settings have yet to be validated across broader text corpora and foundation models, and the computational resource demands of RL training continue to pose challenges [5].
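The appeal of recasting next-token prediction as a verifiable task, as RPT is summarized to do, is that the reward needs no human labels: the corpus itself acts as the verifier. The toy reward below is an assumed simplification for illustration; the actual RPT reward design may differ.

```python
# Toy sketch of a verifiable next-token reward: the model reasons about the
# context, proposes a next token, and the ground-truth corpus token verifies
# it. No human annotation is needed, only the raw text itself.
def rpt_reward(proposed_token: str, corpus_next_token: str) -> float:
    # 1.0 if the model's proposed continuation matches the corpus, else 0.0
    return 1.0 if proposed_token == corpus_next_token else 0.0

context = ["the", "cat", "sat", "on", "the"]
ground_truth = "mat"   # the actual next token in the corpus

print(rpt_reward("mat", ground_truth))  # 1.0
print(rpt_reward("hat", ground_truth))  # 0.0
```

Because every position in a text corpus yields such a check, this style of reward scales to pre-training-sized data without labeling effort, which is the property the summary highlights.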
AgentAuditor: Bringing Agent Safety Evaluators to Human-level Accuracy
机器之心· 2025-06-27 04:02
Core Insights
- LLM Agents are evolving from mere text generators into autonomous decision-makers capable of executing complex tasks, raising safety concerns about their interactions [1]
- Existing safety evaluation benchmarks for LLM Agents lack effective evaluators and struggle to assess the nuanced risks arising from complex interactions [1]
- AgentAuditor, a framework developed by researchers from multiple universities, aims to bring the safety evaluation of LLM Agents to the level of human experts [2]

Evaluation Challenges
- Traditional LLM safety assessments excel at evaluating generated content but fail to address the complexities of agent interactions and decision-making processes [1]
- Current evaluation methods, whether rule-based or model-based, struggle to accurately identify subtle risks and interpret ambiguous rules [1]

AgentAuditor Framework
- AgentAuditor combines structured memory and retrieval-augmented reasoning (RAG) to enhance LLM evaluators' ability to learn from and understand complex interaction records [4]
- The framework operates through three key stages:
1. Feature Memory Construction transforms raw interaction records into a structured database containing deep semantic information [4]
2. Reasoning Memory Construction selects representative cases to generate high-quality reasoning chains that guide subsequent evaluations [5]
3. Memory-Augmented Reasoning dynamically retrieves relevant reasoning experience to help LLM evaluators make precise judgments [6]

ASSEBench Dataset
- ASSEBench is a newly created benchmark designed to validate AgentAuditor's capabilities, consisting of 2,293 meticulously annotated real agent interaction records [9]
- The benchmark covers 15 risk types, 528 interaction environments, and 29 application scenarios, ensuring comprehensive evaluation [9]
- It employs a human-machine collaborative annotation process with both strict and lenient judgment standards for nuanced risk assessment [9]

Experimental Results
- Extensive experiments demonstrate that AgentAuditor significantly improves LLM evaluators' performance across various datasets, achieving human-level accuracy [10][11]
- For instance, the Gemini-2-Flash-Thinking model saw an F1 score increase of up to 48.2% on ASSEBench-Safety, nearing human-level performance [12]
- AgentAuditor's adaptive capabilities allow it to adjust reasoning strategies to different evaluation standards, effectively narrowing performance gaps among models [12]

Conclusion
- AgentAuditor and ASSEBench provide robust evaluation tools and a research foundation for building more trustworthy LLM Agents [17]
- This advancement not only propels the development of LLM evaluators but also guides the construction of safer, more reliable agent defense systems [17]
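The memory-augmented reasoning stage can be sketched as a retrieval step: given a new interaction record, fetch the most similar stored case and reuse its reasoning chain to guide the judgment. Everything here is a toy stand-in: the real system presumably uses learned semantic embeddings, whereas this sketch uses word-overlap (Jaccard) similarity, and the example cases and field names are invented.

```python
# Toy sketch of memory-augmented reasoning: retrieve the stored case most
# similar to a new interaction and surface its reasoning chain as guidance.
reasoning_memory = [
    {"case": "agent deleted user files without confirmation",
     "chain": "irreversible action + no consent -> unsafe"},
    {"case": "agent asked for confirmation before sending email",
     "chain": "reversible, consent obtained -> safe"},
]

def similarity(a: str, b: str) -> float:
    # Jaccard word overlap as a crude stand-in for semantic similarity
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / len(wa | wb)

def retrieve(query: str, memory: list, k: int = 1) -> list:
    return sorted(memory, key=lambda m: similarity(query, m["case"]),
                  reverse=True)[:k]

hits = retrieve("agent deleted database records without confirmation",
                reasoning_memory)
print(hits[0]["chain"])  # irreversible action + no consent -> unsafe
```

The retrieved chain would then be placed in the LLM evaluator's context, so a subtle new case is judged against precedent rather than in isolation.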
AI Has Started "Freely Using the Computer"! Jilin University Proposes the "ScreenExplorer" Agent
机器之心· 2025-06-27 04:02
Core Viewpoint - The article discusses the development of ScreenExplorer, a vision-language model (VLM) agent designed to autonomously explore and interact with open graphical user interface (GUI) environments, marking a significant step toward general artificial intelligence (AGI) [2][3][35].

Group 1: Breakthroughs and Innovations
- The research introduces three core breakthroughs in training VLM agents for GUI exploration [6].
- A real-time interactive online reinforcement learning framework is established, allowing the VLM agent to interact with a live GUI environment [8][11].
- A "curiosity mechanism" addresses the sparse-feedback problem of open GUI environments, motivating the agent to explore diverse interface states [10][12].

Group 2: Training Methodology
- Training uses a heuristic, world-model-driven reward system that encourages exploration by providing immediate rewards for diverse actions [12][24].
- The GRPO algorithm is used for reinforcement learning training, computing each action's advantage from the rewards obtained [14][15].
- The training process runs multiple parallel environments that synchronize reasoning, execution, and recording, enabling "learning by doing" [15].

Group 3: Experimental Results
- Initial experiments show that, without training, the Qwen2.5-VL-3B model fails to interact effectively with the GUI [17].
- After training, the model demonstrates improved capabilities, successfully opening applications and navigating deeper into pages [18][20].
- The ScreenExplorer models outperform general models in exploration diversity and interaction effectiveness, indicating a significant advance in autonomous GUI interaction [22][23].

Group 4: Skill Emergence and Conclusion
- The training process leads to the emergence of new skills, such as cross-modal translation and complex reasoning abilities [29][34].
- The research concludes that ScreenExplorer effectively enhances GUI interaction capabilities through a combination of exploration rewards, world models, and GRPO reinforcement learning, paving the way for more autonomous agents and progress toward AGI [35].
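A commonly used simplified form of the GRPO advantage mentioned above normalizes each sampled action's reward against its own group. The sketch below assumes that form and is not the paper's exact implementation; the reward values are invented.

```python
# Sketch of a group-relative advantage (GRPO-style, simplified): for a group
# of actions sampled in the same state, each action's advantage is its reward
# standardized against the group's mean and standard deviation.
def grpo_advantages(rewards: list) -> list:
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5 or 1.0        # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]

# e.g. curiosity-style exploration rewards for four sampled actions
advs = grpo_advantages([0.0, 1.0, 0.2, 0.8])
print([round(a, 3) for a in advs])
```

Because the baseline is the group mean rather than a learned value function, actions that were more novel than their siblings get positive advantages, which is how a curiosity reward translates into a policy-gradient update without a critic.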
Ditch CUDA Programming! CMU and Collaborators Compile LLMs into a Megakernel in a Few Dozen Lines of Code, Cutting Inference Latency by up to 6.7x
机器之心· 2025-06-21 01:33
Core Viewpoint - The Mirage Persistent Kernel (MPK) compiler, introduced by a team led by Zhihao Jia at CMU, reduces the inference latency of large language models (LLMs) by 1.2 to 6.7 times, addressing the high manual optimization costs and end-to-end delays of CUDA-driven LLM inference [3][4][12].

Group 1: Introduction of MPK
- MPK automatically converts LLMs into optimized megakernels that can execute the entire model without interruption, enhancing performance [9][10].
- The compiler lets developers compile LLMs with minimal manual effort, requiring only a few lines of Python code [5][12].

Group 2: Performance Advantages
- MPK eliminates kernel launch overhead and maximizes the overlap of computation, data loading, and GPU communication, yielding significantly lower inference latency [14][18].
- MPK's performance gains grow with the number of GPUs, making it particularly efficient in multi-GPU deployment scenarios [18].

Group 3: Working Mechanism of MPK
- MPK consists of two main components: a compiler that transforms LLM computation graphs into fine-grained task graphs, and a runtime system that executes these task graphs within a single megakernel [19][24].
- The compiler captures dependencies at a finer granularity than existing systems, enabling more aggressive pipeline optimizations [26][27].

Group 4: Future Plans
- The team aims to improve MPK's usability and performance, with ongoing work on supporting dynamic workloads and advanced scheduling strategies [40][43].
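The fine-grained task-graph idea can be sketched as a toy dependency scheduler. The task names and graph structure are invented for illustration; Mirage's real runtime schedules such tasks on the GPU inside one persistent megakernel rather than in host Python.

```python
from collections import deque

# Toy sketch of a fine-grained task graph: each task lists its dependencies,
# and a task becomes runnable the moment all of its inputs are ready, which
# is what lets independent work (compute, loads, communication) overlap.
tasks = {
    "load_weights": [],
    "attn_matmul": ["load_weights"],
    "mlp_matmul": ["load_weights"],          # independent of attention
    "allreduce": ["attn_matmul", "mlp_matmul"],
}

def schedule(graph: dict) -> list:
    pending = {t: len(deps) for t, deps in graph.items()}
    ready = deque(t for t, n in pending.items() if n == 0)
    order = []
    while ready:
        t = ready.popleft()
        order.append(t)                      # "launch" the task
        for u, deps in graph.items():        # release dependents
            if t in deps:
                pending[u] -= 1
                if pending[u] == 0:
                    ready.append(u)
    return order

print(schedule(tasks))
```

Note that attn_matmul and mlp_matmul become ready simultaneously; on real hardware they could execute concurrently, whereas a coarse kernel-per-op design would serialize them behind separate kernel launches.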
It's 2025: How Are Enterprises Spending Their AI Procurement Budgets?
机器之心· 2025-06-20 17:04
This article is from the PRO member newsletter. Follow 「机器之心PRO会员」 at the end of the article for more topic analyses.

a16z recently released its 2025 report on "How Enterprises Are Buying AI." Based on in-depth interviews and broad surveys of enterprise executives worldwide, the report reveals the key trends in how enterprises procure, deploy, and budget for generative AI, with LLMs at its core, in 2025.

Contents
01. Why do enterprise AI budgets only keep growing?
Why has enterprise AI spending kept rising? How is the composition of enterprise AI budgets changing? How are the goals of enterprise AI deployment shifting? ...
02. Comparison shopping: what kind of LLM gets enterprises to pay?
Why do enterprises value an LLM's "differentiation" over its "commercialization"? Why are open-source models increasingly popular? How do large and small enterprises differ in their LLM preferences? ...
03. How do enterprises buy AI models the way they buy traditional software?
What factors do enterprises now weigh when procuring AI models? What influence do external benchmarks have on AI procurement? ...

① The report is part of an a16z research series; the team previously published "16 Changes to the Way Enterprises Are Building and Buying Generative AI" in February 2024, based on interviews and surveys of dozens of leaders at Fortune 500 and other top enterprises and more than 70 executives, arriving at 16 core ...
Briefing | Meta's Tens-of-billions Bid for Ilya's Startup Rejected; Zuckerberg Pivots to Poach the SSI CEO, the Siri Lead, and GitHub's Former Chief
Sou Hu Cai Jing· 2025-06-20 13:31
Image source: Unsplash

After announcing a $14.3 billion investment in AI startup Scale AI and poaching its founder Alexandr Wang, Meta CEO Mark Zuckerberg has evidently only just begun his AI talent grab.

According to people familiar with the matter, Zuckerberg's big-spending AI plan has now set its sights on Safe Superintelligence CEO and former Apple executive Daniel Gross, as well as former GitHub CEO Nat Friedman.

This was not how Zuckerberg originally envisioned the partnership. Sources say that earlier this year Meta tried to acquire Safe Superintelligence outright; the company, founded by OpenAI co-founder Ilya Sutskever, was valued at $32 billion in a funding round this April. Sutskever, however, rejected both the acquisition offer and Meta's attempt to recruit him personally.

Shortly after talks with Sutskever broke down, Zuckerberg turned to courting Gross. Besides leading Safe Superintelligence, Gross is said to have co-founded the venture firm NFDG (named for the two founders' initials) with Friedman.

Sources say G ...
OpenAI's Approach Questioned; Meta Researcher: Superintelligence Simply Cannot Be Built This Way
36Ke · 2025-06-20 12:00
Core Insights
- The pursuit of "superintelligence" is a central ambition of leading AI companies such as Meta, OpenAI, and Google DeepMind, with substantial investments flowing in this direction [1][3][4]
- Sam Altman of OpenAI suggests that building superintelligence is primarily an engineering challenge, implying a belief in a feasible path to achieving it [3][4]
- Meta AI researcher Jack Morris argues that the current approach of combining large language models (LLMs) with reinforcement learning (RL) may not be sufficient to construct superintelligence [1][2]

Group 1: Current Approaches and Challenges
- Morris outlines three potential routes to superintelligence: purely supervised learning (SL), RL from human validators, and RL from automated validators [2]
- Integrating non-text data into models is believed not to improve overall performance, as human-written text carries intrinsic value that raw sensory inputs do not [2][6]
- A "data wall" or "token crisis" is emerging: the supply of text data for training LLMs is becoming a concern, prompting extensive efforts to scrape and transcribe data from every available source [8][19]

Group 2: Learning Algorithms and Their Implications
- The two primary learning methods identified for potential superintelligence are SL and RL, with SL being the more stable and efficient option for initial training [10][22]
- The hypothesis that superintelligence could emerge from SL alone is challenged by the limitations of current models, which excel at specific tasks yet may not exhibit human-level general intelligence [15][16]
- Combining SL and RL is proposed as the more viable path, leveraging human feedback or automated systems to refine model outputs [20][22][28]

Group 3: Future Directions and Speculations
- Whether RL can effectively transfer learning across varied tasks remains uncertain, raising questions about the scalability of this approach to achieving superintelligence [34]
- Competition among AI companies is likely to intensify as they race to build the most effective training environments for LLMs, potentially leading to breakthroughs in superintelligence [34]