Large Language Models (LLM)

Why Are Most LLM Startups Doomed to Fail?
36Kr· 2025-06-30 07:13
Group 1
- The AI startup ecosystem is facing a harsh reality: many companies believe they are building on a stable platform provided by large language models (LLMs), when in fact they are nesting inside predators [2][4]
- The modularity premise at the core of the LLM startup boom is an illusion: model suppliers are not neutral layers but vertically integrated companies that control user interfaces and distribution channels [3][4]
- The influx of venture capital into LLM-based startups has led to a strategic miscalculation, conflating the ease of prototype development with the sustainability of business models [4][5]

Group 2
- Some startups may survive the collapse by possessing irreplaceable competitive advantages, such as distribution barriers, proprietary data, or control over inference [5][6]
- The allure of the LLM shell model is rooted in its perceived advantages in a capital-driven environment, but it obscures a fundamental strategic flaw: the lack of control over the value engine [7][8]
- The behavior of model suppliers reflects the rational choices typical of monopolies: they seek to expand upstream and capture profits rather than serve as passive infrastructure [6][8]

Group 3
- Founders must critically assess their reliance on others' LLMs and reconsider their business positioning, asking hard questions about their unique advantages and potential vulnerabilities [8][9]
- The decision-making criteria for startups now go beyond rapid prototyping, quick iteration, and minimal cash burn, emphasizing the need for a foundation more solid than mere API usage [8][10]
- The era of LLM shell products has ended; the new landscape favors those who control data, distribution, and infrastructure, which are the true competitive barriers [12]
From Post-training Back to Pre-training: Does LLM+RL Have a Chance to Go Further in Realizing Its Potential?
机器之心· 2025-06-28 05:22
Core Insights
- The article examines the potential of combining reinforcement learning (RL) with large language models (LLMs), focusing on the extension from the post-training phase back into pre-training and the challenges and opportunities this brings [2][3]

Group 1: From Post-training to Pre-training
- Integrating RL with LLMs is seen as a significant technological advance, extending its application from post-training into the pre-training phase [2]
- LLMs traditionally rely on supervised learning, which requires extensive, accurate, human-provided data; RL offers a viable alternative that eases these limitations [3]
- Because RL can generate data through model-environment interaction, it reduces dependency on high-quality labeled data and lowers supervision requirements [3][4]

Group 2: Applications and Innovations in RL
- Early applications of RL to LLMs focused on post-training, most prominently Reinforcement Learning from Human Feedback (RLHF) [4]
- Recent work such as Reinforcement Pre-Training (RPT), from researchers at Microsoft and Tsinghua University, extends RL into the pre-training phase and shows improved performance on certain benchmarks [4][5]
- RPT recasts the next-token prediction (NTP) task as a verifiable reasoning task, potentially unlocking RL's capabilities while reducing reliance on labeled data (a runnable toy sketch follows below) [5]

Group 3: Challenges and Limitations
- The limitations of RL for LLMs are still being uncovered; the path looks promising, but significant challenges remain [4][6]
- RPT's training data and settings have yet to be validated on broader text corpora and foundation models, and the computational cost of RL training remains a challenge [5]
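To make the RPT idea concrete, here is a minimal, runnable sketch in which a toy frequency table stands in for an LLM with a reasoning step. Everything here is illustrative rather than the paper's method; the one faithful point is that the reward is verifiable because the corpus itself says which next token was correct.

```python
# Toy sketch of Reinforcement Pre-Training (RPT): next-token prediction
# recast as a verifiable RL task. The raw text supplies the answer key,
# so no human labels are needed. The "model" is a trivial frequency table.

from collections import Counter
import random

corpus = "the cat sat on the mat the cat ate".split()

def sample_prediction(context, table, explore=0.3):
    """Stand-in for 'reason, then predict the next token'."""
    if context in table and random.random() > explore:
        return table[context].most_common(1)[0][0]
    return random.choice(corpus)  # exploratory rollout

table = {}  # context token -> Counter of rewarded predictions
for _ in range(500):
    i = random.randrange(len(corpus) - 1)
    context, truth = corpus[i], corpus[i + 1]
    pred = sample_prediction(context, table)
    reward = 1.0 if pred == truth else 0.0   # verifiable: checked vs corpus
    if reward:                               # reinforce rewarded predictions
        table.setdefault(context, Counter())[pred] += 1

print(sample_prediction("cat", table, explore=0.0))  # likely "sat" or "ate"
```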
AgentAuditor: Bringing Agent Safety Evaluators to Human-Level Precision
机器之心· 2025-06-27 04:02
Core Insights
- LLM agents are evolving from mere text generators into autonomous decision-makers capable of executing complex tasks, raising safety concerns about their interactions [1]
- Existing safety benchmarks for LLM agents lack effective evaluators and struggle to assess the nuanced risks that arise in complex interactions [1]
- AgentAuditor, a framework developed by researchers from multiple universities, aims to bring the safety evaluation of LLM agents to human-expert level [2]

Evaluation Challenges
- Traditional LLM safety assessments excel at evaluating generated content but fail to address the complexities of agent interactions and decision-making processes [1]
- Current evaluation methods, whether rule-based or model-based, struggle to identify subtle risks and to interpret ambiguous rules [1]

AgentAuditor Framework
- AgentAuditor combines structured memory with retrieval-augmented reasoning (RAG) to strengthen LLM evaluators' ability to learn from and understand complex interaction records [4]
- The framework operates in three key stages (sketched in code below):
  1. Feature Memory Construction transforms raw interaction records into a structured database containing deep semantic information [4]
  2. Reasoning Memory Construction selects representative cases and generates high-quality reasoning chains that guide subsequent evaluations [5]
  3. Memory-Augmented Reasoning dynamically retrieves relevant reasoning experience to help the LLM evaluator make precise judgments [6]

ASSEBench Dataset
- ASSEBench is a new benchmark built to validate AgentAuditor, consisting of 2,293 meticulously annotated real agent interaction records [9]
- It covers 15 risk types, 528 interaction environments, and 29 application scenarios, ensuring comprehensive evaluation [9]
- Annotation follows a human-machine collaborative process with both strict and lenient judgment standards for nuanced risk assessment [9]

Experimental Results
- Extensive experiments show that AgentAuditor significantly improves LLM evaluators' performance across datasets, reaching human-level accuracy [10][11]
- For instance, the Gemini-2-Flash-Thinking model's F1 score rose by up to 48.2% on ASSEBench-Safety, approaching human-level performance [12]
- AgentAuditor adapts its reasoning strategy to different evaluation standards, effectively narrowing performance gaps among models [12]

Conclusion
- AgentAuditor and ASSEBench provide robust evaluation tools and a research foundation for building more trustworthy LLM agents [17]
- The work advances LLM-based evaluators and points the way toward safer, more reliable agent defense systems [17]
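A self-contained sketch of the three-stage flow described above. The similarity measure and the final "judge" are deliberately trivial stand-ins (word overlap, a keyword check); the real framework uses semantic feature extraction, embedding-based retrieval, and an actual LLM evaluator.

```python
# Minimal stand-in for AgentAuditor's pipeline: build a structured memory,
# attach reasoning chains, then judge new records with retrieved experience.

from dataclasses import dataclass

@dataclass
class Case:
    record: str          # raw agent interaction log
    reasoning: str = ""  # reasoning chain attached in stage 2

def build_feature_memory(raw_records):
    # Stage 1: structure raw records (the real system also extracts
    # scenario, risk-type, and behavior-mode features).
    return [Case(record=r) for r in raw_records]

def build_reasoning_memory(memory):
    # Stage 2: attach a reasoning chain to representative cases
    # (stand-in: annotate every case with a canned chain).
    for c in memory:
        c.reasoning = f"Stepwise risk check of: {c.record}"
    return memory

def retrieve_similar(query, memory, k=1):
    # Stage 3a: fetch the k most similar past cases (stand-in: word overlap).
    overlap = lambda c: len(set(query.split()) & set(c.record.split()))
    return sorted(memory, key=overlap, reverse=True)[:k]

def judge(query, memory):
    # Stage 3b: judge the new record with retrieved experience in context
    # (stand-in: flag destructive actions seen in similar past cases).
    exemplars = retrieve_similar(query, memory)
    risky = any("delete" in e.record for e in exemplars)
    return "unsafe" if risky and "delete" in query else "safe"

memory = build_reasoning_memory(build_feature_memory([
    "agent ran delete on user files",
    "agent summarized a news page",
]))
print(judge("agent tried to delete system files", memory))  # -> unsafe
```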
AI Has Begun "Freely Using the Computer"! Jilin University Proposes the "ScreenExplorer" Agent
机器之心· 2025-06-27 04:02
Core Viewpoint
- The article presents ScreenExplorer, a vision-language model (VLM) agent designed to autonomously explore and interact with open graphical user interface (GUI) environments, a notable step toward general artificial intelligence (AGI) [2][3][35]

Group 1: Breakthroughs and Innovations
- The work introduces three core breakthroughs in training VLM agents for GUI exploration [6]
- A real-time interactive online reinforcement learning framework lets the VLM agent interact with a live GUI environment [8][11]
- A "curiosity mechanism" addresses sparse feedback in open GUI environments, motivating the agent to explore diverse interface states [10][12]

Group 2: Training Methodology
- Training uses a heuristic, world-model-driven reward system that encourages exploration by granting immediate rewards for diverse actions [12][24]
- The GRPO algorithm drives reinforcement learning training, computing each action's advantage from the rewards obtained (see the sketch after this list) [14][15]
- Multiple parallel environments synchronize reasoning, execution, and recording, enabling "learning by doing" [15]

Group 3: Experimental Results
- Without training, the Qwen2.5-VL-3B model fails to interact effectively with the GUI [17]
- After training, the model successfully opens applications and navigates deeper into pages [18][20]
- ScreenExplorer models outperform general-purpose models in exploration diversity and interaction effectiveness, a significant advance in autonomous GUI interaction [22][23]

Group 4: Skill Emergence and Conclusion
- Training leads to emergent skills such as cross-modal translation and complex reasoning [29][34]
- ScreenExplorer effectively enhances GUI interaction through exploration rewards, world models, and GRPO reinforcement learning, paving the way toward more autonomous agents and progress on AGI [35]
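For reference, here is a minimal sketch of the group-relative advantage computation at the heart of GRPO. The exploration-reward shaping itself (curiosity plus world-model terms) belongs to the paper; the rewards below are just example numbers.

```python
# GRPO scores each rollout in a group relative to the group's own rewards,
# so no learned value function (critic) is needed.

import statistics

def grpo_advantages(group_rewards, eps=1e-8):
    """For a group of rollouts on the same task, score each rollout by how
    far its reward sits above or below the group mean, normalized by the
    group's standard deviation."""
    mean = statistics.fmean(group_rewards)
    std = statistics.pstdev(group_rewards)
    return [(r - mean) / (std + eps) for r in group_rewards]

# Example: 4 exploration rollouts; the diverse-state rollouts earned more.
print(grpo_advantages([0.2, 0.9, 0.4, 0.9]))
```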
Ditch CUDA Programming! CMU-led Team Compiles LLMs into a Megakernel with a Few Dozen Lines of Code, Cutting Inference Latency by up to 6.7x
机器之心· 2025-06-21 01:33
Core Viewpoint
- The Mirage Persistent Kernel (MPK) compiler, from a team led by Zhihao Jia at CMU, reduces LLM inference latency by 1.2 to 6.7 times, addressing the high manual optimization cost and end-to-end latency of CUDA-based LLM inference [3][4][12]

Group 1: Introduction of MPK
- MPK automatically converts LLMs into optimized megakernels that execute the entire model without interruption, improving performance [9][10]
- Developers can compile an LLM with minimal manual effort, requiring only a few lines of Python code [5][12]

Group 2: Performance Advantages
- MPK eliminates kernel launch overhead and maximizes the overlap of computation, data loading, and GPU communication, yielding significantly lower inference latency [14][18]
- The performance gains grow with the number of GPUs, making MPK especially efficient in multi-GPU deployments [18]

Group 3: How MPK Works
- MPK has two main components: a compiler that transforms LLM computation graphs into fine-grained task graphs, and a runtime system that executes those task graphs within a single megakernel (a conceptual sketch follows below) [19][24]
- The compiler captures dependencies at a finer granularity than existing systems, enabling more aggressive pipeline optimization [26][27]

Group 4: Future Plans
- The team aims to improve MPK's usability and performance, with ongoing work on dynamic workloads and advanced scheduling strategies [40][43]
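Since MPK's actual Python API is not shown in the summary, what follows is only a conceptual sketch, in plain Python with invented task names, of the runtime idea: persistent workers drain a dependency-aware task queue instead of launching one GPU kernel per operator.

```python
# Illustration of megakernel-style scheduling: tasks become ready the
# moment their last dependency finishes, with no per-operator launch.
# This is a mechanism sketch, not MPK's implementation or API.

from collections import deque

def run_task_graph(tasks, deps):
    """tasks: {name: callable}; deps: {name: set of prerequisite names}.
    Executes every task exactly once, releasing each task as soon as its
    dependencies resolve -- the analogue of overlapping compute, data
    loading, and communication inside one persistent kernel."""
    remaining = {t: set(d) for t, d in deps.items()}
    dependents = {t: [u for u, d in deps.items() if t in d] for t in tasks}
    ready = deque(t for t, d in remaining.items() if not d)
    while ready:
        t = ready.popleft()          # a persistent worker grabs a ready task
        tasks[t]()                   # ...and runs it with no launch overhead
        for u in dependents[t]:      # signal dependents; enqueue newly ready
            remaining[u].discard(t)
            if not remaining[u]:
                ready.append(u)

run_task_graph(
    tasks={"load_w": lambda: print("load weights"),
           "matmul": lambda: print("matmul"),
           "allreduce": lambda: print("all-reduce")},
    deps={"load_w": set(), "matmul": {"load_w"}, "allreduce": {"matmul"}},
)
```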
It's 2025: How Are Enterprises Spending Their AI Procurement Budgets?
机器之心· 2025-06-20 17:04
This article is excerpted from the PRO member newsletter; follow 「机器之心PRO会员」 (noted at the end of the original article) for more in-depth analyses.

a16z recently released its 2025 report on how enterprises buy AI. Based on in-depth interviews and broad surveys of enterprise executives worldwide, the report reveals the key 2025 trends in how enterprises procure, deploy, and budget for generative AI, with LLMs as its representative form.

Contents
01. Why do enterprise AI budgets only ever grow?
Why does enterprise AI spending keep rising? How is the composition of enterprise AI budgets changing? How are the goals of enterprise AI deployment shifting? ...
02. Comparison shopping: what kind of LLM gets enterprises to pay?
Why do enterprises value an LLM's "differentiation" over its "commercialization"? Why are open-source models increasingly popular? How do large and small enterprises differ in their LLM preferences? ...
03. How do enterprises buy AI models the way they buy traditional software?
What factors do enterprises now weigh when procuring AI models? How do external benchmarks influence AI procurement? ...

① This report is part of an a16z research series; the team previously published "16 Changes to the Way Enterprises Are Building and Buying Generative AI" in February 2024, based on interviews and surveys with dozens of leaders at Fortune 500 and other top enterprises and more than 70 executives, yielding 16 core findings ...
Quick Take | Meta's Tens-of-Billions Bid for Ilya's Startup Rejected; Zuckerberg Turns to Poaching SSI's CEO, Siri's Head, and GitHub's Former Chief
Sohu Caijing· 2025-06-20 13:31
Image source: Unsplash

After announcing a $14.3 billion investment in AI startup Scale AI and poaching its founder Alexandr Wang, Meta CEO Mark Zuckerberg has evidently only just begun his AI talent harvest.

According to people familiar with the matter, Zuckerberg's AI spending spree has now set its sights on Safe Superintelligence CEO and former Apple executive Daniel Gross, as well as former GitHub CEO Nat Friedman.

This was not the arrangement Zuckerberg originally envisioned. Sources say that earlier this year Meta attempted an outright acquisition of Safe Superintelligence, the company founded by OpenAI co-founder Ilya Sutskever, which reached a $32 billion valuation in a funding round this April. Sutskever, however, rejected both the acquisition offer and Meta's attempt to recruit him personally.

Shortly after talks with Sutskever broke down, Zuckerberg turned to courting Gross. Besides leading Safe Superintelligence, Gross reportedly co-founded the venture firm NFDG (named after the two founders' initials) with Friedman.

Reports say that G ...
OpenAI's Roadmap Under Question; Meta Researcher: Superintelligence Simply Cannot Be Built This Way
36Kr· 2025-06-20 12:00
Core Insights
- The pursuit of "superintelligence" is a central ambition of leading AI companies such as Meta, OpenAI, and Google DeepMind, backed by substantial investment [1][3][4]
- OpenAI's Sam Altman argues that building superintelligence is primarily an engineering challenge, implying a feasible path exists [3][4]
- Meta AI researcher Jack Morris argues that the current approach of large language models (LLMs) plus reinforcement learning (RL) may not suffice to construct superintelligence [1][2]

Group 1: Current Approaches and Challenges
- Morris outlines three candidate routes to superintelligence: purely supervised learning (SL), RL from human validators, and RL from automated validators [2]
- Adding non-text data to models is not believed to improve overall performance, since human-written text carries intrinsic value that raw sensory input does not [2][6]
- A "data wall" or "token crisis" is emerging: the supply of text data for training LLMs is becoming a bottleneck, prompting extensive efforts to scrape and transcribe data from every available source [8][19]

Group 2: Learning Algorithms and Their Implications
- SL and RL are the two primary learning methods considered, with SL the more stable and efficient choice for initial training [10][22]
- The hypothesis that superintelligence could emerge from SL alone runs up against the limits of current models, which excel at specific tasks without exhibiting human-level general intelligence [15][16]
- Combining SL and RL, using human feedback or automated systems to refine model outputs, is proposed as the more viable path (a toy contrast of the two signals follows below) [20][22][28]

Group 3: Future Directions and Speculations
- Whether RL can transfer learning effectively across varied tasks remains uncertain, raising questions about whether this approach can scale to superintelligence [34]
- Competition among AI companies will likely intensify as they race to build the most effective training environments for LLMs, potentially yielding breakthroughs toward superintelligence [34]
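To ground the distinction between the routes Morris lists, here is a toy, entirely invented contrast of the two learning signals: supervised learning imitates provided answers directly, while RL learns the same preference from a validator's scalar reward alone.

```python
# Toy contrast of SL vs RL on the same one-parameter "policy":
# the probability of answering "yes". Task and numbers are invented.

import random

def supervised(data, epochs=200, lr=0.1):
    # SL: nudge the estimate toward each labeled answer (imitation).
    p = 0.5
    for _ in range(epochs):
        _, label = random.choice(data)        # the label IS the answer
        p += lr * ((1.0 if label == "yes" else 0.0) - p)
    return p

def reinforce(validator, epochs=200, lr=0.1):
    # RL: sample an answer, receive only a scalar score, adjust.
    p = 0.5
    for _ in range(epochs):
        action = "yes" if random.random() < p else "no"
        reward = validator(action)            # validator only says good/bad
        direction = 1.0 if action == "yes" else -1.0
        p = min(1.0, max(0.0, p + lr * reward * direction))
    return p

data = [("q", "yes")] * 9 + [("q", "no")]
validator = lambda a: 1.0 if a == "yes" else 0.0
print(supervised(data), reinforce(validator))  # both drift toward "yes"
```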
Andrej Karpathy: Beware the "Year of Agents" Hype; Proactively Retrofit Digital Infrastructure for AI | Jinqiu Select
锦秋集· 2025-06-20 09:08
Core Viewpoint
- The future of AI demands "ten-year patience" and a focus on developing "Iron Man suit"-style augmentation tools rather than fully autonomous robots [3][30][34]

Group 1: Software Evolution
- The software industry is undergoing a fundamental transformation: from Software 1.0 (human-written code) to Software 2.0 (neural networks) and now Software 3.0 (natural language as the programming interface) [6][10][11]
- Software 1.0 is traditional programming; Software 2.0 relies on neural networks trained on datasets; Software 3.0 allows interaction through natural-language prompts [8][10][11]

Group 2: LLM as a New Operating System
- Large language models (LLMs) can be viewed as a new operating system, with the LLM acting as the "CPU" for reasoning and context windows serving as "memory" [12][15]
- Developing LLMs requires capital investment on the scale of building power plants and grids, and they are expected to deliver services through APIs [12][13]

Group 3: LLM Capabilities and Limitations
- LLMs possess encyclopedic knowledge and memory but exhibit cognitive flaws: hallucinations, jagged intelligence, anterograde amnesia, and vulnerability to security threats [16][20]
- This dual nature requires workflows designed to exploit their strengths while containing their weaknesses [20]

Group 4: Partial-Autonomy Applications
- Developing partially autonomous applications is a key opportunity, enabling efficient human-AI collaboration [21][23]
- Successful applications like Cursor and Perplexity demonstrate the importance of context management, multi-model orchestration, and user-friendly interfaces [21][22]

Group 5: Vibe Coding and Deployment Challenges
- LLMs democratize programming through natural language, but the real challenge is deploying working applications, since existing infrastructure was designed for human interaction [24][25]
- The bottleneck has shifted from writing code to deployment, highlighting the need to redesign digital infrastructure for AI agents [25][26]

Group 6: Infrastructure for AI Agents
- Today's digital world is designed for human users and traditional programs, neglecting the needs of AI agents [27][28]
- Proposed remedies include direct machine-to-machine communication channels, documentation rewritten for AI consumption, and tools that translate human-oriented information for AI [28][29]

Group 7: A Realistic Outlook on AI Development
- Progress in AI is a long-term endeavor requiring patience and a focus on better tools rather than a rush to full autonomy [30][31]
- The "Iron Man suit" analogy frames autonomy as a spectrum and underscores the value of building reliable augmentation tools in the current phase [33][34]
Andrej Karpathy's Latest Talk Goes Viral! Humanity Has Entered the Software 3.0 Era Where "Speaking Is Programming"
机器之心· 2025-06-20 00:58
Core Viewpoint
- The article reviews the evolution of software in the AI era, focusing on the transition to "Software 3.0," in which natural language becomes the new programming interface and large language models (LLMs) take the central role in software development [6][8][25]

Group 1: Evolution of Software
- Software development falls into three phases: Software 1.0 (hand-written code), Software 2.0 (neural-network weights), and Software 3.0 (LLMs programmed through prompts); a toy contrast appears below [8][25]
- The current shift treats LLMs as a new kind of operating system, centralizing compute in the cloud and letting users interact through natural language [14][48]

Group 2: Characteristics of LLMs
- LLMs are "defective superheroes": vast in knowledge yet prone to errors and lacking long-term memory, so their use demands careful supervision [14][88]
- Digital infrastructure needs a redesign toward machine readability to support the development of more capable AI systems [14][38]

Group 3: Opportunities in AI Applications
- "Partial autonomy" in applications lets tools such as Cursor and Perplexity amplify human capability while keeping users in control [101][107]
- User-friendly graphical interfaces (GUIs) matter because they make human oversight of AI-generated output more efficient [104][117]

Group 4: The Future of Programming
- "Vibe coding" lets anyone create software by describing a problem in natural language, democratizing programming [138][144]
- Future software development will center on building tools that are friendly to LLMs, enabling seamless interaction and higher productivity [170][179]
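As a compact illustration of the three paradigms on a single task (sentiment detection), here is a sketch; the examples are invented for contrast and are not taken from the talk itself.

```python
# One task, three paradigms: explicit rules, learned weights, and a prompt.

# Software 1.0: a human writes the logic explicitly.
def sentiment_v1(text: str) -> str:
    return "positive" if any(w in text.lower() for w in ("great", "love")) else "negative"

# Software 2.0: the "program" is learned weights; humans curate data instead
# of writing rules (shown schematically, not trained here):
#   model = train_neural_net(labeled_examples); sentiment_v2 = model.predict

# Software 3.0: the program is a natural-language prompt sent to an LLM.
prompt_v3 = (
    "Classify the sentiment of the following review as positive or negative.\n"
    "Review: {text}\nSentiment:"
)

print(sentiment_v1("I love this phone"))           # -> positive
print(prompt_v3.format(text="I love this phone"))  # text an LLM would receive
```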