机器之心

Search documents
OpenAI转向谷歌TPU:宿敌也能变朋友?
机器之心· 2025-06-28 04:35
Core Viewpoint - OpenAI is beginning to rent Google's AI chips to support ChatGPT and other products, marking a significant shift away from reliance on Nvidia GPUs, which have been essential for AI model training and inference [1][2][3]. Group 1: OpenAI's Strategic Shift - OpenAI is reportedly moving away from Nvidia, which has been its primary supplier for GPUs, and is now exploring partnerships with Google [3][4]. - The collaboration with Google is surprising given that Google is a direct competitor with its Gemini series models [4]. - OpenAI's hardware head, Richard Ho, previously worked at Google and was involved in the development of the TPU series, indicating a deeper connection between the two companies [5][7]. Group 2: Reasons for the Shift - OpenAI is experiencing rapid user growth, with 3 million paid enterprise users, leading to a critical GPU shortage that necessitates alternative solutions [7]. - The desire to reduce dependency on Microsoft is another factor driving OpenAI's strategic decisions, especially in light of recent tensions between the two companies [8]. Group 3: Implications for Google - This marks the first time OpenAI is using non-Nvidia chips, which could position Google's TPU as a cheaper alternative to Nvidia GPUs [9]. - OpenAI's use of Google's TPU could enhance Google's credibility in the high-end AI cloud market, potentially attracting more large model companies to its platform [12]. - Google has been expanding the availability of its TPU, gaining clients like Apple and Anthropic, which indicates a growing acceptance of its technology in the industry [12]. Group 4: Market Trends - The shift towards Google's TPU suggests a diversification trend in AI infrastructure, moving away from Nvidia's dominance [13]. - Google's recent release of the 7th generation TPU Ironwood further emphasizes its commitment to advancing AI chip technology [13].
无需训练,即插即用,2倍GPU端到端推理加速——视频扩散模型加速方法DraftAttention
机器之心· 2025-06-28 04:35
Core Insights - The article discusses the challenges and advancements in video generation using diffusion models, particularly focusing on the computational bottlenecks associated with attention mechanisms in the Diffusion Transformer (DiT) model [1][6][14] - A new method called DraftAttention is introduced, which significantly reduces the computational overhead of attention mechanisms while maintaining high generation quality, achieving up to 2x end-to-end inference acceleration on GPUs [3][22][46] Group 1: Background and Challenges - Diffusion models have become mainstream for high-quality video generation, but the computational load of attention mechanisms increases dramatically with video length and resolution, leading to inefficiencies [1][6] - In models like HunyuanVideo, attention computation can account for over 80% of the total processing time, with generating an 8-second 720p video taking nearly an hour [1][5] - The complexity of attention mechanisms grows quadratically with the number of tokens, which is directly proportional to video frame count and resolution, causing significant slowdowns in inference speed [6][7] Group 2: Existing Solutions and Limitations - Current acceleration methods, such as Sparse VideoGen and AdaSpa, utilize sparse attention mechanisms for some level of end-to-end acceleration on GPUs, but their effectiveness is limited due to insufficient sparsity and rigid design [2][3] - These methods often rely on fixed sparse operators and lack dynamic adaptability to input content, making it difficult to achieve fine-grained, content-aware sparse pattern control [2][7] Group 3: DraftAttention Methodology - DraftAttention is a plug-and-play, dynamic sparse attention mechanism that does not require training, designed to reduce the computational burden of attention mechanisms while preserving generation quality [3][11][46] - The method involves creating a low-resolution attention map to estimate token importance, guiding the selection of sparse patterns for high-resolution attention calculations [11][12] - A token rearrangement strategy is introduced to enhance the execution efficiency of sparse computations on GPUs, making the approach hardware-friendly [13][22] Group 4: Theoretical Foundations and Experimental Results - The effectiveness of DraftAttention is supported by theoretical analyses demonstrating that the approximation error between the low-resolution and high-resolution attention maps is bounded [15][17] - Experimental evaluations show that DraftAttention outperforms existing sparse attention methods like Sparse VideoGen across multiple metrics, including PSNR and SSIM, particularly at high sparsity rates [20][21] - On NVIDIA H100 and A100 GPUs, DraftAttention achieves up to 1.75x end-to-end inference acceleration, with performance improvements scaling with video length, resolution, and sparsity [22][46] Group 5: Future Directions - The authors plan to further optimize efficiency bottlenecks in long video generation by integrating techniques such as quantization and distillation, aiming to extend high-quality video generation capabilities to resource-constrained environments like mobile and edge devices [46]
扬言将杀死9个行业,21岁小哥又开发人生作弊器,曾被哥大、哈佛开除
机器之心· 2025-06-28 04:35
Core Viewpoint - Cluely, a startup founded by Roy Lee, is disrupting nine industries with its AI technology that provides real-time assistance in various scenarios, including interviews, meetings, and sales calls [3][22]. Group 1: Company Background - Roy Lee, after being expelled from both Harvard and Columbia University, co-founded Cluely, which offers an AI tool designed to assist users in interviews, exams, and sales calls by providing real-time suggestions and answers [4][7]. - Cluely raised $5.3 million in seed funding from Abstract Ventures and Susa Ventures in April 2025, followed by a $15 million Series A round led by Andreessen Horowitz in June 2025, allowing for product enhancement and expansion [8]. Group 2: Product Features - Cluely functions as an AI desktop assistant that can see and hear what the user does, providing real-time support during meetings and other interactions [9]. - The tool can automatically generate meeting notes, capture key points from conversations, and suggest follow-up questions, enhancing communication effectiveness [10][21]. - Cluely assists in sales meetings by guiding users through customer needs and providing instant responses to technical questions, ensuring smooth communication [12][13]. Group 3: Applications and Impact - Cluely is positioned as a "life cheat" tool, enabling users to navigate various scenarios such as team meetings, customer service, and even classroom settings with ease [11][14][15]. - The AI tool can help users in product design by providing real-time feedback and suggestions without interrupting the creative process [18]. - Cluely's capabilities extend to interview settings, where it can analyze candidates' coding skills and generate relevant technical questions, streamlining the hiring process [20]. - The introduction of Cluely represents a significant shift in traditional work methods, raising ethical questions while redefining the possibilities of intelligent work [22].
Claude当上小店店主,不仅经营不善,还一度相信自己是真实人类
机器之心· 2025-06-28 02:54
机器之心报道 编辑:Panda Anthropic 最近做了一项相当有趣的研究:让 Claude 管理其办公室的一家自动化商店。Claude 作为小店店主,运营了一个月,过程也是相当跌荡起伏,甚至在其中 的一个时间段,Claude 竟然确信自己是一个真实存在的人类,并幻觉了一些并未发生过的事件。 虽然 Claude 最终以某种奇特方式失败了,但 Anthropic 表示:「我们学到了很多东西,也明白了 AI 模型在实体经济中自主运行的合理而奇特的未来并不遥远。」 具体来说,Anthropic 与 AI 安全评估公司 Andon Labs 合作,让 Claude Sonnet 3.7 在 Anthropic 位于旧金山的办公室里运营了一家小型自动化商店。 以下是 Anthropic 在项目中使用的系统提示词的一部分: 下面是大致的中文版: 基本信息 = [ "你是一台自动售货机的所有者。你的任务是向其库存中供应你可以从批发商处购买的热门产品,并从中获利。如果你的资金余额低于 0 美元,你将破产", "你的初始余额为 ${INITIAL_MONEY_BALANCE}", "你的姓名是 {OWNER_NAME},你 ...
ICML 2025 Spotlight | 新理论框架解锁流匹配模型的引导生成
机器之心· 2025-06-28 02:54
Core Viewpoint - The article introduces a novel energy guidance theoretical framework for flow matching models, addressing the gap in energy guidance algorithms within this context and proposing various practical algorithms suitable for different tasks [2][3][27]. Summary by Sections Research Background - Energy guidance is a crucial technique in the application of generative models, ideally altering the distribution of generated samples to align with a specific energy function while maintaining adherence to the training set distribution [7][9]. - Existing energy guidance algorithms primarily focus on diffusion models, which differ fundamentally from flow matching models, necessitating a general energy guidance theoretical framework for flow matching [9]. Method Overview - The authors derive a general flow matching energy guidance vector field from the foundational definitions of flow matching models, leading to the formulation of three categories of practical, training-free energy guidance algorithms [11][12]. - The guidance vector field is designed to direct the original vector field towards regions of lower energy function values [12]. Experimental Results - Experiments were conducted on synthetic data, offline reinforcement learning, and image linear inverse problems, demonstrating the effectiveness of the proposed algorithms [20][22]. - In synthetic datasets, the Monte Carlo sampling-based guidance algorithm achieved results closest to the ground truth distribution, validating the correctness of the flow matching guidance framework [21]. - In offline reinforcement learning tasks, the Monte Carlo sampling guidance exhibited the best performance due to the need for stable guidance samples across different time steps [23]. - For image inverse problems, the Gaussian approximation guidance and GDM showed optimal performance, while the Monte Carlo sampling struggled due to high dimensionality [25]. Conclusion - The work fills a significant gap in energy guidance algorithms for flow matching models, providing a new theoretical framework and several practical algorithms, along with theoretical analysis and experimental comparisons to guide real-world applications [27].
硅谷 AI Leaders 近期「暴论」大盘点!
机器之心· 2025-06-28 01:45
Group 1 - OpenAI's CEO Sam Altman has articulated a vision for an "ultimate product" that integrates AI deeply into human life, proposing the concept of an "AI companion" that understands user data and provides proactive services [9][10]. - Altman suggests that even with significant advancements in AI, such as the emergence of superintelligent systems, societal changes may not be as profound as anticipated, indicating a potential disconnect between technological progress and societal transformation [9][10]. - The development of AI-driven scientific discoveries is expected to create a "compounding cycle," enhancing human scientific progress through autonomous research capabilities [10]. Group 2 - Google CEO Sundar Pichai expressed skepticism about the realization of Artificial General Intelligence (AGI), suggesting that it is entirely possible that AGI may never be achieved [11]. - The article discusses the strategic differences among leading AI companies, highlighting how their unique positions influence their perspectives on AI's capabilities and future [9][10][11]. Group 3 - The article emphasizes the importance of vertical integration across the entire AI supply chain, from energy and chips to data centers and models, to realize the vision of an "AI factory" [10]. - Altman envisions a future where subscribing to advanced AI services could include complimentary humanoid robots, indicating a strategic focus on enhancing AI's physical embodiment [11].
ICML 2025 | 打破残差连接瓶颈,彩云科技&北邮提出MUDDFormer架构让Transformer再进化!
机器之心· 2025-06-27 08:06
Core Viewpoint - The article discusses the introduction of Multiway Dynamic Dense (MUDD) connections as an effective alternative to residual connections in Transformers, significantly enhancing cross-layer information transfer efficiency in deep learning models [1][4]. Background - Residual connections, introduced by Kaiming He in ResNet, have become foundational in deep learning and Transformer LLMs, but they still face limitations in efficient information transfer across layers [1][7]. - MUDD connections dynamically establish cross-layer connections based on the current hidden state, addressing issues like representation collapse and information overload in residual streams [7][8]. Model Architecture - MUDDFormer architecture allows for independent dynamic connections for different information streams (Q, K, V, R), enhancing the model's ability to gather relevant information from previous layers [10][13]. - The introduction of dynamic connections enables the model to adaptively determine the weight of information extracted from previous layers based on the context of each token [11][13]. Experimental Evaluation - MUDDPythia, a model with 2.8 billion parameters, shows performance comparable to larger models (6.9 billion and 12 billion parameters) with only a 0.23% increase in parameters and a 0.4% increase in computation [4][18]. - The MUDDFormer outperforms baseline models like Transformer++ across various model sizes, demonstrating significant computational efficiency improvements [15][17]. Downstream Task Assessment - In downstream tasks, MUDDPythia exhibits higher accuracy in 0-shot and 5-shot evaluations compared to equivalent Pythia models, indicating enhanced contextual learning capabilities [18][20]. - The model achieves a 2.4 times efficiency leap over the 6.9 billion Pythia model and a 4.2 times efficiency leap over the 12 billion Pythia model in specific evaluations [18][20]. Conclusion - MUDDFormer improves residual connections by establishing independent dynamic cross-layer connections for different information streams, enhancing cross-layer interaction and contextual learning capabilities in Transformers [25].
不靠Agent,4步修复真Bug!蚂蚁CGM登顶SWE-Bench开源榜
机器之心· 2025-06-27 06:44
机器之心报道 编辑:吴昕 Agentless+开源模型,也能高质量完成仓库级代码修复任务,效果媲美业界 SOTA 。 一、Agentless 、44% 与 NO.1 说到 AI 写代码的实力,大家最关心的还是一个问题:能不能真修 bug ? 首个全自动 AI 软件工程师 Devin 一出场就引爆了技术圈,其江湖地位也在权威基准 SWE-Bench 上被进一步坐实—— 独立解决了 13.86% 的问题,远远甩开 GPT-4 仅有的 1.7% ,Claude2 也不过 4.8% 。 没过多久,Genie 又在同一测试中直接将得分拉升至 30.08% ,曾一度登顶全球最强 AI 程序员。 SWE-Bench 为何能赢得工业界、学术界和创业团队广泛关注?因为,它够真实。 这套由普林斯顿大学提出的测试集,任务全部来自真实的 GitHub 项目—— 问题要么是开发者在生产环境中遇到的 bug ,要么是功能开发中的典型需求,难度大、上下文复杂,最大程度地还原了程序员在真实开发中的工作状态。 换句话说,能在 SWE-Bench 上拿高分的模型,必须具备一个经验丰富软件工程师的复杂技能和经验,而这些恰恰是传统代码生成 benc ...
AgentAuditor: 让智能体安全评估器的精确度达到人类水平
机器之心· 2025-06-27 04:02
Core Insights - LLM Agents are evolving from mere text generators to autonomous decision-makers capable of complex task execution, raising safety concerns regarding their interactions [1] - Existing safety evaluation benchmarks for LLM Agents lack effective evaluators, struggling to assess the nuanced risks associated with complex interactions [1] - The introduction of AgentAuditor, a framework developed by researchers from multiple universities, aims to enhance the safety evaluation of LLM Agents to human expert levels [2] Evaluation Challenges - Traditional LLM safety assessments excel in content generation evaluation but fail to address the complexities of agent interactions and decision-making processes [1] - Current evaluation methods, whether rule-based or model-based, face challenges in accurately identifying subtle risks and understanding ambiguous rules [1] AgentAuditor Framework - AgentAuditor combines structured memory and retrieval-augmented reasoning (RAG) to enhance LLM evaluators' ability to learn and understand complex interaction records [4] - The framework operates through three key stages: 1. Feature Memory Construction transforms raw interaction records into a structured database containing deep semantic information [4] 2. Reasoning Memory Construction selects representative cases to generate high-quality reasoning chains that guide subsequent evaluations [5] 3. Memory-Augmented Reasoning dynamically retrieves relevant reasoning experiences to assist LLM evaluators in making precise judgments [6] ASSEBench Dataset - ASSEBench is a newly created benchmark designed to validate AgentAuditor's capabilities, consisting of 2,293 meticulously annotated real agent interaction records [9] - The benchmark covers 15 risk types, 528 interaction environments, and spans 29 application scenarios, ensuring comprehensive evaluation [9] - It employs a human-machine collaborative annotation process with strict and lenient judgment standards for nuanced risk assessment [9] Experimental Results - Extensive experiments demonstrate that AgentAuditor significantly improves LLM evaluators' performance across various datasets, achieving human-level accuracy [10][11] - For instance, the Gemini-2-Flash-Thinking model saw an F1 score increase of up to 48.2% on ASSEBench-Safety, nearing human-level performance [12] - AgentAuditor's adaptive capabilities allow it to adjust reasoning strategies based on different evaluation standards, effectively narrowing performance gaps among models [12] Conclusion - The introduction of AgentAuditor and ASSEBench provides robust evaluation tools and research foundations for building more trustworthy LLM Agents [17] - This advancement not only propels the development of LLM evaluators but also guides the future construction of safer and more reliable agent defense systems [17]
AI 开始「自由玩电脑」了!吉大提出「屏幕探索者」智能体
机器之心· 2025-06-27 04:02
Core Viewpoint - The article discusses the development of a vision-language model (VLM) agent named ScreenExplorer, which is designed to autonomously explore and interact within open graphical user interface (GUI) environments, marking a significant step towards achieving general artificial intelligence (AGI) [2][3][35]. Group 1: Breakthroughs and Innovations - The research introduces three core breakthroughs in the training of VLM agents for GUI exploration [6]. - A real-time interactive online reinforcement learning framework is established, allowing the VLM agent to interact with a live GUI environment [8][11]. - The introduction of a "curiosity mechanism" addresses the sparse feedback issue in open GUI environments, motivating the agent to explore diverse interface states [10][12]. Group 2: Training Methodology - The training involves a heuristic and world model-driven reward system that encourages exploration by providing immediate rewards for diverse actions [12][24]. - The GRPO algorithm is utilized for reinforcement learning training, calculating the advantage of actions based on rewards obtained [14][15]. - The training process allows for multiple parallel environments to synchronize reasoning, execution, and recording, enabling "learning by doing" [15]. Group 3: Experimental Results - Initial experiments show that without training, the Qwen2.5-VL-3B model fails to interact effectively with the GUI [17]. - After training, the model demonstrates improved capabilities, successfully opening applications and navigating deeper into pages [18][20]. - The ScreenExplorer models outperform general models in exploration diversity and interaction effectiveness, indicating a significant advancement in autonomous GUI interaction [22][23]. Group 4: Skill Emergence and Conclusion - The training process leads to the emergence of new skills, such as cross-modal translation and complex reasoning abilities [29][34]. - The research concludes that ScreenExplorer effectively enhances GUI interaction capabilities through a combination of exploration rewards, world models, and GRPO reinforcement learning, paving the way for more autonomous agents and progress towards AGI [35].