DeepAgent and DeepSearch Both Top the Leaderboards: The Answer Points to openJiuwen, an Emerging Open-Source Project
36Kr · 2026-02-12 07:06
Core Insights
- The article highlights the emergence of advanced AI agents, particularly DeepAgent and DeepSearch, which have achieved top rankings in the GAIA and BrowseComp-Plus benchmarks respectively, indicating a significant leap in AI capabilities [1][20]

Group 1: GAIA Benchmark Insights
- DeepAgent, built on the openJiuwen platform, achieved a score of 91.69%, surpassing competitors like NVIDIA's Nemotron and showcasing superior capability on general agent tasks [2][10]
- GAIA is a rigorous benchmark that evaluates AI agents on 12 core competencies, including long-term task planning and multi-modal understanding, with a scoring system that emphasizes real-world task execution [6][4]
- Human participants average roughly a 92% success rate on GAIA, while leading AI models like GPT-4 achieve only about 15%, underscoring the benchmark's difficulty [6][10]

Group 2: DeepAgent's Capabilities
- DeepAgent dynamically adjusts its plans based on real-time feedback, ensuring task completion even in changing environments [12][13]
- A multi-layered context engine maintains cognitive consistency and traceability throughout complex tasks, enhancing the reliability of its outputs [15]
- An asynchronous tool orchestration system coordinates diverse external tools, enabling efficient and reliable execution of varied tasks [16][17]

Group 3: BrowseComp-Plus Benchmark Insights
- DeepSearch, also based on openJiuwen, achieved 80% accuracy on the BrowseComp-Plus benchmark, demonstrating its strength in deep search and web interaction [20][24]
- BrowseComp-Plus evaluates agents on multi-hop retrieval and cross-source information integration, making it a critical measure of an agent's practical capabilities [23][24]
- The benchmark employs a fixed, human-validated corpus to ensure fairness and reproducibility, avoiding biases from real-time web dynamics [23]

Group 4: Technological Foundation
- Both DeepAgent and DeepSearch are built on the openJiuwen platform, which provides a comprehensive framework for developing high-precision, high-efficiency AI agents [30][31]
- openJiuwen supports multi-agent collaboration and self-evolution, allowing agents to continuously improve through a closed-loop optimization process [31][32]
- The platform has already been commercialized in sectors including finance and manufacturing, indicating broad applicability and potential for future growth [31]
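The "asynchronous tool orchestration" described above can be illustrated with a small, hypothetical asyncio sketch (the tool names and the orchestrate() helper are invented for illustration; this is not openJiuwen's or DeepAgent's API). Independent tool calls run concurrently, so total latency tracks the slowest call rather than the sum of all calls:

```python
# Hypothetical sketch of asynchronous tool orchestration: independent tool
# calls run concurrently instead of sequentially. Names are illustrative.
import asyncio

async def call_tool(name: str, delay: float) -> str:
    """Stand-in for an external tool call (search, browser, calculator...)."""
    await asyncio.sleep(delay)      # simulated I/O latency
    return f"{name}: done"

async def orchestrate() -> list:
    tasks = [
        call_tool("web_search", 0.02),
        call_tool("code_runner", 0.01),
        call_tool("file_reader", 0.03),
    ]
    # gather() runs the calls concurrently and preserves argument order.
    return await asyncio.gather(*tasks)

results = asyncio.run(orchestrate())
print(results)
```

Because the three simulated calls overlap, the whole batch completes in roughly the time of the slowest call (0.03s) rather than the sum (0.06s).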
DeepAgent and DeepSearch Both Top the Leaderboards! The Answer Points to openJiuwen, an Emerging Open-Source Project
机器之心 (Jiqizhixin) · 2026-02-12 05:16
Core Insights
- The article highlights the emergence of advanced AI agents, particularly Clawdbot and its evolution into OpenClaw, reflecting a global desire for more sophisticated and reliable AI systems [1]
- The year 2025 is referred to as the "Year of AI Agents," with numerous agents being developed and evaluated against rigorous benchmarks like GAIA and BrowseComp-Plus [1][2]
- DeepAgent and DeepSearch, built on the openJiuwen platform, have achieved top rankings in the GAIA and BrowseComp-Plus benchmarks respectively, showcasing their advanced capabilities [2][25]

GAIA Benchmark Insights
- DeepAgent achieved a score of 91.69%, surpassing competitors like NVIDIA's Nemotron and indicating strong general agent performance [4][13]
- GAIA evaluates agents on 12 core abilities, including long-term task planning and multi-modal understanding, with a scoring system that emphasizes real-world task difficulty [8][10]
- Human participants average roughly a 92% success rate on GAIA, while leading AI models like GPT-4 perform significantly lower, highlighting the challenge AI agents face [9]

DeepAgent's Capabilities
- DeepAgent dynamically adjusts its plans based on real-time feedback, ensuring task completion even in changing environments [17]
- A multi-layered context engine maintains consistency and traceability in reasoning, crucial for complex tasks [19][21]
- Its ability to execute end-to-end tasks, such as analyzing YouTube cooking videos and then purchasing the ingredients, demonstrates practical real-world application [15]

BrowseComp-Plus Benchmark Insights
- DeepSearch achieved 80% accuracy, leading the BrowseComp-Plus ranking, which assesses deep search and web browsing capabilities [26][29]
- The benchmark focuses on multi-hop retrieval and cross-source information integration, emphasizing the agent's ability to extract relevant information from vast datasets [29][30]
- The scoring mechanism ensures fairness and reproducibility by using a fixed, human-validated corpus that avoids biases from real-time web dynamics [30]

DeepSearch's Capabilities
- DeepSearch employs multi-branch reasoning, exploring several potential solutions simultaneously to improve search efficiency [35]
- An intelligent action-exploration system balances search depth against path diversity, addressing the challenges of noise and misinformation [37][39]
- The design mimics human expert reasoning, adaptively prioritizing search actions based on real-time evaluations [39][40]

openJiuwen Platform Insights
- Both DeepAgent and DeepSearch are built on the openJiuwen platform, a comprehensive framework for developing high-precision, controllable AI agents [41][42]
- The platform supports multi-agent collaboration and self-evolution, allowing continuous improvement and adaptability in task execution [43]
- openJiuwen has been commercialized in sectors including finance and manufacturing, indicating broad applicability and potential for industry transformation [43]

Conclusion
- The AI agent landscape is at a pivotal point, dividing basic language-interactive agents from advanced systems capable of planning, resource scheduling, and self-repair [46]
- The success of DeepAgent and DeepSearch underscores the importance of robust architectural design for high performance under stringent evaluation [46][48]
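The "multi-branch reasoning" attributed to DeepSearch resembles a beam-style best-first search: several candidate paths are kept alive, each is expanded, and only the most promising survive pruning. The sketch below is a generic, hypothetical illustration of that pattern; the toy problem, scoring function, and names are invented, not DeepSearch's internals:

```python
# Generic sketch of multi-branch ("beam-style") search: keep several
# candidate paths alive, expand them, and prune to the best few.
import heapq

def multi_branch_search(start, expand, score, goal, beam_width=3, max_steps=20):
    frontier = [start]
    for _ in range(max_steps):
        candidates = []
        for path in frontier:
            if goal(path):
                return path
            candidates.extend(expand(path))
        if not candidates:
            return None
        # Prune: keep only the highest-scoring branches alive.
        frontier = heapq.nlargest(beam_width, candidates, key=score)
    return None

# Toy problem: reach the value 10 by repeatedly adding 1 or 3.
path = multi_branch_search(
    start=(0,),
    expand=lambda p: [p + (p[-1] + 1,), p + (p[-1] + 3,)],
    score=lambda p: p[-1] if p[-1] <= 10 else -p[-1],  # penalize overshoot
    goal=lambda p: p[-1] == 10,
)
print(path)
```

The fixed `beam_width` is a crude stand-in for the depth-versus-diversity trade-off the article describes: widening the beam explores more alternative paths at higher cost, while narrowing it commits faster to the current best branch.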
Hugging Face Once Rejected a $500 Million Investment From Nvidia: Unwilling to Be at the Mercy of a Single Giant
Sohu Caijing · 2026-01-29 12:38
Core Viewpoint
- Hugging Face, an AI startup, unexpectedly rejected a $500 million investment offer from Nvidia, aiming to avoid having a single dominant investor influence its decisions [1][3]

Group 1: Company Overview
- Hugging Face operates a platform hosting 2.5 million public AI models and over 700,000 public datasets, which users can download freely [3]
- The company has 13 million users worldwide and promotes open-source models for developers, in contrast to major players like OpenAI and Google, which focus on proprietary models [3][4]
- Hugging Face has raised a total of $400 million, was valued at $4.5 billion in 2023, and still retains half of its funds on hand [4]

Group 2: Business Model and Financials
- The company employs a "freemium" business model, with about 3% of customers, typically large enterprises, paying for additional features [4]
- Hugging Face aims for profitability by 2025 but reported a loss in the first quarter of this year due to investment in datasets [4]
- The company does not prioritize revenue maximization, instead encouraging developers to provide open-source alternatives for text, image, and visual models [4]

Group 3: Strategic Direction and Employee Dynamics
- In 2022, Hugging Face launched the multilingual AI model BLOOM but has since exited self-developed models to control costs [5]
- The company is investing in robotics, datasets, and scientific-research AI, having acquired the robotics company Pollen last year [5]
- Its decentralized development philosophy lets employees work remotely from various locations, though some former employees feel marginalized in strategic decisions [5]

Group 4: Employee Compensation and Culture
- Researcher salaries at Hugging Face typically range from $100,000 to $200,000, lower than at top tech companies but competitive for a startup [5]
- The company allows employees to discuss their work publicly, in contrast to larger tech firms that enforce strict communication protocols [6]
- Hugging Face's culture attracts talent committed to its mission of countering Silicon Valley dominance, as exemplified by its chief ethics scientist, who declined higher-paying offers to keep her voice [6]
Lux Capital lands $1.5B for its largest fund ever
Yahoo Finance · 2026-01-07 20:09
Core Insights
- Lux Capital has closed a $1.5 billion ninth fund, the largest in the firm's history [1]
- Despite new U.S. VC fundraising falling to a 10-year low in 2025, limited partners continue to back Lux Capital [1]

Investment Focus
- The firm has a history of investing in defense technologies, which have become highly sought after amid recent geopolitical shifts [2]
- Lux was an early investor in Anduril, valued at $30.5 billion, and Applied Intuition, valued at $15 billion, both of which have secured significant Pentagon contracts [2]

AI Investments
- Lux made early investments in AI startups before the industry's rapid growth following the launch of ChatGPT [3]
- Notable early-stage AI investments include Hugging Face, Runway AI, and MosaicML, the last of which was acquired by Databricks for $1.3 billion in 2023 [3]

Exits and Performance
- The firm has achieved significant exits, including Recursion Pharmaceuticals, which went public in 2021, and Auris Health, sold to Johnson & Johnson for up to $6 billion in 2019 [4]
- The latest fundraise brings Lux's total assets under management to $7 billion [4]
MMSI-Video-Bench, the Ultimate Spatial-Intelligence Challenge, Arrives: Top Large Models Are Wiped Out
机器之心 (Jiqizhixin) · 2026-01-05 08:54
Core Insights
- The article discusses the importance of spatial understanding in multimodal large language models (MLLMs) as they transition into real-world applications as "general intelligent assistants" [2]
- It highlights the limitations of existing spatial-intelligence benchmarks, which either rely heavily on template generation or focus on narrow spatial tasks, making it difficult to comprehensively assess models' spatial understanding and reasoning in real-world scenarios [2]

Group 1: Introduction of MMSI-Video-Bench
- The Shanghai Artificial Intelligence Laboratory's InternRobotics team has launched MMSI-Video-Bench, a comprehensive and rigorous spatial-intelligence video benchmark designed to challenge current mainstream multimodal models [2][6]
- The benchmark evaluates models' spatial perception, reasoning, and decision-making in complex, dynamic real-world environments [2][7]

Group 2: Benchmark Characteristics
- MMSI-Video-Bench systematically designs question types that assess basic spatial perception grounded in spatiotemporal information [6]
- It includes high-level decision-making evaluations and extends task categories to complex real-world scenarios, testing cross-video reasoning, memory updating, and multi-view integration [6][8]
- The benchmark comprises five major task types and 13 subcategories, ensuring a comprehensive evaluation of spatial intelligence [10]

Group 3: Challenge and Performance
- The questions are highly challenging: every model tested, including the best-performing Gemini 3 Pro, achieved only 38% accuracy, a performance gap of approximately 60% relative to human levels [10][14]
- The evaluation shows models struggle with spatial construction, motion understanding, planning, prediction, and cross-video reasoning, exposing critical bottlenecks in their capabilities [14][15]

Group 4: Error Analysis
- The research team identified five main error types affecting performance: detailed grounding errors, ID mapping errors, latent logical inference errors, prompt alignment errors, and geometric reasoning errors [17][21]
- Geometric reasoning errors were the most prevalent and significantly hurt performance, particularly in spatial construction tasks [19][21]

Group 5: Future Directions
- Introducing 3D spatial cues could help models understand spatial relationships better, indicating a potential direction for future research [22][24]
- Spatial cues must be designed so that models can truly understand and use them, since current failures stem from underlying reasoning capability rather than a lack of explicit reasoning steps [27]
Minion Skills: An Open-Source Implementation of Claude Skills
量子位 (QbitAI) · 2025-12-15 08:05
Submitted by the Minion Agent team to 量子位 (QbitAI)

Introduction

Claude recently launched an exciting feature: the Skills system. It lets an AI agent dynamically load specialized capabilities, "learning" on demand the skills for handling PDF, Excel, PPT, and other professional documents. As an open-source enthusiast, I immediately recognized the value of this design and implemented a complete open-source version in the Minion framework. This article introduces the design philosophy behind Skills, along with the details of my open-source implementation.

What problem does Skills solve?

In building AI agents there is a core tension: the context window is finite, while capability requirements are unbounded. The traditional approach stuffs every tool and every instruction into the system prompt:

System Prompt = base instructions + all tool descriptions + all domain knowledge = 50K+ tokens = high latency + high cost + low efficiency

Worse, most of the time the user needs only a small fraction of those capabilities. When a user asks "help me process this PDF," the system has nonetheless loaded the context for handling Excel, databases, code, and every other capability.

The core idea of Skills

Minion's open-source implementation

After seeing the Skills design in Claude Code, I decided to implement one in the Minion framework ...
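The on-demand loading idea described above can be sketched in a few lines: keep a registry of skills, and splice into the prompt only the instructions whose triggers match the user's request. This is an illustrative sketch under my own assumptions, not Minion's or Claude's actual API; the Skill and SkillRegistry names are invented:

```python
# Illustrative sketch of on-demand skill loading (hypothetical names).
# Only the skills a request triggers are added to the prompt, keeping
# the context window small.
from dataclasses import dataclass

@dataclass
class Skill:
    name: str
    triggers: tuple            # keywords that activate this skill
    instructions: str          # text loaded into context only when triggered

class SkillRegistry:
    def __init__(self, base_prompt: str):
        self.base_prompt = base_prompt
        self.skills = []

    def register(self, skill: Skill) -> None:
        self.skills.append(skill)

    def build_prompt(self, user_request: str) -> str:
        """Base prompt plus only the instructions of triggered skills."""
        req = user_request.lower()
        active = [s for s in self.skills if any(t in req for t in s.triggers)]
        return "\n\n".join([self.base_prompt] + [s.instructions for s in active])

registry = SkillRegistry("You are a helpful agent.")
registry.register(Skill("pdf", ("pdf",), "PDF skill: extract text page by page."))
registry.register(Skill("excel", ("excel", "xlsx"), "Excel skill: read sheets."))

# Only the PDF skill's instructions join the base prompt for this request.
print(registry.build_prompt("help me process this PDF"))
```

A real implementation would load instructions lazily from skill files and use the model itself (rather than keyword matching) to decide which skills apply, but the context-budget effect is the same: the prompt carries the base instructions plus one skill, not all of them.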
Howard Marks' Latest Investment Memo: Is This a Bubble?
36Kr · 2025-12-11 03:58
Core Viewpoint
- Howard Marks' investment memo discusses a potential "bubble" in AI investment and emphasizes the need for rational evaluation amid the current AI technology revolution [1][2]

Group 1: AI Investment Landscape
- Oaktree Capital has invested in several data centers, and its parent company Brookfield has raised a $10 billion fund for AI infrastructure investments [1]
- Major companies like Oracle, Meta, and Google have issued 30-year bonds for AI investment at yields only slightly above risk-free rates, raising questions about the wisdom of such long-term debt amid technological uncertainty [2][27]
- AI is seen as potentially the greatest transformative technology in history, with significant capital being allocated to it [3][16]

Group 2: Market Behavior and Speculation
- Current enthusiasm for AI could produce a bubble, characterized by excessive optimism and speculative investor behavior [4][5]
- Historical bubble patterns suggest that new technologies often attract irrational exuberance, leading to overvaluation and subsequent losses [7][8]
- The memo highlights the cyclical nature of bubbles, in which initial excitement can end in significant financial losses for investors [5][6]

Group 3: Debt Financing in AI
- Debt financing of AI infrastructure is increasing, raising concerns that it could amplify the risks of speculative investment [26][28]
- The memo warns that the current phase of speculative financing may lead to unsustainable practices reminiscent of past financial crises [28][29]
- It distinguishes healthy from unhealthy debt behavior in the AI sector, noting that some companies are leveraging aggressively without clear revenue prospects [27][28]

Group 4: Uncertainties and Future Outlook
- Despite AI's potential, considerable uncertainty remains about its commercialization, the identity of future winners, and overall market dynamics [18][19]
- The memo asks whether AI will produce monopolistic markets or remain competitive, which will shape profitability for the companies involved [19][20]
- It also questions the sustainability of AI-related investment, particularly the lifespan and economic viability of AI infrastructure [30][31]
Cerebras AI Inference Wins Demo of the Year Award at TSMC North America Technology Symposium
Businesswire · 2025-12-05 17:42
Core Insights
- Cerebras Systems won Demo of the Year for its AI Inference technology at the 2025 TSMC North America Technology Symposium, recognizing its significant innovation in AI infrastructure [1][3]

Group 1: Technological Achievements
- Cerebras has developed a wafer-scale processor, the CS-3, that is 50 times larger than conventional processors, enabling AI workloads to run more than 20 times faster than on GPUs [2][8]
- Its flagship technology, the Wafer Scale Engine 3 (WSE-3), is the largest and fastest AI processor, outperforming the largest GPU by 56 times while consuming less power per compute unit [8]

Group 2: Market Adoption and Partnerships
- Cerebras AI Inference is used in demanding environments worldwide, is available through major cloud platforms such as AWS, IBM, and Hugging Face, and has been adopted across healthcare, biotech, finance, and design [4][6]
- The technology supports critical national scientific research at U.S. Department of Energy laboratories and the Department of Defense, showcasing its versatility and reliability in high-stakes applications [4]

Group 3: Performance Metrics
- Cerebras is recognized as the fastest platform for AI coding, generating code more than 20 times faster than competing solutions, and consistently achieves the fastest inference speeds verified by independent benchmarks [5][8]
- The company serves trillions of tokens monthly across its cloud and on-premises deployments, indicating robust demand and operational scale [6]
After Five Years, Transformers v5 Finally Arrives
自动驾驶之心 · 2025-12-04 03:03
Core Insights
- The article discusses the release of Transformers v5.0.0rc0, marking a significant evolution of the AI infrastructure library after a five-year development cycle from v4 to v5 [3]
- It highlights the library's growth: daily downloads have risen from 20,000 to more than 3 million, and total installations have passed 1.2 billion since the v4 release in November 2020 [3]
- The new version focuses on four dimensions: simplicity, a shift from fine-tuning to pre-training, interoperability with high-performance inference engines, and making quantization a core feature [3]

Simplification
- The team's primary focus is simplicity, aiming for clean, clear model integrations that improve standardization, versatility, and community support [5][6]
- The library has adopted a modular design, easing maintenance and speeding integration while promoting community collaboration [10]

Model Updates
- Transformers serves as a toolbox of model architectures, with the goal of including all the latest models and becoming the trusted source for model definitions [7]
- Over the past five years, an average of 1-3 new models has been added weekly [8]

Model Conversion Tools
- Hugging Face is developing tools that identify similarities between new models and existing architectures, aiming to automate conversion of models into the Transformers format [13][14]

Training Enhancements
- v5 emphasizes support for pre-training, with redesigned model initialization and broader compatibility with optimization operators [20]
- Hugging Face continues to collaborate with fine-tuning tools in the Python ecosystem and is ensuring compatibility with tools in the JAX ecosystem [21]

Inference Improvements
- Inference is a key optimization area in v5, with dedicated kernels, cleaner defaults, new APIs, and enhanced support for inference engines [22][25]
- The goal is not to replace specialized inference engines but to be compatible with them [25]

Local Deployment
- The team works with popular inference engines so that models added to Transformers are immediately available and can leverage those engines' strengths [27]
- Hugging Face is also building local inference capabilities that let models run directly on devices, with expanding support for multimodal models [28]

Quantization
- Quantization is becoming a standard part of modern model development, with many state-of-the-art models released in low-precision formats such as 8-bit and 4-bit [29]
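The low-precision formats mentioned under Quantization can be illustrated with a toy symmetric int8 round-trip: weights are stored as 8-bit integers plus a single float scale and approximately recovered on dequantization. This is a conceptual sketch of the idea only, not Transformers' actual quantization code:

```python
# Toy symmetric int8 quantization: store weights as 8-bit integer codes
# plus one float scale; dequantization recovers them approximately.
# Assumes at least one nonzero weight.

def quantize_int8(weights):
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int8(q, scale):
    return [v * scale for v in q]

weights = [0.12, -0.5, 0.33, 1.0, -0.98]
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(q)         # 8-bit integer codes
print(max_err)   # rounding error, bounded by scale / 2
```

The memory saving is the point: each weight shrinks from 32 bits to 8 (or 4), at the cost of a bounded rounding error per weight; production schemes add per-channel or per-group scales to tighten that bound.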
After Five Years, Transformers v5 Finally Arrives
具身智能之心 · 2025-12-03 03:47
Edited by 机器之心 (Jiqizhixin)

Transformers v5 has just shipped its first release candidate, v5.0.0rc0.

GitHub: https://github.com/huggingface/transformers/releases/tag/v5.0.0rc0

The update marks the completion of a five-year technical cycle, from v4 to v5, for the world's most popular AI infrastructure library. As Hugging Face's most central open-source project, Transformers has seen its daily downloads surge from 20,000 at the time of the v4 release in November 2020 to more than 3 million today, with total installations exceeding 1.2 billion. It has defined how the industry uses models: supported architectures have grown from the original 40 to more than 400, spanning text, vision, audio, and multimodal domains, and community-contributed model weights now exceed 750,000. The team says that in the field of AI, "reinvention" is how to stay ...