Workflow
机器之心
ACL 2025 | Are LLM agents flailing with trial-and-error and blind tool calls? KnowSelf gives agents "knowledge-boundary awareness"
机器之心· 2025-05-21 08:04
In the AI field, large-model agents are advancing at a remarkable pace. The ACL 2025 paper introduced here, "Agentic Knowledgeable Self-awareness," focuses on strengthening agents' awareness of their own knowledge boundaries so they can handle complex task planning more competently, offering a new path toward reliable agent deployment.

30-second summary

KnowSelf targets the knowledge-boundary-awareness problem that LLM agents face during decision making. Inspired by human decision mechanisms, the paper argues that an agent should be able to choose autonomously among three behavior modes: fast reaction (fast thinking), deep reasoning (slow thinking), and proactively invoking external tools (with external knowledge augmentation as the running example).

By learning its own knowledge boundary, KnowSelf lets the agent judge in each situation whether it has enough knowledge to generate and reason on its own, reducing wasted trial-and-error and knowledge misuse. Experiments show that KnowSelf improves the agent's knowledge-invocation accuracy, task-planning efficiency, and cross-task generalization.

Research background: the dilemma of agent planning

LLM agents show great potential across many domains, but existing agent-planning methods have drawbacks. Conventional approaches mostly follow a "blind infusion" pattern, injecting gold trajectories, external feedback, and domain knowledge into the agent model indiscriminately, while completely ignoring the "self-awareness" principle that is central to human decision making. This indiscriminate infusion causes agents ...
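The three-mode decision loop described above can be made concrete with a small routing sketch. This is a minimal illustration, not the paper's implementation: the uncertainty scorer, the callable interfaces, and the threshold values are assumptions introduced here purely for clarity.

```python
from typing import Callable

# Hypothetical thresholds: below FAST the agent is confident enough to act
# directly; between the two it reflects ("slow thinking"); above SLOW it
# asks for external knowledge. The values are illustrative only.
FAST_THRESHOLD = 0.3
SLOW_THRESHOLD = 0.7

def knowledgeable_step(situation: str,
                       uncertainty: Callable[[str], float],
                       act: Callable[[str], str],
                       reflect: Callable[[str], str],
                       retrieve: Callable[[str], str]) -> str:
    """Route one planning step according to the agent's estimated knowledge boundary."""
    u = uncertainty(situation)          # self-assessed lack of knowledge in [0, 1]
    if u < FAST_THRESHOLD:
        return act(situation)           # fast thinking: act directly
    if u < SLOW_THRESHOLD:
        thought = reflect(situation)    # slow thinking: reason before acting
        return act(situation + "\n" + thought)
    knowledge = retrieve(situation)     # boundary exceeded: augment with external knowledge
    return act(situation + "\n" + knowledge)
```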
Policy learning boosts LLM inference efficiency: MIT and Google propose a new asynchronous parallel generation paradigm
机器之心· 2025-05-21 04:00
金天 is a fifth-year PhD student at the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL), advised by Michael Carbin and Jonathan Ragan-Kelley. His research focuses on the intersection of machine learning and programming systems. He previously led the deployment of deep neural network inference on IBM mainframes at IBM Research, and holds a bachelor's degree from Haverford College with a double major in computer science and mathematics.

鄭鈺熹 is a third-year PhD student at MIT CSAIL, advised by Michael Carbin. Her research lies at the intersection of programming languages and machine learning.

The generation paradigm of large language models (LLMs) is shifting from traditional "single-writer" decoding toward "collaboration among copies of the model." Conventional autoregressive decoding produces content strictly in sequence, whereas the emerging asynchronous generation paradigm identifies semantically independent chunks of content and generates them in parallel.

As shown in the figure, the conventional method (bottom) generates all content sequentially, while asynchronous generation (top) processes multiple mutually independent chunks at the same time. Compared with sequential generation, asynchronous generation achieves a 1.21-1.93x geometric-mean speedup on the length-controlled AlpacaEval benchmark, with corresponding changes in generation quality (win rate) ranging from +2.2% to -7.1%.

The MIT and Google research team, in their latest work PASTA (PArallel STructure Anno ...
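The parallel-decoding idea can be sketched as follows. This is not the PASTA system itself: it assumes a generic `generate(prompt)` call and a pre-identified chunk plan, and simply shows how semantically independent chunks could be produced concurrently and then stitched back together in order.

```python
import asyncio

async def generate(prompt: str) -> str:
    """Placeholder for an LLM call; assumed to be I/O-bound (e.g. an API request)."""
    await asyncio.sleep(0.1)  # stand-in for real decoding latency
    return f"[text for: {prompt}]"

async def generate_async(plan: list[str]) -> str:
    # Each entry in `plan` is a chunk the model has marked as semantically
    # independent (in PASTA such spans are identified via learned annotations).
    chunks = await asyncio.gather(*(generate(p) for p in plan))
    # Independent chunks are decoded concurrently, then re-assembled in their
    # original order so the final answer still reads sequentially.
    return "\n".join(chunks)

if __name__ == "__main__":
    outline = ["introduce the topic", "list the pros", "list the cons"]
    print(asyncio.run(generate_async(outline)))
```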
Kaiming He's team releases new work: MeanFlow sets a one-step image generation SOTA, with gains of up to 50%
机器之心· 2025-05-21 04:00
Core Viewpoint
- The article discusses a new generative modeling framework called MeanFlow, which significantly improves existing flow matching methods by introducing the concept of average velocity, achieving a FID score of 3.43 on the ImageNet 256×256 dataset without the need for pre-training, distillation, or curriculum learning [3][5][7].

Methodology
- MeanFlow introduces a new ground-truth field representing average velocity instead of the commonly used instantaneous velocity in flow matching [3][8].
- The average velocity is defined as the displacement over a time interval, and the relationship between average and instantaneous velocity is derived to guide network training [9][10].

Performance Results
- MeanFlow demonstrates strong performance in one-step generative modeling, achieving a FID score of 3.43 with only 1-NFE, which is a 50% improvement over the best previous methods [5][16].
- In 2-NFE generation, MeanFlow achieves a FID score of 2.20, comparable to leading multi-step diffusion/flow models [18].

Comparative Analysis
- The article provides a comparative analysis of MeanFlow against previous single-step diffusion/flow models, showing that MeanFlow outperforms them significantly, with a FID score of 3.43 compared to 7.77 for IMM [16][17].
- The results indicate that the proposed method effectively narrows the gap between single-step and multi-step diffusion/flow models [18].
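In symbols, the average-velocity construction summarized in the Methodology section above can be written as follows. This is a brief sketch in standard flow-matching notation (z_t for the state at time t, v for the instantaneous velocity field), not the paper's full derivation.

```latex
% Average velocity over the interval [r, t]: displacement divided by elapsed time.
u(z_t, r, t) \;=\; \frac{1}{t - r} \int_{r}^{t} v(z_\tau, \tau)\, \mathrm{d}\tau .

% Differentiating (t - r)\,u(z_t, r, t) with respect to t and using the
% fundamental theorem of calculus gives a relation between average and
% instantaneous velocity that can serve as a simulation-free training target:
u(z_t, r, t) \;=\; v(z_t, t) \;-\; (t - r)\,\frac{\mathrm{d}}{\mathrm{d}t}\, u(z_t, r, t).
```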
One Feishu chat box activated the knowledge assets of the 机器之心 editorial office
机器之心· 2025-05-21 04:00
Core Viewpoint
- The article emphasizes the importance of AI in understanding and managing enterprise knowledge, highlighting the limitations of general AI models in addressing specific organizational needs and the advantages of using specialized tools like Feishu Knowledge Q&A for efficient information retrieval and management [1][56].

Group 1: Feishu Knowledge Q&A Overview
- Feishu Knowledge Q&A is an AI tool that aggregates and comprehends all enterprise and personal information, providing accurate feedback based on messages, documents, and knowledge bases within the Feishu ecosystem [2][6].
- The tool features rapid integration of new information, achieving updates in seconds, which significantly enhances user productivity [2][6].

Group 2: AI Capabilities and Security
- Feishu Knowledge Q&A utilizes advanced AI capabilities to infer and generate content based on retrieved information, serving as a valuable work assistant [3][6].
- The tool implements a granular permission management system, ensuring that knowledge Q&A access aligns with individual user permissions, thus maintaining data security [3][38].
- Feishu's DeepSeek-R1 model is independently deployed, ensuring isolation from other services, which enhances user experience while safeguarding enterprise data [3][38].

Group 3: Knowledge Management and Retrieval
- The tool addresses the challenges of fragmented enterprise knowledge by enabling fuzzy search capabilities, allowing users to find relevant information quickly with vague queries [8][9].
- Feishu Knowledge Q&A can provide structured answers and contextually relevant information, enhancing understanding of complex topics [17][22].
- It can also streamline processes by presenting relevant policies and procedures, making it easier for employees to navigate organizational complexities [27][31].

Group 4: Business-Oriented Content Generation
- Beyond information retrieval, Feishu Knowledge Q&A can generate business reports, work plans, and charts based on internal knowledge [31][34].
- The tool demonstrated its ability to create structured outputs, such as ingredient lists for team events, showcasing its contextual understanding and content generation capabilities [34][36].

Group 5: Advanced Features and Flexibility
- The system supports multiple model switching, allowing for a tailored AI experience that meets diverse organizational needs [48].
- It integrates both internal knowledge bases and real-time external information, enhancing the breadth and timeliness of responses [48].
- The automatic source tracing feature helps mitigate the "hallucination" problem common in large language models, providing reliable and verifiable answers [47][48].
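Permission-aligned retrieval of the kind described in Group 2 above generally means filtering candidate documents by the asking user's access rights before any answer is generated. Below is a minimal sketch of that general pattern, not Feishu's actual implementation; the ACL structure, scoring rule, and function names are assumptions introduced for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Document:
    doc_id: str
    text: str
    allowed_users: set[str] = field(default_factory=set)  # simple ACL stand-in

def retrieve_with_permissions(query: str, user: str,
                              index: list[Document], top_k: int = 5) -> list[Document]:
    # 1) Drop anything the user cannot already open; the Q&A layer should never
    #    see content beyond the asker's own permissions.
    visible = [d for d in index if user in d.allowed_users]
    # 2) Rank the remaining documents; naive keyword overlap stands in for the
    #    real semantic ("fuzzy") search described in the article.
    scored = sorted(visible,
                    key=lambda d: sum(w in d.text for w in query.split()),
                    reverse=True)
    return scored[:top_k]
```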
Large models erupt across the board and Gemini tops every leaderboard! Google steps to the front of the stage overnight
机器之心· 2025-05-21 00:33
Core Viewpoint
- Google is reaffirming its leadership in the AI industry through significant advancements and new releases showcased at the Google I/O 2025 developer conference, emphasizing the importance of AI in its future strategy [2][61].

Group 1: AI Model Developments
- The Gemini 2.5 Pro model has shown outstanding performance in academic benchmarks and is now a leading model in the WebDev Arena and LMArena rankings [8][12].
- New features introduced for Gemini 2.5 Pro and 2.5 Flash include native audio output for more natural conversations, advanced security measures, and enhanced computational capabilities [9][15].
- The Gemini Diffusion model utilizes diffusion technology to improve inference speed and control, achieving a token generation speed of 10,095 tokens every 12 seconds, which is five times faster than previous models [16][18].

Group 2: Programming Tools Enhancements
- Google introduced Jules, an asynchronous coding assistant that integrates with existing codebases, allowing users to focus on other tasks while it performs coding operations [21].
- Gemini Code Assist has been upgraded to support more customization options and now offers a context window of 2 million tokens for complex tasks [23].
- Statistics show that Gemini Code Assist can increase the success rate of developers completing common tasks by 2.5 times [24].

Group 3: Video and Image Generation Models
- The new video generation model Veo 3 can generate videos with audio, enhancing the quality of video content creation [29][30].
- Imagen 4 offers exceptional detail and clarity in image generation, supporting various aspect ratios and high resolutions up to 2k [35].

Group 4: Search and Shopping Innovations
- Google has upgraded its AI Overviews feature in search, now covering over 200 countries and supporting more than 40 languages, improving user satisfaction and search frequency [47][48].
- A new AI shopping experience combines Gemini capabilities with Shopping Graph, allowing users to virtually try on clothing by uploading photos [56][59].

Group 5: Future Vision and Strategic Direction
- Google aims to expand Gemini into a universal AI assistant capable of managing daily tasks and enhancing user productivity, with ongoing innovations in video understanding and memory features [19][60].
- The company is positioning itself to lead in the AI-driven era, showcasing its commitment to shaping a more intelligent and interconnected world through advanced AI applications [61].
Over 90% of models stop at the Silver tier, and only 3 reach Platinum! An evaluation standard for the second half of general AI has arrived
机器之心· 2025-05-21 00:33
Core Viewpoint
- The development of artificial intelligence (AI) is entering a new phase where the focus shifts from solving problems to defining them, emphasizing the importance of evaluation standards over training techniques [2][3].

Group 1: Evaluation Framework
- A new evaluation framework called "General-Level" has been proposed to assess the capabilities of multimodal large language models (MLLMs), aiming to measure their progress towards artificial general intelligence (AGI) [3][6].
- The General-Level framework categorizes MLLMs into five levels based on their ability to exhibit synergy across different tasks and modalities, with the highest level representing true multimodal intelligence [11][15].
- The framework highlights the need for a unified standard to evaluate "generalist intelligence," addressing the current fragmentation in assessment methods [6][9].

Group 2: General-Bench Testing Set
- General-Bench is a comprehensive multimodal testing set consisting of 700 tasks and approximately 325,800 questions, designed to rigorously evaluate MLLMs across various modalities [19][21].
- This testing set emphasizes open-ended responses and content generation, moving beyond traditional multiple-choice formats to assess models' creative capabilities [24][25].
- The design of General-Bench includes cross-modal tasks that require models to integrate information from different modalities, simulating real-world challenges [24][25].

Group 3: Model Performance Insights
- Initial testing results reveal that many leading models, including GPT-4V, exhibit significant weaknesses, particularly in video and audio tasks, indicating a lack of comprehensive multimodal capabilities [23][25].
- Approximately 90% of tested models only reached Level-2 (Silver) in the General-Level framework, demonstrating limited synergy and generalization across tasks [27][28].
- No models have yet achieved Level-5 (King) status, highlighting the ongoing challenges in achieving true multimodal intelligence and the need for further advancements [28][29].

Group 4: Community Response and Future Outlook
- The introduction of General-Level and General-Bench has garnered positive feedback from both academic and industrial communities, with recognition at major conferences [35][36].
- The open-source nature of the project encourages collaboration and continuous improvement of the evaluation framework, fostering a community-driven approach to AI assessment [36][39].
- The new evaluation paradigm is expected to accelerate progress towards AGI by providing clear benchmarks and encouraging a focus on comprehensive model capabilities rather than isolated performance metrics [41][42].
A $200,000 prize pool awaits! The first WBCD 2025 bimanual robot challenge opens worldwide
机器之心· 2025-05-20 07:33
Published by 机器之心 | 机器之心 Editorial Department

1. Introduction to WBCD 2025

From May 19 to 23, the robotics field's annual flagship event, the IEEE International Conference on Robotics and Automation (ICRA 2025), is being held in Atlanta, USA. There, the finals of the first "What Bimanual Can Do" (WBCD) challenge, a bimanual-robot competition that explores the boundaries of robot capability, will also take place on-site at ICRA 2025. Finals venue: ICRA Exhibit Hall Booth C08.

As an officially partnered ICRA competition, WBCD is positioned around validation in real-world scenarios. Starting from the practical needs of robotics companies, it sets up three frontier challenge tracks, focusing on deployable capabilities of bimanual robots such as autonomous perception, predictive planning, and fine manipulation.

The tracks are set up as follows:

This edition of WBCD attracted 88 teams from around the world. After multiple rounds of screening, 16 university and industry teams advanced to the finals. They come from UC Berkeley, Carnegie Mellon University, Purdue University, Northwestern University, Georgia Tech, ETH Zurich, EPFL, Ewha Womans University, ShanghaiTech University, Shanghai Jiao Tong University, and companies including IO.ai, Frodobots, DexForce, and TSC Consulting.

1. Logistics packing challenge (Pack ...
Bringing conversational interfaces directly to the Web: Microsoft open-sources NLWeb, enabling ChatGPT-level search
机器之心· 2025-05-20 07:33
Building a conversational interface for a website is hard; NLWeb aims to make it much easier.

Microsoft's Build 2025 developer conference has kicked off.

Report by 机器之心 | 机器之心 Editorial Department

Put simply, MCP is to NLWeb what HTTP is to HTML.

As an open protocol together with a set of related open-source tools, NLWeb's main goal is to build a foundational layer for the AI web, much as HTML revolutionized document sharing.

Building agents that can converse freely in natural language with applications, and with the wider computing world, has long been at the heart of the AI revolution. Yet most of these new interactions are currently monopolized by products such as ChatGPT, Claude, and even Bing; these bots absorb vast amounts of knowledge while generating little substantive value in return.

NLWeb, by contrast, is far cheaper than traditional search and very easy to adopt: with just a few lines of code, any AI model of your choice (OpenAI, DeepSeek, Gemini, Anthropic, Inception, and so on), and your own data, NLWeb gives users a conversational interface, essentially a text box plus a submit button.

Among the announcements, an open-source project named NLWeb (Natural Language Web) has drawn wide attention. The project aims to simplify websites' natural-language interaction interfaces ...
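As a purely illustrative sketch of the "text box plus submit button over your own data" idea described above (this is not NLWeb's actual API; the endpoint, the `answer` helper, and the data lookup are hypothetical), a site-side conversational endpoint might look roughly like this:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

SITE_DATA = {  # hypothetical structured data a site would expose
    "shipping": "We ship worldwide within 5 business days.",
    "returns": "Items can be returned within 30 days.",
}

def answer(question: str) -> str:
    # Hypothetical stand-in for "pick a model and feed it your data": a real
    # setup would call the chosen LLM (OpenAI, Gemini, etc.) with the site's
    # data as context; here a naive keyword lookup plays that role.
    for key, text in SITE_DATA.items():
        if key in question.lower():
            return text
    return "Sorry, I don't know."

class AskHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers["Content-Length"]))
        question = json.loads(body).get("question", "")
        reply = json.dumps({"answer": answer(question)}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(reply)

if __name__ == "__main__":
    # POST {"question": "..."} to http://localhost:8080/ to get an answer.
    HTTPServer(("localhost", 8080), AskHandler).serve_forever()
```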
A 750,000 RMB prize pool plus attractive job offers: 启元实验室's flagship 2025 competition arrives with three tracks. Come take the challenge!
机器之心· 2025-05-20 04:58
Published by 机器之心 | 机器之心 Editorial Department

To push intelligent algorithms from theoretical innovation to real-world deployment, 启元实验室 has officially launched the "启智杯" algorithm competition. This edition focuses on three directions: robust instance segmentation of satellite remote-sensing imagery, UAV ground-target detection for embedded platforms, and adversarial attacks against multimodal large models. Built around core capabilities such as robust perception, lightweight deployment, and adversarial defense, the competition aims to steer technical innovation toward concrete application scenarios and accelerate the transfer and large-scale adoption of intelligent algorithms. It offers a total prize pool of 750,000 RMB and is open to domestic research institutes, enterprises, public institutions, and other relevant organizations.

In remote sensing, deep-learning-based instance segmentation has already shown clear advantages: by building joint spatio-temporal feature representations, models keep improving in accuracy and adaptability, and remote-sensing interpretation is evolving from traditional methods toward automation and intelligence. In real applications, however, constrained by complex land cover, multi-view imaging differences, and cloud or haze occlusion, existing algorithms still fall noticeably short in fine-grained multi-object segmentation, cross-scene generalization, and robustness, and key bottlenecks urgently need to be broken.

In low-altitude scenarios, the combination of UAVs and intelligent detection algorithms has given rise to a new target-recognition paradigm. Leveraging the aerial viewpoint and high maneuverability, UAVs carrying detection models can perform efficient, flexible data collection and real-time analysis. Real-world deployment still faces challenges, however: targets in the imagery are dense, ...
Code and multimodal retrieval both top the SOTA charts! BAAI's BGE embedding models deliver a triple release, now fully open
机器之心· 2025-05-20 04:58
Published by 机器之心 | 机器之心 Editorial Department

Retrieval augmentation plays an important role in code and multimodal scenarios, and embedding models are a key component of any retrieval-augmentation stack. To meet this need, the Beijing Academy of Artificial Intelligence (BAAI), together with several universities, recently developed three embedding models: the code embedding model BGE-Code-v1, the multimodal embedding model BGE-VL-v1.5, and the visual document embedding model BGE-VL-Screenshot. These models achieve the best results to date in code and multimodal retrieval, topping the main benchmarks in these areas, including CoIR, Code-RAG, MMEB, and MVRB, by sizable margins.

Since its release in August 2023, BGE has become the first Chinese-developed AI model to top the Hugging Face leaderboard, and the most-downloaded model worldwide among all models released in 2023.

All three models, BGE-Code-v1, BGE-VL-v1.5, and BGE-VL-Screenshot, are now fully open to the community, supporting both technical research and industrial applications.

BGE-Code-v1:
BGE-VL-v1.5:
BGE-VL-Screenshot:

BGE, the general-purpose embedding model series led by BAAI, aims to provide an efficient, one-stop solution for vector representation and semantic retrieval across data types. It has already released multiple versions covering Chinese-English and multilingual retrieval as well as reranking models, and continues to set ...
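Embedding-based semantic retrieval of the kind BGE provides typically works by encoding queries and passages into vectors and ranking passages by similarity. Below is a minimal sketch using the earlier open checkpoint BAAI/bge-base-en-v1.5 through the sentence-transformers library; the three newly released models above are separate checkpoints and may require their own loading code, which is not shown here.

```python
from sentence_transformers import SentenceTransformer
import numpy as np

# An earlier, widely used open BGE checkpoint; BGE-Code-v1, BGE-VL-v1.5 and
# BGE-VL-Screenshot are distinct models and are not loaded this way.
model = SentenceTransformer("BAAI/bge-base-en-v1.5")

corpus = [
    "def quicksort(arr): ...",
    "BGE is a family of embedding models for retrieval.",
    "The weather in Beijing is sunny today.",
]
query = "Which embedding models can I use for semantic search?"

# Encode and L2-normalize so that a dot product equals cosine similarity.
doc_vecs = model.encode(corpus, normalize_embeddings=True)
query_vec = model.encode([query], normalize_embeddings=True)[0]

scores = np.dot(doc_vecs, query_vec)          # one similarity score per document
best = int(np.argmax(scores))
print(f"best match (score {scores[best]:.3f}): {corpus[best]}")
```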