机器之心
Anyone can train their own personal Agent: Shanghai Jiao Tong University open-sources a full-stack on-device Agent toolchain that beats GPT-5 in real-world scenarios!
机器之心· 2025-09-10 07:31
Open your phone and let an AI Agent automatically handle tedious tasks for you, such as ordering food delivery, booking hotels, or shopping online. This is becoming a new paradigm for smartphone interaction. Just now, a new challenger has entered the scene: a team from the IPADS Lab at Shanghai Jiao Tong University has officially open-sourced a full "toolkit" for mobile agents called MobiAgent.

APP: https://github.com/IPADS-SAI/MobiAgent/releases/download/v1.0/Mobiagent.apk

A personal agent that can autonomously handle most everyday tasks is moving from science fiction into reality. Yet the "last mile" toward truly hands-free use is not easy to travel. How to efficiently train and deploy agent models on phones has long seemed to be the preserve of a few large companies, from the acquisition of high-quality operation data ...

The complete guide to raising an Agent: three steps

To teach an AI to use a phone, it first has to see how humans operate one. MobiAgent's first core contribution is an AI-assisted agile data-collection "pipeline". In the past, preparing "textbooks" (annotated data) for AI was expensive and slow. Now, a lightweight MobiAgent tool records every human operation trajectory on the phone: taps, swipes, text input, and so on. For some simple tasks, a single recording ...
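Conceptually, one recorded trajectory in such a pipeline can be pictured as a structured log of UI actions. The schema below is purely hypothetical, invented for illustration; MobiAgent's actual data format may differ:

```python
import json

# A hypothetical schema for one recorded operation trajectory:
# each entry captures a single UI action a human performed on the phone.
trajectory = [
    {"step": 1, "action": "tap", "target": "search_box", "screen": "home"},
    {"step": 2, "action": "input", "text": "hotel near airport", "target": "search_box"},
    {"step": 3, "action": "swipe", "direction": "up", "screen": "results"},
    {"step": 4, "action": "tap", "target": "book_button", "screen": "detail"},
]

# Serialize one step the way a lightweight recorder might log it.
print(json.dumps(trajectory[0]))
print("steps recorded:", len(trajectory))
```

A log like this is cheap to collect yet sufficient for an agent model to learn action sequences from, which is the point of the agile-collection pipeline described above.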
AI making things up: is someone finally doing something about it?
机器之心· 2025-09-10 04:05
A 机器之心 report. Editors: +0, Zhang Qian. Imagine if large AI models such as ChatGPT could mark every place where they are uncertain as they generate text. Wouldn't you trust their answers far more? Last weekend, a paper released by OpenAI set the community abuzz. It systematically traced the root cause of hallucination to the reward signal: standard training and evaluation procedures reward guessing rather than rewarding a model for admitting uncertainty. Recognizing this problem and finding a targeted fix may be why GPT-5's hallucination rate dropped sharply. As large models push deeper into high-stakes domains such as medical consultation and legal advice, hallucination becomes ever more troublesome, so many researchers are working in this direction. Beyond tracing the causes of hallucination as OpenAI did, many are studying hallucination detection. Existing detection techniques, however, hit bottlenecks in practice: they typically work only on short factual queries, or require expensive external resources for verification. To address this, a new study from ETH Zurich and MATS proposes a low-cost, scalable method that identifies "hallucination tokens" in long-form content in real time, and scales to models as large as 70B parameters. Paper title: Real-Time Detection of ...
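The general idea behind token-level hallucination detection, training a cheap probe on a model's internal signals to flag unsupported tokens, can be sketched on toy data. Everything below (the two-feature token representation, the linear probe, the synthetic data) is an illustrative assumption, not the paper's actual method or features:

```python
import math
import random

random.seed(0)

# Toy stand-in for two per-token signals a detector might probe
# (e.g. a confidence score and a hidden-state projection):
# supported tokens cluster high, hallucinated tokens cluster low.
supported = [(random.gauss(2.0, 1.0), random.gauss(2.0, 1.0)) for _ in range(200)]
hallucinated = [(random.gauss(-2.0, 1.0), random.gauss(-2.0, 1.0)) for _ in range(200)]
data = [(x, 0) for x in supported] + [(x, 1) for x in hallucinated]

# A linear probe (logistic regression) trained by plain gradient descent.
w = [0.0, 0.0]
b = 0.0
lr = 0.1
for _ in range(300):
    gw = [0.0, 0.0]
    gb = 0.0
    for (x1, x2), y in data:
        p = 1.0 / (1.0 + math.exp(-(w[0] * x1 + w[1] * x2 + b)))
        gw[0] += (p - y) * x1
        gw[1] += (p - y) * x2
        gb += p - y
    w[0] -= lr * gw[0] / len(data)
    w[1] -= lr * gw[1] / len(data)
    b -= lr * gb / len(data)

# Flag tokens whose predicted hallucination probability exceeds 0.5.
correct = 0
for (x1, x2), y in data:
    p = 1.0 / (1.0 + math.exp(-(w[0] * x1 + w[1] * x2 + b)))
    correct += int((p > 0.5) == bool(y))
accuracy = correct / len(data)
print(f"probe accuracy on toy data: {accuracy:.2f}")
```

The appeal of this family of methods is that the probe adds almost no inference cost, which is what makes per-token, real-time flagging in long-form generation plausible.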
In the inaugural year of AI applications, this benchmark competition showcases the speed and ambition of Chinese innovation
机器之心· 2025-09-10 04:05
An original 机器之心 report. Editor: Wu Xin. A collective preview of the future of financial intelligence witnessed entrepreneurs sprinting and reflected an industry's evolution. AI in 2025 is running a dual-track marathon: at one end, the underlying large models keep evolving, far from their ceiling; at the other, applications are erupting across scenarios. a16z's latest list of the world's top 100 GenAI applications sends a clear signal: in applying AI to transform industries, Chinese players already show a globally leading edge. Meanwhile, the State Council's "AI+" action plan has added fuel to the fire: AI's reach is expanding from pilots in new productive forces to society at large, and is viewed as a core engine of future modernization. This pulse was on full display at the AFAC2025 Financial Intelligence Innovation Competition. A benchmark financial-intelligence event held for three consecutive years, it has become a gathering point for AI startup teams at home and abroad. Over the three-month season, 11 teams stood out in the startup track; their winning projects target real financial pain points, span breakthroughs in underlying technology and complex systems engineering, are highly deployable, and show notably cross-disciplinary innovation. "China's speed of application deployment leads the world," said another judge, chief of staff and board director at xcube.co, and of the Singapore FinTech Festival and GFT ...
Apple event: AirPods that measure heart rate, a Watch that plays music, and an ultra-thin iPhone Air
机器之心· 2025-09-09 23:21
A 机器之心 report. Editors: Yang Wen, Panda. At 1 a.m. Beijing time on September 10, with Tim Cook's "Good Morning", Apple's fall 2025 launch event, themed "Awe Dropping", officially kicked off. The event ran 75 minutes, with AirPods, Apple Watch, and the iPhone 17 series taking turns on stage. The most memorable selling points: earbuds that measure heart rate, a watch that plays music, and an ultra-thin iPhone Air. This year's iPhone 17 series comprises four models, priced as follows: iPhone 17 starting at $799 / ¥5,999; all of the above models open for pre-order on Friday, September 12, and are slated to ship the following Friday (September 19). As for the much-anticipated AI features, the event offered precious little. Most of the consumer-facing AI features that were mentioned, such as Visual Intelligence and live translation in iMessage and FaceTime, had already been shown at WWDC back in June, and they are not Apple innovations either: competitors such as Google and Samsung shipped similar features a year earlier. More tellingly, Apple's stock dipped half an hour before the event even began, and fell 1.48% afterward ...
A new RAG reasoning paradigm built from first principles: Ant Group's DIVER tops an authoritative benchmark
机器之心· 2025-09-09 11:46
In the current technical paradigm driven by large language models (LLMs), retrieval-augmented generation (RAG) has become a core technique for boosting a model's knowledge and mitigating hallucination. Yet existing RAG systems remain significantly limited on tasks requiring multi-step logical reasoning. The specific challenges are as follows:

To establish a rigorous evaluation framework, the research community introduced BRIGHT, the first authoritative benchmark for reasoning-intensive retrieval. It covers real queries drawn from knowledge-intensive domains such as economics, psychology, mathematics, and programming. What these queries share is that their answers cannot be obtained explicitly through conventional direct retrieval, which causes many RAG systems to fail. BRIGHT instead requires building an evidence chain through multi-step reasoning, i.e., "first principles": deriving from the root cause rather than reasoning by analogy.

Paper title: DIVER: A Multi-Stage Approach for Reasoning-intensive Information Retrieval
arXiv: https://arxiv.org/pdf/2508.07995
Code and model open-source link:

Surface Relevance: traditional methods such as TF-IDF/BM25 rely excessively on lexical overlap and tend to recall documents that share keywords with the query, leaving retrieval stuck at shallow text matching ...
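The surface-relevance failure mode is easy to reproduce with a toy lexical scorer. The scorer below is a crude stand-in for TF-IDF/BM25, and the query and documents are invented examples (not from the benchmark): a document that merely repeats the query's keywords outranks the one containing the reasoning-relevant answer.

```python
# Toy lexical-overlap scorer: counts distinct shared terms,
# a crude stand-in for TF-IDF/BM25-style keyword matching.
def lexical_score(query: str, doc: str) -> int:
    q_terms = set(query.lower().split())
    d_terms = set(doc.lower().split())
    return len(q_terms & d_terms)

query = "why does my pendulum swing slower on a mountain"

# Shares many keywords with the query but answers nothing.
doc_keyword_match = "pendulum swing mountain photos: slower shutter speeds for mountain pendulum clocks"
# Contains the first-principles answer but shares almost no keywords.
doc_reasoning_match = "gravitational acceleration decreases with altitude, increasing a clock's period"

s1 = lexical_score(query, doc_keyword_match)
s2 = lexical_score(query, doc_reasoning_match)
print(s1, s2)  # → 4 1: the keyword-stuffed document wins
```

This is exactly the trap BRIGHT is built to expose: the relevant document must be reached by reasoning from cause to effect, not by matching surface vocabulary.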
The new 文心 reasoning model gives us confidence
机器之心· 2025-09-09 11:46
A report by the 机器之心 editorial team. With today's large language models, the fear is not that they can't solve a task but that they talk nonsense: because hallucination exists, we instinctively distrust AI output. Just last week, OpenAI's paper "Why Language Models Hallucinate" circulated widely; the researchers argued that eliminating hallucination requires fixing the scoring mechanisms used during model training and developing entirely new techniques. In AI, though, technology moves faster than expected. As if in answer to OpenAI's research, at this morning's WAVE SUMMIT Deep Learning Developer Conference 2025, Baidu released a new model that raises trustworthiness a large step, with more accurate factuality plus marked gains in instruction following and agent capabilities. The release is 文心 X1.1, a deep-thinking model that upgrades the flagship X1 released in April. It went live at launch, free for everyone to try, and is also available to enterprise customers and developers through Baidu AI Cloud's Qianfan platform. The upgraded model targets factuality, instruction following, and agent/tool-calling ability, bringing a marked lift in overall capability. In numbers: compared with 文心 X1, X1.1 improves factuality by 34.8%, instruction following by 12.5%, and agent capability by 9.6%. That means it is more reliable when providing information and executing tasks ...
Is SFT far inferior to RL? A timeless razor principle opens the door to "lifelong learning" for large models
机器之心· 2025-09-09 11:46
Core Viewpoint - The article discusses the challenges and advancements in large models, particularly focusing on the phenomenon of catastrophic forgetting and the advantages of reinforcement learning (RL) over supervised fine-tuning (SFT) in mitigating this issue [1][3][29]. Group 1: Large Models and Their Challenges - The era of large models has arrived, becoming a core component of intelligent infrastructure supporting various applications such as language processing, visual analysis, and robotics [1]. - Most deployed large models are "static" and lack the ability for dynamic learning and self-improvement, which is essential for achieving artificial general intelligence (AGI) [2][3]. - Catastrophic forgetting occurs when models lose previously learned skills while learning new tasks, posing a significant challenge for long-term learning agents [3]. Group 2: Research Insights on Catastrophic Forgetting - Researchers have proposed various methods to address catastrophic forgetting, including regularization, experience replay, and parameter tuning [5]. - A recent study from MIT's Improbable AI Lab revealed fundamental patterns and training strategies related to forgetting in large models, gaining significant attention [6][7]. Group 3: Findings from the Study - The study compared two common post-training methods: supervised fine-tuning (SFT) and reinforcement learning (RL), finding that RL is less prone to forgetting [8][29]. - A new principle called the "forgetting law" was introduced, indicating that the KL divergence between the fine-tuned policy and the base policy is a key predictor of forgetting [10][30]. - The research demonstrated that RL maintains better retention of prior knowledge while learning new tasks compared to SFT, which often sacrifices old knowledge for new performance [15][29].
Group 4: Mechanisms and Theoretical Contributions - The study identified that the online nature of RL contributes to its KL divergence minimization, which helps retain prior knowledge [21][30]. - The authors provided a theoretical basis for RL's KL-minimizing behavior, explaining that RL naturally prefers solutions closer to the original model [24][30]. - The findings suggest that future training methods should aim to minimize KL divergence to achieve continuous learning without forgetting [31][32].
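The "forgetting law" above ties forgetting to the KL divergence between the fine-tuned and base policies. A minimal sketch of that quantity on toy next-token distributions follows; the numbers and the direction of drift are invented for illustration, and the study's actual measurement protocol is in the paper:

```python
import math

def kl_divergence(p, q):
    """Forward KL divergence D(p || q) between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy next-token distributions over a 4-token vocabulary.
base = [0.40, 0.30, 0.20, 0.10]
sft_tuned = [0.05, 0.05, 0.10, 0.80]  # drifts far from the base policy
rl_tuned = [0.30, 0.25, 0.20, 0.25]   # stays close (RL's KL-minimizing bias)

kl_sft = kl_divergence(base, sft_tuned)
kl_rl = kl_divergence(base, rl_tuned)
print(f"KL(base, SFT-tuned): {kl_sft:.3f}")
print(f"KL(base, RL-tuned):  {kl_rl:.3f}")
```

Under the forgetting law, the smaller divergence in the RL-style case is exactly what predicts less catastrophic forgetting, which motivates the paper's suggestion that future training methods should explicitly minimize KL divergence from the base model.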
DPad: a middle way for diffusion LLMs; Duke's Yiran Chen team achieves up to 61x training-free inference acceleration
机器之心· 2025-09-09 08:56
Author team: from the Duke University CEI Center. The work was completed by interns Xinhua Chen and Sitao Huang together with Dr. Cong Guo, advised by Prof. Hai Li and Prof. Yiran Chen.

Diffusion large language models (dLLMs), with parallel decoding and a distinctive global-planning ability, promise to overcome the efficiency bottleneck and planning deficits of autoregressive (AR) models. But their global planning relies on bidirectional attention over all subsequent text, which introduces severe computational redundancy and leaves the potential of existing open-source models largely untapped. Current dLLMs face a battle of routes: one keeps global planning but is extremely slow at inference ("global bidirectional attention", e.g., LLaDA); the other chases speed at the cost of planning ("block-wise bidirectional attention", e.g., Block Diffusion). How to reconcile these two routes, so a model can both keep the big picture and speed up inference, has become a growing concern in the field. Duke's Yiran Chen team took a different path: they revealed the "scratchpad mechanism" by which dLLMs achieve global planning and found it highly redundant. Based on this, they propose DPad (Diffusion Scratchpad), a training-free method that drops large numbers of ineffective suffix tokens a priori, drastically cutting computation while preserving the core planning ability, charting a middle route between the two camps. Combined with existing optimization techniques, the method achieves almost no loss ...
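As a rough sketch of the suffix-dropping idea (an assumption-laden illustration of the concept, not the team's implementation): at each decoding step, keep only a nearby window of suffix positions and drop the distant ones before attention, rather than attending to every future token.

```python
def kept_suffix_positions(current: int, seq_len: int, window: int) -> list[int]:
    """Of the suffix positions after `current`, keep only a nearby window;
    distant suffix tokens are dropped before attention
    (the kind of redundancy DPad targets)."""
    return list(range(current + 1, min(current + 1 + window, seq_len)))

seq_len = 1024
current = 8
full_suffix = seq_len - current - 1   # what fully bidirectional attention would cover
kept = kept_suffix_positions(current, seq_len, window=32)
print(len(kept), "of", full_suffix)   # → 32 of 1015 suffix positions retained
```

Even this crude pruning shows why the savings can be large: attention over the suffix shrinks from roughly a thousand positions to a few dozen, while nearby future tokens, the ones most useful for planning, are preserved.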
Silicon Valley's 996, confirmed? The AI fire is burning away Silicon Valley's weekends
机器之心· 2025-09-09 08:56
Core Viewpoint - The "996" work culture, initially seen as a phenomenon unique to Chinese tech companies, is increasingly becoming a reality in Silicon Valley, with evidence of longer working hours and changes in employee consumption patterns [2][3][9]. Group 1: Evidence of 996 in Silicon Valley - A blog post by Ara Kharazian, an economist at fintech company Ramp, highlights the increase in Saturday work hours among employees in San Francisco, reflected in their consumption trends [3][7]. - Data from Ramp shows a significant increase in dining and takeout spending on Saturdays in 2025 compared to 2024, indicating that employees are working longer hours on weekends [7][8]. - This trend is unique to San Francisco, as other major tech hubs do not show a similar increase in Saturday spending, with New York's increase being only a quarter of that in San Francisco [8][9]. Group 2: Broader Implications and Reactions - The increase in Saturday spending is not limited to tech companies but is observed across various industries in San Francisco, suggesting a widespread adoption of longer working hours [9]. - Some industry leaders express concerns that forcing employees to work long hours can lead to talent attrition, ultimately harming company progress [18][20]. - The phenomenon of "996" is contrasted with a more relaxed work culture in Europe, where the concept of "996" humorously refers to taking significant time off rather than long working hours [25][26].
Altman personally wrote a blog post praising them. Who are these two outstanding researchers?
机器之心· 2025-09-09 06:45
Core Viewpoint - OpenAI's recent advancements in AI technology, particularly with ChatGPT, are attributed to the contributions of two key researchers, Jakub Pachocki and Szymon Sidor, who have effectively combined cutting-edge research with engineering practices to solve numerous challenges [1][3][4]. Group 1: Contributions of Jakub Pachocki - Jakub Pachocki is recognized as a pivotal figure at OpenAI, serving as the Chief Scientist and leading significant projects such as the development and pre-training of GPT-4 [4][8]. - He played a crucial role in the OpenAI Five project, where AI defeated human champions in the game Dota 2, which bolstered confidence in the potential of large-scale reinforcement learning (RL) [4][8]. - Pachocki's academic background includes a focus on high-dimensional convex optimization, which is closely related to the training of modern neural networks [6][8]. Group 2: Contributions of Szymon Sidor - Szymon Sidor, who graduated from MIT, has made significant contributions to various core projects at OpenAI, including the development of large-scale RL systems and advancements in robotics [12][13]. - His early research explored the intersection of reinforcement learning and natural language processing (NLP), laying the groundwork for techniques used in aligning ChatGPT and training reasoning models [12][14]. - Sidor's involvement in the OpenAI Five project and his contributions to the GPT-4 technical report highlight his integral role in the company's advancements [13][14]. Group 3: Internal Dynamics and Leadership Changes - Following the unexpected dismissal of CEO Sam Altman, both Jakub Pachocki and Szymon Sidor, along with other key personnel, resigned in protest, which triggered a significant employee backlash [16][17]. - The internal crisis led to a restructuring of OpenAI's leadership, with Pachocki being appointed as the new Chief Scientist after Altman's return [17].