机器之心

Cold water for "world models" too? Eric Xing and colleagues expose five major flaws and propose a new paradigm
机器之心· 2025-07-09 07:10
机器之心 report. Editors: Zenan, +0. Today's world models deserve criticism. We know that large language models (LLMs) produce output by predicting the next word of a conversation, and the resulting abilities in dialogue, reasoning, and even creative work already approach human-level intelligence. Yet there remains a plainly visible gap between models like ChatGPT and true AGI. If we could perfectly simulate every possible future in an environment, could we then create a powerful AI? Consider humans: unlike ChatGPT, human competence is composed of both concrete skills and deep, complex capabilities. An example of simulative reasoning: a person (perhaps a selfish one) helps someone who is crying by mentally simulating multiple possible outcomes. Humans can carry out a broad range of complex tasks, all built on the same cognitive architecture of the human brain. Could a single AI system accomplish all of these tasks as well? Paper: Critiques of World Models. Link: https://arxiv.org/abs/2507.05169 The researchers identify five key aspects of building and training a world model: 1) identifying and preparing training data that contains information about the target world; 2) adopting a general representation space for latent world states, whose semantics may be richer than the directly observed data; 3) designing architectures that can reason effectively over these representations; 4) choosing objective functions that correctly guide model training; ...
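The listed stages (data, latent representation, reasoning architecture, objective) can be sketched as a minimal latent-state world-model interface. Every name below is a hypothetical illustration, not code from the paper:

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical sketch of the pipeline stages the paper enumerates:
# 2) encode observations into latent states, 3) reason over them, 4) score with an objective.
@dataclass
class WorldModel:
    encode: Callable[[List[float]], List[float]]        # observation -> latent state
    transition: Callable[[List[float], int], List[float]]  # (latent, action) -> next latent
    loss: Callable[[List[float], List[float]], float]   # objective guiding training

def rollout(model: WorldModel, obs: List[float], actions: List[int]) -> List[List[float]]:
    """Simulate possible futures in latent space rather than raw observation space."""
    z = model.encode(obs)
    states = [z]
    for a in actions:
        z = model.transition(z, a)
        states.append(z)
    return states

# Toy instance: the latent state is a running sum; each action shifts it.
toy = WorldModel(
    encode=lambda obs: [sum(obs)],
    transition=lambda z, a: [z[0] + a],
    loss=lambda pred, target: abs(pred[0] - target[0]),
)
print(rollout(toy, [1, 2], [3, -1]))  # [[3], [6], [5]]
```

The point of the sketch is the separation of concerns: imagined futures are unrolled in the representation space, so the choice of encoder and objective can be varied independently.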
Million-yuan prizes + top-tier resources! A call for AI entrepreneurs!
机器之心· 2025-07-09 04:23
Focusing on the cross-domain fusion of AI in technological innovation and industrial applications; pushing AI models from the lab into real-world scenarios and building an AI ecosystem. Scan the QR code at the end of the article to register; we look forward to your participation! "AI Empowering the Future: Boundless Possibilities of Innovation and Application." 复曜青溪·智链长三角: the Bank of Shanghai Cup AI Innovation and Entrepreneurship Competition has officially launched! This is not only an arena for technology, but an incubator for dreams. ...
Breaking: Columbia students build an AI "demon-revealing mirror" to counter the AI cheating tool developed by a Columbia dropout
机器之心· 2025-07-09 04:23
Core Viewpoint - The article discusses the emergence of Cluely, an AI tool designed to assist users in meetings by capturing audio and acting on their behalf, which has sparked controversy and led to the development of a counter tool called Truely to detect its use [1][2][3]. Group 1: Cluely Overview - Cluely is described as an AI desktop assistant that can listen and record audio during meetings, effectively allowing it to participate in discussions on behalf of the user [1]. - The tool has gained significant attention, with promotional claims suggesting it could disrupt nine industries, leading to over 2.93 million views on related tweets [2]. Group 2: Truely Development - Truely, developed by students from Columbia University, aims to detect whether a user is interacting with a real person or a Cluely-powered assistant during video calls [4][5]. - The detection process involves sending an application to the other party, which monitors for the presence of Cluely on their device, alerting the user if detected [7]. Group 3: Truely Features - Truely includes features such as real-time process monitoring, automatic joining of Zoom meetings as a bot, and sending alerts in chat when suspicious processes are detected [9]. - The application requires the other party to install software, which raises concerns about security and the complexity of the process [8]. Group 4: Legal and Ethical Concerns - Cluely has taken legal action against a security researcher for sharing reverse-engineered prompts related to its software, raising ethical questions about its approach to security research [13][14]. - The researcher expressed concerns about the implications of legal threats against security researchers and called for Cluely to be more open to collaboration [15].
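Truely's real-time process monitoring (Group 3) can be illustrated with a minimal sketch: take a snapshot of running process names and flag any that match a watched keyword. This is an assumption-laden illustration built on the POSIX `ps` utility, not Truely's actual implementation:

```python
import subprocess
from typing import List

def match_processes(names: List[str], keywords: List[str]) -> List[str]:
    """Return process names containing any watched keyword (case-insensitive)."""
    return [n for n in names if any(k.lower() in n.lower() for k in keywords)]

def running_process_names() -> List[str]:
    """Snapshot of running process command names via POSIX `ps` (stdlib only).
    The trailing '=' in the format spec suppresses the header line."""
    out = subprocess.run(["ps", "-axo", "comm="], capture_output=True, text=True).stdout
    return [line.strip() for line in out.splitlines() if line.strip()]

# Illustrative check on a hypothetical process list.
print(match_processes(["zoom.us", "Cluely Helper", "bash"], ["cluely"]))  # ['Cluely Helper']
```

A real detector would also need to run on the *other* party's machine, which is exactly why Truely asks the counterpart to install its application, and why the article notes the security concerns that requirement raises.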
OpenAI poaches back four senior engineers from Tesla, xAI, and Meta, with Stargate in its sights
机器之心· 2025-07-09 04:23
Core Viewpoint - The article discusses the intense competition for AI talent between major companies like OpenAI and Meta, highlighting recent talent acquisitions and the implications for the industry [1][2][8]. Group 1: Talent Acquisition - OpenAI has recently hired four prominent engineers from competitors, including David Lau, former software engineering VP at Tesla, and others from xAI and Meta [3][5][6]. - Meta has aggressively recruited at least seven employees from OpenAI, offering high salaries and substantial computational resources to support their research [8][18]. - The competition for talent has escalated, with OpenAI's Chief Research Officer Mark Chen expressing a strong commitment to countering Meta's recruitment efforts [19]. Group 2: Strategic Initiatives - OpenAI's expansion team, which includes the new hires, is focused on building AI infrastructure, including a significant joint project named "Stargate," aimed at developing a supercomputer with a projected cost of $115 billion [7]. - The new hires emphasize the importance of infrastructure in bridging research and practical applications, with Uday Ruddarraju describing Stargate as a "moonshot" project [7][8]. - The competition has prompted OpenAI to reconsider its compensation strategies to retain top talent amidst the aggressive recruitment by Meta [8]. Group 3: Industry Context - The AI industry has seen a surge in talent competition since the launch of ChatGPT in late 2022, with companies re-evaluating their hiring practices to secure leading researchers [13][15]. - Discussions around achieving "Artificial Superintelligence (ASI)" have become more prevalent, indicating a shift in focus towards groundbreaking technological advancements [14]. - The article notes that scaling capabilities are crucial for AI development, as using more data and computational power enhances model performance [16][17].
Given a team of top-tier AIs, how should they be organized for maximum effectiveness? UIUC seeks the answer with a new multi-agent collaboration benchmark
机器之心· 2025-07-09 04:23
Core Viewpoint - The article discusses the emergence of AI teams that collaborate like human teams in software development and scientific research, highlighting the need for effective evaluation metrics for these multi-agent systems [2][3]. Group 1: Introduction of MultiAgentBench - MultiAgentBench is introduced as a comprehensive benchmark for evaluating the collaboration and competition capabilities of LLM-based multi-agent systems [4][6]. - It aims to fill the gap in existing evaluation metrics that focus primarily on individual agent capabilities rather than the essential aspects of collaboration efficiency and communication quality [3][6]. Group 2: Key Findings and Contributions - The research reveals that the gpt-4o-mini model exhibits the strongest overall task performance among various models [8]. - The decentralized collaboration model using a graph structure is found to be the most efficient, while cognitive self-evolution planning significantly enhances task completion rates [8][12]. - MultiAgentBench identifies critical moments where agents begin to exhibit emergent social behaviors, providing insights into achieving AGI-level collaboration [9][12]. Group 3: Evaluation Framework - The framework includes a collaboration engine, an agent graph to structure relationships, and a cognitive module for personalized information and adaptive strategies [12][15]. - It incorporates diverse interaction strategies and six varied evaluation scenarios, simulating real-world team dynamics [19][20]. Group 4: Performance Metrics - The evaluation system uses milestone-based KPIs to assess task completion and collaboration quality, including task scores, communication scores, and planning scores [27][28]. - The findings indicate that high collaboration does not always correlate with superior task outcomes, emphasizing the importance of individual agent capabilities [30][32]. 
Group 5: Organizational Structure and Team Dynamics - The study highlights that decentralized organizational structures outperform hierarchical ones, which can lead to communication costs and inefficiencies [38]. - The "Ringelmann Effect" is observed, where increasing the number of agents can lead to diminishing returns in performance, underscoring the need for efficient collaboration mechanisms [40]. Group 6: Emergence of Social Intelligence - Notable emergent behaviors, such as strategic silence and trust differentiation, are observed in competitive scenarios, indicating a shift from pure logical reasoning to initial social behavior capabilities in AI agents [43][44]. - The findings suggest that under the right conditions, AI can learn and exhibit advanced social behaviors, marking a significant step towards more sophisticated artificial intelligence [48].
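The milestone-based KPIs described in Group 4 can be sketched as a simple composite score blending task completion with judged collaboration quality. The weighting scheme and function name below are hypothetical illustrations, not MultiAgentBench's actual formula:

```python
def collaboration_score(milestones_hit: int, total_milestones: int,
                        comm_score: float, plan_score: float,
                        weights: tuple = (0.5, 0.25, 0.25)) -> float:
    """Hypothetical composite in the spirit of milestone-based KPIs:
    the task-completion ratio blended with communication and planning
    quality scores (each assumed normalized to [0, 1])."""
    task_ratio = milestones_hit / total_milestones
    w_task, w_comm, w_plan = weights
    return w_task * task_ratio + w_comm * comm_score + w_plan * plan_score

# A team that hit 3 of 4 milestones with decent communication and planning.
print(collaboration_score(3, 4, comm_score=0.8, plan_score=0.6))  # 0.725
```

Separating the task axis from the collaboration axes is what lets the benchmark surface the finding above: high collaboration scores do not always correlate with superior task outcomes.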
Stanford graduates building agents with RL: Chinese founding team raises a $12 million seed round
机器之心· 2025-07-09 00:50
Core Viewpoint - Pokee AI has officially launched its public testing version, marking a significant milestone in its development journey and attracting attention from investors and the industry [1][8]. Group 1: Company Development - The company Pokee.ai was founded in October 2022, focusing on developing an interactive, personalized, and efficient AI Agent [4][9]. - The company has recently completed a $12 million seed round of financing led by Point72 Ventures, indicating strong investor interest [8]. - The pace of development has been rapid, with the product moving from concept validation to public testing in just over seven months [9]. Group 2: Technology and Approach - Unlike mainstream AI Agents that primarily utilize Large Language Models (LLM), Pokee.ai is centered around Reinforcement Learning (RL), with LLM serving as a user interface layer [5][17]. - The architecture allows for a more dynamic decision-making process, where RL models can utilize a broader action space compared to traditional LLMs [17]. - The ultimate goal is to create an AI Agent that can operate without extensive human configuration, allowing users to simply provide prompts for task completion [14][15]. Group 3: Market Perception and Challenges - Initially, many investors were skeptical about the RL-based approach, viewing it as unrealistic; however, perceptions have shifted as the technology gains traction [7][11]. - The challenge of aligning user intent with AI responses is significant, as users may not always articulate their needs clearly, complicating the AI's ability to deliver accurate results [18][20]. - The industry is still in the early stages of developing effective AI Agents, with many foundational steps yet to be completed [21]. Group 4: Team and Operations - The core team has expanded from four to seven members, with plans for further growth, but the company aims to maintain a lean structure to enhance efficiency [26][27]. 
- The company operates entirely online, leveraging remote work practices that have become common in the tech industry, allowing for flexibility and high productivity [30].
Which reasoning steps in a long chain of thought matter most? Three methods pinpoint an LLM's "critical sentences"
机器之心· 2025-07-09 00:50
Core Viewpoint - The article discusses the importance of identifying key reasoning steps in large language models (LLMs) to enhance their interpretability, debuggability, and safety [2][6]. Group 1: Research Methods - The authors propose three complementary methods to analyze the reasoning process of LLMs, aiming to identify critical steps known as "thought anchors" that significantly influence subsequent reasoning [6][13]. - The first method is a black-box approach that measures the impact of sentences on final answers through counterfactual analysis, comparing the answer distributions with and without specific sentences [9][18]. - The second method is a white-box approach that identifies key sentences through attention patterns, revealing how these sentences affect the reasoning trajectory [10][24]. - The third method is a causal attribution approach that directly measures causal relationships between sentences by suppressing attention to specific sentences and observing the impact on subsequent logits [11][29]. Group 2: Findings and Implications - Each method provides evidence for the existence of thought anchors, which are crucial reasoning steps that disproportionately affect the reasoning process [13][15]. - The research indicates that planning generation and uncertainty management sentences consistently exhibit higher counterfactual importance compared to other sentence categories, supporting the idea that high-level organizational sentences can anchor and guide reasoning trajectories [23][25]. - The authors provide an open-source tool for visualizing the outputs of these methods, which can aid in debugging reasoning failures and identifying sources of unreliability [14][15]. Group 3: Case Study - The article includes a case study demonstrating the practical application of the proposed methods, using a specific problem involving the conversion of a hexadecimal number to binary [34][36]. 
- The resampling method reveals the initial incorrect reasoning trajectory and key turning points, highlighting the importance of specific sentences in achieving the correct answer [37][39]. - Attention analysis shows that the model's reasoning process is organized into distinct computational modules, with key sentences driving the flow of information and resolving contradictions [40][42].
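The black-box counterfactual method can be sketched as comparing two empirical answer distributions, resampled with and without a candidate sentence. The total-variation metric and all sample data below are illustrative assumptions, not the paper's exact measure:

```python
from collections import Counter
from typing import List

def counterfactual_importance(answers_with: List[str], answers_without: List[str]) -> float:
    """Sketch of a black-box 'thought anchor' probe: how much does removing one
    sentence shift the resampled answer distribution? Measured here as total
    variation distance between the two empirical distributions (0 = no shift)."""
    p, q = Counter(answers_with), Counter(answers_without)
    n, m = len(answers_with), len(answers_without)
    support = set(p) | set(q)
    return 0.5 * sum(abs(p[a] / n - q[a] / m) for a in support)

# Hypothetical resamples for the hex-to-binary case study: with the candidate
# sentence the model converges on one answer; without it, answers scatter.
with_s = ["11110011"] * 9 + ["11110001"]
without_s = ["11110011"] * 4 + ["11110001"] * 3 + ["11100011"] * 3
print(counterfactual_importance(with_s, without_s))  # 0.5
```

A sentence whose removal produces a large distribution shift is a candidate thought anchor; sentences with near-zero shift are interchangeable filler.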
A 5-million-video dataset plus a brand-new evaluation framework! Peking University open-sources OpenS2V-Nexus, new infrastructure for subject-consistent video generation that is both "faithful" and "natural"
机器之心· 2025-07-08 09:41
Want AI to "generate a consistent and natural short video just from your selfie"? That is exactly the problem Subject-to-Video (S2V) generation tackles: making video generation not only align with the text prompt but also faithfully preserve the features of a specified person or object, so the output is both "faithful" and "natural". This capability matters greatly for short-video generation, virtual humans, AI editing, and more. However, training and evaluating such models has long lacked publicly available large-scale datasets and fine-grained evaluation benchmarks, limiting rapid progress in S2V. To address this, the Peking University team released OpenS2V-Nexus, a fully open-source suite built specifically for S2V generation: OpenS2V-Eval, the world's first fine-grained S2V benchmark targeting subject consistency, naturalness, and text alignment, making different models truly comparable on subject consistency; and OpenS2V-5M, the world's first public dataset of 5 million high-quality 720P subject-text-video triples covering both real and synthetic data, helping researchers quickly train stronger generation models. The team also systematically evaluated 18 representative S2V models, revealing for the first time the real capability gaps of mainstream models in maintaining subject consistency and naturalness. With OpenS2V-Nexus, AI video generation research no longer gropes in the dark, making training more efficient and evaluation more ...
Still struggling over AI data? Wentao Zhang and academician Weinan E's team release a data-centric AI system
机器之心· 2025-07-08 09:41
In recent years, large-model development has been driven mainly by big tech companies, whose lead rests on massive, high-quality data resources. These companies, however, generally do not release their raw data or data-processing tools, leaving academia far behind, and heavily constrained, in constructing and optimizing large-model training data. Despite the many datasets open-sourced in recent years, academia still faces numerous challenges in data preparation for large models. Cleaning and constructing training data still largely depends on each research team working behind closed doors, without systematic, efficient tool support. Existing data-processing tools such as Hadoop and Spark mostly offer operators geared toward traditional methods and have yet to effectively integrate intelligent operators based on the latest large language models (LLMs), offering limited support for building training data for advanced models. To address this, Wentao Zhang and academician Weinan E's team propose DataFlow, a data-centric AI system. It implements more than 100 data-governance operators (Operators) based on rules, local large models, or large-model APIs, and on top of these builds 8 preset data-processing pipelines (Pipelines) covering mainstream data-governance needs, including: cleaning, augmentation, and evaluation of large-scale noisy data (such as PDF documents, plain text, low-quality QA data, and crawled data); synthesis of strong-reasoning data with chains of thought; and RAG data extraction and synthesis. The system lets users flexibly compose existing operators and develop new operators ...
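DataFlow's operator-and-pipeline design can be illustrated with a minimal sketch in which rule-based operators compose into a cleaning pipeline. The operator names and type shapes here are assumptions for illustration, not DataFlow's actual API:

```python
from typing import Callable, List

# Hypothetical shape: an Operator maps a batch of text records to a filtered/
# transformed batch; a Pipeline is just a left-to-right composition of operators.
Operator = Callable[[List[str]], List[str]]

def pipeline(*ops: Operator) -> Operator:
    """Compose operators into a single pipeline, applied in order."""
    def run(records: List[str]) -> List[str]:
        for op in ops:
            records = op(records)
        return records
    return run

# Three toy rule-based operators in the spirit of noisy-data cleaning.
strip_empty: Operator = lambda rs: [r for r in rs if r.strip()]
dedup: Operator = lambda rs: list(dict.fromkeys(rs))  # order-preserving dedup
min_length: Operator = lambda rs: [r for r in rs if len(r) >= 5]

clean = pipeline(strip_empty, dedup, min_length)
print(clean(["hello world", "", "hello world", "hi"]))  # ['hello world']
```

An LLM-based operator would have the same signature, with the model call inside, which is what lets rule-based and model-based steps interleave freely in one pipeline.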
KAG-Thinker: a new paradigm of "structured" thinking that supports logically rigorous complex reasoning in large models
机器之心· 2025-07-08 06:54
Core Viewpoint - The article discusses the release of the KAG-Thinker model by Ant Group's Knowledge Engine team in collaboration with Zhejiang University and Tongji University, focusing on structured reasoning for complex tasks, enhancing logical consistency and stability in reasoning processes. Group 1: Model Development and Features - KAG-Thinker is an important upgrade of the KAG framework, designed to construct a stable and interpretable reasoning paradigm for complex tasks in both general and specialized fields [1][3] - The model utilizes a dual semantic representation mechanism of natural language and logical functions to better leverage structured knowledge [3] - It combines breadth splitting and depth solving to improve the rigor of problem-solving, introducing a knowledge boundary determination mechanism centered on knowledge point alignment [3][10] Group 2: Performance and Evaluation - Experimental results show that KAG-Thinker outperforms state-of-the-art deep search methods by an average of 4.1% across seven single-hop and multi-hop reasoning datasets [6][24] - In single-hop datasets, KAG-Thinker achieved an average improvement of 4.5%, while in multi-hop datasets, the improvement was 3.9% [25] - The model demonstrated effectiveness in specialized fields, particularly in medical question-answering tasks, indicating its potential for fine-tuning in other professional domains [6][39] Group 3: Framework Integration and Stability - The KAG framework version 0.8 enhances knowledge base capabilities, supporting structured and unstructured data integration, and allows developers to customize indexing [28][29] - KAG-Thinker, integrated with the KAG framework, shows an average performance improvement of 3.0% in EM and 3.8% in F1 metrics compared to the standalone Thinker model [31] - Stability tests indicate that KAG-Thinker 7B outperforms previous versions in terms of consistent problem decomposition, achieving an average improvement of 17.9% and 7.6% under common temperature parameters [33]