Workflow
机器之心
How do large models generalize to multi-agent reasoning? Tsinghua proposes MARSHAL, a strategy-game self-play approach
机器之心· 2026-01-09 04:08
Core Insights
- The MARSHAL framework, developed by Tsinghua University and other institutions, utilizes reinforcement learning for self-play in strategy games, significantly enhancing the reasoning capabilities of large models in multi-agent systems [2][7][31]
- The framework addresses two main challenges in multi-agent systems: credit assignment in multi-round interactions and advantage estimation among heterogeneous agents [5][7]

Background and Challenges
- Existing models like DeepSeek-R1 have shown the value of verifiable reward reinforcement learning (RLVR) in single-agent scenarios, but its application in complex multi-agent interactions is still in exploration [5]
- The two core technical challenges identified are:
  1. Credit assignment in multi-round interactions, where existing methods struggle to accurately trace back results to specific actions [5]
  2. Advantage estimation among heterogeneous agents, which complicates joint training and leads to performance volatility [7]

MARSHAL Method Introduction
- MARSHAL employs the Group-Relative Policy Optimization (GRPO) architecture and introduces two key algorithmic improvements to enhance multi-agent reasoning capabilities [12][14]
- The framework was tested using six strategy games, with three for training and three for testing, covering a range of competitive and cooperative scenarios [12]

Core Experiments
- The MARSHAL-trained expert agents demonstrated a significant performance increase, achieving up to 28.7% higher win rates in testing games [13][19]
- The model showed remarkable generalization capabilities, with accuracy improvements of 10.0% on AIME and 7.6% on GPQA across various reasoning tasks [19][20]

Reasoning Mode Analysis
- Qualitative analysis revealed that the training in games fostered two emergent capabilities, role-awareness and intent recognition, which are crucial for decision-making in uncertain environments [22]
- Quantitative analysis indicated that MARSHAL reduced inter-agent misalignment by 11.5%, enhancing communication efficiency among agents [24]

Ablation Studies
- Self-play training outperformed fixed-opponent training, as models trained against fixed opponents tended to overfit, leading to poor performance in testing scenarios [26]
- The necessity of the turn-level advantage estimator and agent-specific advantage normalization was confirmed, highlighting their importance in handling long-sequence decisions and addressing reward distribution differences [28]

Conclusion
- The MARSHAL framework successfully enhances the reasoning capabilities of large language models in multi-agent systems through self-play in strategy games, indicating potential for broader applications in complex multi-agent environments [31][34]
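To make the credit-assignment and normalization ideas in the summary above concrete, here is a minimal sketch of GRPO-style group-relative advantages computed per turn and standardized per agent. It is an illustration under stated assumptions (turn-level credit via reward-to-go, standardization within each agent's rollout group), not MARSHAL's actual implementation; all function and variable names are hypothetical.

```python
# Minimal sketch (assumptions, not the paper's exact formulation): group-relative
# advantages in the GRPO style, computed per turn and normalized separately for
# each agent so that heterogeneous reward scales do not dominate joint training.
import numpy as np

def grpo_multiagent_advantages(turn_rewards, eps=1e-8):
    """turn_rewards: dict mapping agent_id -> array of shape (num_rollouts, num_turns)
    holding the reward credited to each turn of each sampled rollout (self-play group).
    Returns per-agent, per-turn advantages."""
    advantages = {}
    for agent_id, rewards in turn_rewards.items():
        rewards = np.asarray(rewards, dtype=np.float64)
        # Turn-level credit: score each turn by the return from that turn onward,
        # so later outcomes are traced back to the actions that produced them.
        returns = np.flip(np.cumsum(np.flip(rewards, axis=1), axis=1), axis=1)
        # Group-relative baseline: compare each rollout against the group mean
        # at the same turn (GRPO-style, no learned critic).
        baseline = returns.mean(axis=0, keepdims=True)
        adv = returns - baseline
        # Agent-specific normalization: standardize within this agent only,
        # so agents with different reward distributions contribute comparably.
        adv = (adv - adv.mean()) / (adv.std() + eps)
        advantages[agent_id] = adv
    return advantages

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    rewards = {"attacker": rng.normal(0.0, 1.0, size=(8, 5)),
               "defender": rng.normal(5.0, 3.0, size=(8, 5))}
    adv = grpo_multiagent_advantages(rewards)
    print({k: v.shape for k, v in adv.items()})
```

The per-agent standardization is what keeps an agent with a wide reward range from dominating the joint gradient, which is the failure mode the ablation on agent-specific advantage normalization points at.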
Whose changelog is this long? Claude Code's version update draws a crowd, shipping 1,096 commits in one go
机器之心· 2026-01-09 04:08
Editor | Zhang Qian

If you are a Claude Code user, you may have noticed a major version update recently: it jumped from 2.0.76 before the holiday to 2.1.0. And this time, the changelog takes several screens of scrolling to get through. According to Claude's own summary, the update roughly covers the following types of changes:

After reading through the log, netizens could not stay calm. Some wondered, "Is there a superintelligent agent writing their code for them?" Others joked, "Please, somebody go tell them what a rolling release is," and "At this pace, we'll have a new operating system by Friday morning."

Anthony Morris ツ (@amorriscode · Jan 8): "boris should go on vacation more often"
Adam Hawley (@_adamjhawley · Jan 8): "Someone tell them about rolling releases pls" ...
The Agent 2.0 era is here: the first batch of "industrial-grade" AI agents is taking up core positions
机器之心· 2026-01-09 04:08
Core Insights
- The article discusses the transformative impact of AI tools on work efficiency, suggesting that if these tools had been available earlier, many tasks could have been completed much faster [2][5]
- A new working paradigm centered around AI agents is emerging, significantly altering workflows in development and data analysis [5]

Group 1: AI Tools and Efficiency
- AI tools have led to substantial reductions in project completion times, with engineers from major tech companies sharing their experiences [2][5]
- The focus of AI applications is shifting from validating usability to realizing actual value, with upgrades to application components aimed at lowering the entry barrier for users [10]

Group 2: Alibaba Cloud's Bailian Upgrades
- Alibaba Cloud's Bailian has undergone a comprehensive upgrade, marking the transition of AI agents from a "handcrafted workshop" era to an "industrial assembly line" era [6]
- The upgraded Bailian framework follows a "1+2+N" blueprint, which encompasses model and cloud services, high-code and low-code development paradigms, and task-specific development components [6]

Group 3: Multi-modal Data Integration
- The ability to integrate and utilize multi-modal data is crucial for large-scale AI applications, with Bailian enhancing its multi-modal knowledge base capabilities to support various file types [12][15]
- Bailian's upgrades allow for flexible processing of multi-modal data, enabling users to orchestrate document, image, audio, and video data through a visual interface [13]

Group 4: Asynchronous API and Cost Efficiency
- Bailian has introduced an asynchronous API that extends the timeout limit for long-running tasks from 5 minutes to over 24 hours, ensuring stable execution of lengthy tasks [18]
- The idle scheduling feature of Bailian can reduce AI inference costs by over 50% [19]

Group 5: Development Framework
- Bailian provides dual-mode development capability, allowing high-code and low-code approaches to coexist and catering to different roles within enterprises [23]
- The upgraded Agent 2.0 architecture enhances task planning and introduces a "Plan-Execute-React" feedback loop, improving the overall development process [26]

Group 6: Model and Cloud Services
- The model service layer of Bailian has been strengthened to enhance enterprise-level capabilities, supporting structured metadata display and multi-model comparisons [33]
- Bailian offers native training and fine-tuning capabilities for its models, enabling businesses to create customized models using their own data [36]

Group 7: Security and Deployment
- Bailian's confidential inference service utilizes a trusted execution environment to provide high-security model inference capabilities [37]
- The release of the enterprise version of the Agent platform allows AI agents to be developed and deployed in private clouds and on-premises environments [40]

Group 8: Industry Implications
- The upgrades to Bailian are expected to lower the barriers to AI technology adoption across various industries, facilitating the emergence of AI as a capable "digital employee" [43][45]
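The asynchronous API point in Group 4 is essentially a submit-and-poll contract: the client gets a task handle immediately, and the job can run far longer than any single synchronous request would allow. The sketch below shows that generic pattern; the base URL, endpoint paths, and response fields are hypothetical placeholders, not the real Bailian API, and the 24-hour figure is taken from the summary above.

```python
# Illustrative sketch only: a generic submit-and-poll pattern for long-running tasks.
# The endpoint paths and field names below are hypothetical placeholders, not the
# actual Bailian asynchronous API; they only show how an async interface lets a job
# run far past a synchronous HTTP timeout.
import time
import requests  # third-party; pip install requests

BASE_URL = "https://example.invalid/agent-platform"  # placeholder, not a real endpoint

def submit_task(payload: dict) -> str:
    resp = requests.post(f"{BASE_URL}/tasks", json=payload, timeout=30)
    resp.raise_for_status()
    return resp.json()["task_id"]          # hypothetical response field

def wait_for_result(task_id: str, poll_interval: float = 10.0, max_wait: float = 24 * 3600):
    deadline = time.time() + max_wait      # allow >24h instead of a short synchronous timeout
    while time.time() < deadline:
        resp = requests.get(f"{BASE_URL}/tasks/{task_id}", timeout=30)
        resp.raise_for_status()
        body = resp.json()
        if body["status"] in ("succeeded", "failed"):   # hypothetical status values
            return body
        time.sleep(poll_interval)
    raise TimeoutError(f"task {task_id} did not finish within {max_wait} seconds")
```

Where a platform supports it, a webhook callback is the usual alternative to polling; the contract is the same, only the delivery of the final status changes.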
AAAI 2026 Oral | Do large models "know it inside but struggle to say it"? Deep hidden cognition makes reasoning more reliable
机器之心· 2026-01-09 02:53
Core Insights
- The article discusses the advancements of large language models (LLMs) in reasoning tasks, particularly emphasizing the Chain-of-Thought (CoT) technique, which enhances model performance by generating intermediate reasoning steps before arriving at a final answer [2][6]
- A research team from Hefei University of Technology proposes that LLMs possess a "hidden cognition" that allows them to internally assess the correctness of their reasoning, even if this is not reflected in the token probabilities during generation [2][10]
- The paper introduces a framework that enables models to score their reasoning steps based on this hidden cognition, thereby improving the reliability of CoT [2][10]

Summary by Sections

Introduction
- The article highlights the growing application of LLMs in various reasoning tasks and the importance of maintaining stable and reliable reasoning quality throughout the generation process [6][8]
- It identifies factors that can affect the reliability of reasoning chains, such as subtle biases in understanding, expression noise, and cumulative errors in long chains [6][8]

Research Motivation
- The research aims to determine whether there are internal signals within the model that can reflect the reliability of the current reasoning step, potentially guiding the model toward more reliable paths [7][15]
- The study focuses on two key questions: whether discernible signals exist in internal activations, and whether a mechanism can feasibly be constructed to utilize these signals [8][15]

Methodology and Innovations
- The proposed method involves detecting "truth sensitivity" across multiple attention heads and training a simple probe on internal representations to assess which layers are most sensitive to reasoning correctness [10][11]
- A confidence predictor is constructed from the most sensitive attention heads to output reliability scores for each reasoning step, based on deep internal representations rather than token probabilities [12][21]
- The research introduces a confidence-guided search strategy that combines model generation probabilities with confidence scores to filter the most reliable reasoning paths [13][16]

Experimental Results
- The study evaluates the effectiveness of the confidence predictor and its application in guiding reasoning paths across various benchmarks, including both single-modal and multi-modal reasoning tasks [22][24]
- Results indicate that the proposed method consistently outperforms baseline models, achieving significant improvements in reasoning accuracy across different datasets [23][24]
- Ablation studies confirm the critical role of the confidence predictor in enhancing reasoning performance, with random selection of reasoning steps leading to a notable decline in effectiveness [25][27]
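As a rough illustration of the probe-plus-search idea (not the paper's exact architecture), the sketch below trains a logistic-regression probe on step-level hidden activations and mixes its confidence with the model's own log-probability when ranking candidate reasoning paths. The activation source, the mixing weight alpha, and all names are assumptions made for illustration.

```python
# Minimal sketch under stated assumptions: a logistic-regression probe trained on
# hidden activations of reasoning steps to predict step correctness, plus a scoring
# rule that mixes the model's own sequence log-probability with the probe's
# confidence when ranking candidate reasoning paths.
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_confidence_probe(step_activations: np.ndarray, step_correct: np.ndarray):
    """step_activations: (num_steps, hidden_dim) activations taken from the layer /
    attention heads found most sensitive to correctness; step_correct: 0/1 labels."""
    probe = LogisticRegression(max_iter=1000)
    probe.fit(step_activations, step_correct)
    return probe

def path_score(step_logprobs, step_activations, probe, alpha=0.5):
    """Combine average token log-probability with probe confidence; alpha is a
    hypothetical mixing weight, to be tuned on a validation set."""
    gen_score = float(np.mean(step_logprobs))             # model's own likelihood
    conf = probe.predict_proba(step_activations)[:, 1]    # per-step reliability
    conf_score = float(np.log(conf + 1e-8).mean())        # aggregate in log space
    return alpha * gen_score + (1.0 - alpha) * conf_score

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    acts, labels = rng.normal(size=(200, 64)), rng.integers(0, 2, size=200)
    probe = train_confidence_probe(acts, labels)
    print(path_score(rng.normal(-1.0, 0.3, size=6), rng.normal(size=(6, 64)), probe))
```

In a confidence-guided search, this score would be evaluated for each candidate continuation and the highest-scoring path kept, which is the filtering role the summary attributes to the confidence predictor.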
A DeepSeek moment for healthcare: Ant Group's AntAngelMed medical model officially open-sourced, topping authoritative leaderboards
机器之心· 2026-01-09 02:53
Core Insights
- The article discusses the transformative impact of AI on how people access medical information, highlighting the increasing reliance on AI tools like ChatGPT for health-related inquiries [1][2][3]

Group 1: AI in Healthcare
- OpenAI's report reveals that over 5% of global ChatGPT conversations are health-related, with 40 million daily inquiries about health issues [3]
- A significant portion of users employs AI to explore symptoms (60%) and to understand medical terminology or clinical advice (52%) [3]
- OpenAI launched ChatGPT Health to integrate personal health information with AI capabilities, aiding users in understanding their health status and making informed decisions [3]

Group 2: AntAngelMed Model
- AntAngelMed, developed by Ant Group in collaboration with Zhejiang health authorities, is an open-source medical model with 100 billion parameters, making it the largest in the medical field [5]
- The model has excelled in evaluations like HealthBench and MedAIBench, outperforming other general models and existing medical reasoning models [5][7]
- AntAngelMed ranks first on the MedBench leaderboard, showcasing its superiority in medical knowledge Q&A and ethical safety dimensions [7][8]

Group 3: Training and Architecture
- AntAngelMed employs a three-stage training process focused on building medical capabilities [12]
- The first stage involves continuous pre-training with high-quality medical data to establish a robust medical knowledge structure [14][15]
- The second stage includes supervised fine-tuning for real medical tasks, enhancing the model's reasoning stability and contextual understanding [16][17]
- The third stage utilizes reinforcement learning to ensure the model's responses are reliable and responsible, particularly in sensitive situations [18][20]

Group 4: Performance and Efficiency
- AntAngelMed's architecture is a high-efficiency mixture-of-experts (MoE) model, achieving up to 7 times the efficiency of dense architectures [30]
- The model can process over 200 tokens per second in an H20 hardware environment, significantly improving response times in medical applications [31]
- AntAngelMed's context length is extended to 128K, enhancing its ability to handle complex medical records and reports [33]

Group 5: Practical Applications
- AntAngelMed provides quick and detailed responses to health-related queries, offering personalized advice based on individual health conditions [37][40]
- The model's open-source nature allows for downstream task fine-tuning, lowering the barrier to advanced medical AI technology applications [44]
- Ant Group aims to promote an open-source ecosystem for AI in healthcare, facilitating broader access to innovative technologies for developers and users [44]
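The efficiency claim in Group 4 rests on sparse expert activation: each token passes through only a few experts instead of the full parameter set. Below is a generic top-k mixture-of-experts routing sketch to illustrate the mechanism; it is not AntAngelMed's architecture or code, and the expert count and k are arbitrary choices for the example.

```python
# Minimal sketch of why a sparse mixture-of-experts layer is cheaper than a dense one:
# each token is routed to only k experts, so most expert parameters stay idle per token.
# Generic illustration only, not AntAngelMed's actual architecture or implementation.
import numpy as np

def moe_layer(x, gate_w, expert_ws, k=2):
    """x: (tokens, d_model); gate_w: (d_model, num_experts);
    expert_ws: list of (d_model, d_model) expert weight matrices."""
    logits = x @ gate_w                                   # router scores per expert
    topk = np.argsort(logits, axis=-1)[:, -k:]            # keep only the k best experts
    sel = np.take_along_axis(logits, topk, axis=-1)       # softmax over selected experts
    weights = np.exp(sel - sel.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    out = np.zeros_like(x)
    for t in range(x.shape[0]):                           # activate only k experts per token
        for j in range(k):
            e = topk[t, j]
            out[t] += weights[t, j] * (x[t] @ expert_ws[e])
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, n_exp = 16, 8
    x = rng.normal(size=(4, d))
    print(moe_layer(x, rng.normal(size=(d, n_exp)),
                    [rng.normal(size=(d, d)) for _ in range(n_exp)]).shape)
```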
Listing tomorrow: MiniMax's IPO allocation has been snapped up in a frenzy
机器之心· 2026-01-08 14:24
Core Viewpoint
- MiniMax, a large-model company, is set to list on January 9, achieving a record in institutional subscriptions for Hong Kong IPOs with over 460 participating institutions and an oversubscription rate exceeding 70 times [1][2]

Subscription Details
- The previous record for subscriptions was held by CATL, which was 30 times oversubscribed when it went public in Hong Kong in 2025 [2]
- Demand for MiniMax's international placement reached $32 billion, with actual orders totaling $19 billion from over 460 institutions, resulting in an oversubscription of approximately 79 times after excluding cornerstone investors [2]
- Notable long-term funds and sovereign wealth funds participated, including those from Singapore, South Africa, the Middle East, and Canada, with several orders exceeding $1 billion [2]

Market Performance
- On January 8, MiniMax opened strongly in gray-market trading, peaking at HKD 211.2 per share and closing at HKD 205.6, a 24.6% increase [3]

Revenue Sources
- MiniMax's revenue is primarily derived from two segments, AI-native products and AI-based enterprise services, with AI-native products generating $38.02 million (over 70% of total revenue) and enterprise services contributing $15.41 million (28.9%) as of June 2025 [3][4]
- As of September 2025, MiniMax had accumulated 212 million users for its AI-native products, with over 1.77 million being paid users [3]

Financial Performance
- MiniMax reported a loss of approximately $180 million as of September 2025, with cash reserves exceeding $362 million [4]
- The company's business model is perceived as clear and diversifying, instilling confidence among investors regarding its path to break-even [5]
The ultimate PhD application guide: from preparation to decision, a step-by-step path to landing your dream offer
机器之心· 2026-01-08 09:34
Machine Heart Editorial Team

PhD application season is almost here again. It is a complex and tedious undertaking: endless research into schools, agonizing choices of research direction, piles of application materials to prepare, and the fate-deciding interview... It is impossible not to feel lost, anxious, even doubtful: will all this hard work actually earn a ticket to your dream school? And in the eyes of interviewers, what exactly should a "perfect candidate" look like...

Recently, Lucy Lai, a cognitive scientist and assistant teaching professor at UC San Diego, drew on her own experience as an applicant to Harvard's neuroscience PhD program, more than seven years of mock-interview experience, and her time as an interviewer for Harvard's PhD program to put together an "insider reference guide": Everything About PhD Applications.

The guide covers common PhD interview questions and how to answer them well, how admissions decisions are actually made, and a detailed explanation of the qualities and factors admissions committees value. Below, we look at the specific application advice the guide offers.

General application tips

How do you know you really want to go to graduate school?

Before any preparation begins, one question needs to be settled: you have genuinely decided to pursue graduate study. Lucy Lai suggests that if, while thinking it through, you feel your application materials are not yet strong enough, you might consider taking a year or a few years off. A good way to judge whether your application is strong enough is to consult your research advisor: they have read and interviewed countless researchers and prospective graduate students, and can easily tell you where your application ...
"Hearing" guides "seeing": OmniAgent opens a new paradigm of omni-modal active perception
机器之心· 2026-01-08 09:34
Core Insights
- The article introduces OmniAgent, a proactive perception agent developed by Zhejiang University, Westlake University, and Ant Group, addressing pain points in cross-modal alignment and fine-grained understanding in end-to-end omni-modal models [2][7][19]
- OmniAgent employs an innovative "think-act-observe-reflect" closed-loop mechanism, transitioning from passive response to active inquiry, which enhances its performance on audiovisual understanding tasks [10][19]

Background and Pain Points
- End-to-end omni-modal models face high training costs and challenges in cross-modal feature alignment, leading to subpar performance in fine-grained cross-modal understanding [7]
- Fixed workflow-based agents rely on rigid, human-defined processes, lacking the flexibility to autonomously plan and gather information based on the question [7]

Methodology
- OmniAgent's methodology strategically schedules video and audio understanding capabilities within an iterative reflection loop, effectively overcoming cross-modal alignment challenges [8][15]
- The agent autonomously decides whether to "listen" or "watch" based on its analysis of the question, utilizing a variety of multimodal tools for efficient information retrieval [15]

Performance Results
- OmniAgent achieved state-of-the-art (SOTA) results on multiple audiovisual understanding benchmarks, with an accuracy of 82.71% on the Daily-Omni Benchmark, surpassing Gemini 2.5-Flash (72.7%) and Qwen3-Omni-30B (72.08%) by more than 10 percentage points [13]
- On OmniVideoBench, OmniAgent reached an accuracy of 59.1% on long-video understanding tasks, significantly outperforming Qwen3-Omni-30B (38.4%) [13]

Future Vision
- The design of OmniAgent is highly extensible, allowing for the integration of additional modal tools [19]
- OmniAgent is positioned to assist in generating high-quality COTT data for the development of next-generation omni-modal models capable of self-tool invocation [19]
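A minimal sketch of the "think-act-observe-reflect" loop described above, with "listen" and "watch" as the two tools the agent can schedule. The callable interfaces and the toy stand-ins in the demo are assumptions for illustration; OmniAgent's actual prompts, tools, and stopping criteria are not shown here.

```python
# Illustrative sketch only: a "think-act-observe-reflect" agent loop that chooses
# between a "listen" (audio) tool and a "watch" (video) tool on each iteration.
# All names are hypothetical; the real OmniAgent tool set and prompts differ.
from typing import Callable, Dict, List

def omni_agent_loop(question: str,
                    plan_step: Callable[[str, List[str]], str],   # think: decide next action
                    tools: Dict[str, Callable[[str], str]],       # e.g. {"listen": ..., "watch": ...}
                    reflect: Callable[[str, List[str]], bool],    # reflect: is evidence sufficient?
                    answer: Callable[[str, List[str]], str],
                    max_iters: int = 6) -> str:
    observations: List[str] = []
    for _ in range(max_iters):
        action = plan_step(question, observations)     # think: pick "listen", "watch", ...
        if action not in tools:
            break
        observations.append(tools[action](question))   # act + observe
        if reflect(question, observations):             # reflect: stop once evidence suffices
            break
    return answer(question, observations)

if __name__ == "__main__":
    # Toy stand-ins for the LLM-driven components, just to exercise the loop.
    demo_tools = {"listen": lambda q: "audio: a door slams",
                  "watch": lambda q: "video: a person leaves the room"}
    result = omni_agent_loop(
        question="Why did the person leave?",
        plan_step=lambda q, obs: "listen" if not obs else "watch",
        tools=demo_tools,
        reflect=lambda q, obs: len(obs) >= 2,
        answer=lambda q, obs: " | ".join(obs),
    )
    print(result)
```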
Expanding the "arena boundaries" of the century-old Olympics, Alibaba Cloud AI lets everyone take the field
机器之心· 2026-01-08 09:34
Machine Heart Editorial Team

First, a video for you: can you tell which one is AI-generated? (Video source: TikTok creator @tkp..1001)

"Real footage or AI-generated?" A year ago, this question was still easy to answer, because the details always gave the AI away at a glance. Now, the line between real and fake has become increasingly blurred. More and more "real" videos have comment sections full of people arguing "this is AI, right?", while content that actually is AI-generated gets mistaken for real footage.

AI video generation is evolving at breakneck speed and seeping into every corner of our lives. The question that follows is: how exactly are we supposed to live with these technologies?

The key to this puzzle may lie in human imagination. Technological progress should not stop at replicating reality; it should imagine a better future through creative applications. From this perspective, Alibaba Cloud has offered an imaginative answer: the Milan 2026 Winter Olympics.

With 30 days to go before the Winter Olympics, Alibaba Cloud, as the official cloud services partner, joined forces with the International Olympic Committee and the Milan Winter Olympics organizing committee to do something big: jointly launching a global AIGC competition.

The competition slogan, "YOUR EPIC VIBE", echoes this Winter Olympics' motto "IT's Your Vibe". The rules are simple and blunt: just use Alibaba Cloud's ...
Just now, Zhipu rang the bell and went public, reaching a market cap of HKD 52.8 billion
机器之心· 2026-01-08 02:06
Released by Machine Heart

The "world's first publicly listed large-model company" has arrived. On January 8, 2026, Beijing Zhipu Huazhang Technology Co., Ltd. (02513.HK, hereafter "Zhipu") officially listed on the Hong Kong Stock Exchange.

He recalled that Zhipu introduced its self-developed GLM algorithm architecture in 2021, and that this year's release of GLM-4.7 placed the company among the world's leaders, laying an important foundation for the push toward AGI. "The Z in Zhipu is the last letter of the alphabet and stands for the ultimate destination; on the journey of exploring AGI, we hope to reach the ultimate frontier of intelligence."

Thanks to the unique scarcity of being the "world's first large-model stock", Zhipu attracted an all-star cornerstone investor lineup made up of core Beijing state-owned capital, leading insurance funds, large public mutual funds, star private funds, and industry investors. Eleven cornerstone investors, including JSC International Investment Fund SPC, JinYi Capital Multi-Strategy Fund SPC, and Perseverance Asset Management, subscribed for a combined HKD 2.98 billion.

With foundation models at its core, continually exploring the upper bound of intelligence

Zhipu was one of the earliest Chinese companies to commit to large-model R&D. It originated GLM, a general pre-training paradigm based on autoregressive blank infilling, and was the first in China to release a 10-billion-parameter model, the first open-source 100-billion-parameter model, and the first conversational ...