Farewell, Human Champions: AI Sweeps the Astronomy Olympiad, with GPT-5 Scoring 2.7× Higher Than Gold Medalists
36Kr· 2025-10-12 23:57
AI has claimed another international Olympiad gold medal! In the International Olympiad on Astronomy and Astrophysics (IOAA), GPT-5 and Gemini 2.5 Pro decisively outperformed human contestants, taking the top scores in the theory and data-analysis exams. After the IMO and IOI, AI has won another Olympiad title. In the IOAA test, GPT-5 and Gemini 2.5 Pro both reached gold-medal level: on the theory exam, Gemini 2.5 Pro scored 85.6% overall and GPT-5 scored 84.2%; on the data-analysis exam, GPT-5 scored 88.5% overall and Gemini 2.5 Pro scored 75.7%.

| Model | Theory: Easy | Theory: Medium | Theory: Hard | Theory: Extra Hard | Theory: Overall (Mean ± SD) | Data Analysis: Easy | Data Analysis: Medium | Data Analysis: Hard | Data Analysis: Overall (Mean ± SD) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-5 | 84 … | | | | | | | | |
From Components to Systems: How Should Agent Evaluation Be Done?
机器之心· 2025-10-12 01:27
--- This week we unpack 2 noteworthy developments in AI & Robotics --- 1. From components to systems: how should Agent evaluation be done? Why do Agents need new evaluation benchmarks? How does an Agent's positioning fundamentally differ from an LLM's? How are Agent evaluation paradigms evolving? How does the GAIA series push the boundaries of Agent evaluation? What design philosophies do MCP-universe, MCPMark, and MCP-AgentBench reflect? ... 2. After CoT, how does CoF turn inter-frame logic from "implicit alignment" into "explicit reasoning"? Is CoT merely a "surface narrative of language" rather than genuine reasoning? How does CoF translate the "chain of thought in language" into a "chain of frames in video"? Why is CoF regarded as a possible "new paradigm" for video generation models, and what advantages does it hold over traditional inter-frame consistency optimization? From CoF-Data to VChain, how do researchers embed the "reasoning chain" into every frame? Before CoF, how did video models maintain "inter-frame consistency"? ... The full newsletter covers 2 in-depth topics plus 34 AI & Robotics news briefs from this week: 13 on technology, 7 domestic, and 14 international. 机器之心P …
Cracking the "Bigger Scale, Lower Efficiency" Dilemma of MoE Models: A New Framework from the Institute of Automation, Chinese Academy of Sciences
量子位· 2025-10-11 01:15
Contributed by the team at the Institute of Automation, Chinese Academy of Sciences (CASIA) | 量子位 QbitAI

Large-model parameter counts have soared to the hundred-billion and trillion scale, yet models are stuck in a "bigger scale, lower efficiency" dilemma. New research from CASIA offers a way out: for the first time, MoE experts leave "static isolation" behind and begin dynamic "team learning."

As LLM parameter scales keep expanding, a core challenge has emerged: model-size growth and compute-efficiency optimization are hard to advance together. Mixture-of-Experts (MoE), a sparsely activated architecture, offers a theoretically attractive route to continued scaling; it is the core path by which LLMs expand parameters while compute costs grow only linearly. Yet MoE has long been trapped in a "trilemma" of load imbalance, parameter redundancy, and communication overhead, a major bottleneck for real-world deployment of large models.

Through dynamic regrouping of expert clusters, the CASIA team cut total parameters by 80%, reduced load variance to one third of the original, and brought memory consumption close to that of a lightweight dense model, achieving simultaneous gains in communication latency, load balancing, and memory footprint and opening a new path to low-cost deployment of large-parameter LLMs.

A closer look: a unified framework aimed at MoE's underlying mechanics. Existing fixes fall short. For example, the load-balancing loss function is a passive compensation mechanism, and parameter-compression techniques (such as MoE-Lite), while reducing parameters, treat experts as independent entities, ignoring their …
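The load-imbalance problem at the heart of the "trilemma" above is easy to see in a minimal top-k MoE router. This is a generic sketch in NumPy, not the CASIA framework; all function and variable names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def top_k_gate(x, w_gate, k=2):
    """Generic top-k MoE gating: route each token to its k highest-scoring experts."""
    logits = x @ w_gate                        # (tokens, experts) gating scores
    topk = np.argsort(logits, axis=1)[:, -k:]  # indices of the k chosen experts
    return topk

tokens, d_model, n_experts = 512, 64, 8
x = rng.standard_normal((tokens, d_model))
w_gate = rng.standard_normal((d_model, n_experts))

assignments = top_k_gate(x, w_gate)
# Count how many token slots land on each expert; with a random gate this is
# typically uneven, which is exactly the load-variance problem the paper targets.
load = np.bincount(assignments.ravel(), minlength=n_experts)
print("per-expert load:", load)
print("load variance:", load.var())
```

Auxiliary load-balancing losses push this variance down only indirectly, which is why the article calls them a "passive compensation mechanism."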
CICC | Large Model Series (4): Dynamic LLM Model Configuration
中金点睛· 2025-09-23 00:14
Core Viewpoint
- The article emphasizes the importance of dynamic strategy configuration in quantitative investing, highlighting the limitations of traditional models and proposing a new framework based on large language models (LLM) for better adaptability to changing market conditions [2][3][5].

Group 1: Evolution of Quantitative Investing
- Over the past decade, quantitative investing in the A-share market has evolved significantly, driven by the search for "Alpha factors" that can predict stock returns [5].
- The rapid increase in the number of Alpha factors does not directly translate to improved returns, due to the quick decay of Alpha and the homogenization of factors across institutions [5][12].

Group 2: Challenges in Factor Combination
- Different factor combination models exhibit significant performance differences across market phases, making it difficult to find a single model that performs optimally in all conditions [12].
- Traditional models, such as mean-variance optimization, are sensitive to input parameters, leading to instability in performance [14][15].
- Machine learning models, while powerful, often suffer from a "black box" problem, making it hard for fund managers to trust their decisions at critical moments [16][18].

Group 3: Proposed LLM-Based Framework
- The proposed "Judgment-Inference Framework" consists of three layers: training, analysis, and decision-making [2][3][19].
- **Training Layer**: Runs a diverse set of selected Alpha models to create a robust strategy library [22].
- **Analysis Layer**: Conducts automated performance analysis of the models and generates structured performance reports based on market conditions [24][27].
- **Decision Layer**: Uses an LLM to integrate information from the analysis layer and make informed weight-allocation decisions [28][31].

Group 4: Empirical Results
- Backtesting on the CSI 300 index shows that the LLM-based dynamic strategy configuration can achieve an annualized excess return of 7.21%, outperforming equal-weighted and single-model benchmarks [3][41].
- The LLM dynamic combination exhibited a maximum drawdown of -9.47%, lower than all benchmark models, indicating effective risk management [44].

Group 5: Future Enhancements
- The framework can be further optimized by expanding the base model library to include more diverse strategies and by enriching the market-state dimensions with macroeconomic and sentiment indicators [46].
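The three-layer loop can be sketched end to end in a few lines. Everything below is hypothetical scaffolding, not CICC's implementation: the function names are invented, and a softmax-style weighting stands in for the LLM decision step.

```python
import math

def training_layer(models, market_data):
    # Training layer: run every base Alpha model to build a strategy library.
    return {name: model(market_data) for name, model in models.items()}

def analysis_layer(signals):
    # Analysis layer: condense each model's raw signals into a structured report.
    return {name: {"mean_return": sum(s) / len(s)} for name, s in signals.items()}

def decision_layer(report):
    # Decision layer: an LLM would read the report here; this softmax-style
    # stand-in simply overweights the better-performing models.
    scores = {n: math.exp(r["mean_return"]) for n, r in report.items()}
    total = sum(scores.values())
    return {n: s / total for n, s in scores.items()}

# Toy base models standing in for an Alpha-factor strategy library.
models = {
    "momentum": lambda d: [0.02, 0.01, 0.03],
    "value":    lambda d: [0.00, 0.01, -0.01],
}
weights = decision_layer(analysis_layer(training_layer(models, market_data=None)))
print(weights)
```

The point of the structure is separation of concerns: the decision layer only ever sees the structured report, so the LLM (or any other allocator) can be swapped out without touching the strategy library.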
Which Diseases Will You Have in 20 Years? This AI Model, Published in Nature, Can Predict the Risk of Over 1,000 Diseases
生物世界· 2025-09-19 04:04
Written by Wang Cong | Edited by Wang Duoyu | Layout by Shui Chengwen

Which diseases will you develop 20 years from now? This seemingly unanswerable question may now have an answer: an AI model called Delphi-2M can assess a person's risk for more than 1,000 diseases by analyzing their medical records and lifestyle, making accurate predictions even decades in advance.

The study was published on September 17, 2025 in the leading journal Nature, under the title "Learning the natural history of human disease with generative transformers," by teams from the German Cancer Research Center (DKFZ), EMBL's European Bioinformatics Institute (EMBL-EBI), and the University of Copenhagen.

Delphi-2M can simulate and predict many diseases in a single pass: using health records and lifestyle factors, it estimates a person's probability of developing up to 1,258 diseases, including cancers, skin conditions, and immune disorders, over the next 20 years, generating a complete future health trajectory that helps doctors and health planners understand and respond to personalized health needs. For most diseases (including dementia, cardiovascular disease, and …
DeepSeek Team Publishes Landmark Paper; Nature Runs a Glowing Editorial Urging Peers to Follow Suit
Yangzi Wanbao Wang· 2025-09-18 13:19
Group 1
- The DeepSeek-R1 inference model research paper has been published on the cover of the prestigious journal Nature, making it the first mainstream large language model (LLM) to undergo peer review, a significant milestone for AI model development [2][4]
- The paper reveals more details about the model's training than the initial version released in January, showing that the reasoning capabilities of LLMs can be enhanced through pure reinforcement learning, reducing the human input required for performance improvement [2][9]
- Since its release in January, DeepSeek-R1 has become the platform's most-downloaded product for solving complex problems, and it was evaluated by eight experts on originality, methodology, and robustness [9]

Group 2
- Nature's editorial emphasizes the importance of peer review for AI models, noting that almost no mainstream large model had undergone independent peer review until DeepSeek broke this gap [4][6]
- Peer review helps clarify how LLMs work and assess whether they truly achieve their claimed functionality, which is particularly crucial given the significant implications and potential risks of LLMs [6][10]
- The editorial calls on other AI companies to follow DeepSeek's example, suggesting that if the practice becomes a trend, it could greatly promote the healthy development of the AI industry [10]
Connecting the World: Tencent Cloud's Overseas Customer Base Doubles in One Year
Sohu Caijing· 2025-09-16 23:18
Core Insights
- Tencent Cloud's international business has become a new growth engine, with significant year-on-year revenue growth in Q2 2025 and a doubling of its overseas customer base over the past year [1][2]
- Over the past three years, Tencent Cloud's international business has maintained high double-digit growth, with over 90% of internet companies and more than 95% of leading gaming companies choosing Tencent Cloud for their international expansion [1]
- Tencent Cloud has launched innovative products like EdgeOne Pages, which integrates large language models with its MCP Server, enabling users to deploy complete e-commerce pages in minutes and helping over 100,000 users enter global markets within three months [1]

International Expansion
- Tencent Cloud's overseas customer base has doubled in the past year, covering over 80 countries and regions, supported by competitive products and localized service networks [2]
- The company plans to add new data centers in Saudi Arabia and Osaka, enhancing an infrastructure of over 3,200 global acceleration nodes across 21 regions to provide fast, stable, and reliable services [2]
- Tencent Cloud's internationalization efforts are deepening, with partnerships established with well-known international companies such as GoTo Group, Charoen Pokphand Group, e&UAE, Orange, and Com2uS [2]
Goodbye to Error Accumulation and Noise Interference: EviNote-RAG Opens a New RAG Paradigm
机器之心· 2025-09-12 00:51
The first author of this paper is Dai Yuqin, a PhD student at Tsinghua University. The work was completed during an internship at Ant Group's security division and belongs to Ant Group's Venus series, which aims to build search agents and UI agents. The corresponding author is Lyu Shuai, an associate professor at Tsinghua University whose research covers large language models, multimodal generation, and AI4Design. Co-corresponding author Shen Yongliang is a "Hundred Talents Program" researcher and doctoral advisor at Zhejiang University, working on large-model reasoning, retrieval-augmented generation (RAG), and multimodal generative models.

As retrieval-augmented generation (RAG) advances rapidly, the biggest obstacle researchers face is not "generation" but "stability." A low signal-to-noise ratio buries key information in redundant documents, and error accumulation makes reasoning chains topple like dominoes. These two stubborn problems keep existing RAG systems from being truly reliable on complex tasks.

Recently, a study jointly completed by Ant Group, Tsinghua University, Zhejiang University, MIT, UC Berkeley, the University of Hong Kong, the National University of Singapore, and other institutions proposed a new approach: EviNote-RAG. It delivers significant performance gains on multiple authoritative benchmarks, along with a qualitative leap in training stability and reasoning reliability.

The secret lies in two innovations: … The change this combination brings is transformative: training curves no longer oscillate, and answer reasoning becomes more robust. Ablation and supplementary experiments further confirm this: SEN is the foundation of the performance gains, while EQ …
Tackling AI Reasoning Challenges, a Tsinghua Team Proposes ReST-RL, a "Unified LLM Reinforcement Learning Paradigm"
36Kr· 2025-09-10 09:53
Core Insights
- The article discusses the ongoing industry debate over the reasoning capabilities of large language models (LLMs), highlighting their frequent failures on complex tasks and the challenges of improving their reasoning [1][3].

Group 1: Current Challenges in LLMs
- Existing LLMs struggle with complex code, multi-step logic, and abstract tasks, often producing logical errors and irrelevant responses [1].
- Current reinforcement learning (RL) methods, such as online RL and self-training, show potential for enhancing LLM reasoning but face limits on training efficiency and data-collection cost [3][4].
- The reliance on high-quality labeled data for training process reward models (PRMs) restricts the scalability and reliability of these methods [4].

Group 2: Introduction of ReST-RL
- Tsinghua University's KEG team proposed a new RL paradigm called ReST-RL, which combines an improved GRPO algorithm with a value-model (VM) assisted decoding method to enhance LLM reasoning while maintaining efficiency and scalability [1][5].
- ReST-RL consists of two main components: ReST-GRPO, which optimizes the training process, and VM-MCTS, which aids decoding at test time [5][9].

Group 3: Performance and Validation
- Experimental results indicate that ReST-RL outperforms other RL baselines and decoding methods across various programming benchmarks, demonstrating significant potential for enhancing LLM reasoning [2][10].
- ReST-GRPO improves training efficiency over the original GRPO and DAPO, while VM-MCTS shows superior accuracy on validation tasks [10].

Group 4: Limitations and Future Directions
- Despite the promising results, ReST-RL has not been validated on tasks beyond code reasoning, such as mathematical or commonsense reasoning, indicating a need for further research [13][14].
- The accuracy of the value model on out-of-domain tasks remains underexplored, suggesting that future work should focus on its generalization across a broader range of tasks [14].
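Since ReST-GRPO builds on GRPO, it helps to see GRPO's core trick, group-relative advantage normalization, in isolation. This is a generic sketch of that one step using only the standard library; the paper's ReST-specific improvements and the VM-MCTS decoder are not reproduced here.

```python
import statistics

def group_relative_advantages(rewards):
    """GRPO-style advantages: normalize each sampled completion's reward
    against the mean and standard deviation of its own sampling group."""
    mu = statistics.fmean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard against an all-equal group
    return [(r - mu) / sigma for r in rewards]

# Four completions sampled for one prompt, each scored with a scalar reward.
adv = group_relative_advantages([1.0, 0.0, 0.5, 0.5])
print(adv)  # above-average samples get positive advantage, below-average negative
```

Because advantages are computed relative to the group rather than to a learned critic, no separate value network is needed during training, which is where GRPO's efficiency over PPO-style methods comes from.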
ICLR 2026's Strictest-Ever Rule: Undisclosed LLM Use in a Paper Means Desk Rejection
36Kr· 2025-08-29 03:23
Core Points
- ICLR 2026 has introduced strict regulations on the use of Large Language Models (LLMs) in paper writing and reviewing, requiring explicit acknowledgment of LLM usage [1][15][16]
- The new policies aim to ensure accountability among authors and reviewers, mandating that they take full responsibility for their contributions [16][20]

Group 1: New Regulations
- The ICLR 2026 committee has established two main policies on LLM usage: all LLM usage must be clearly stated, and authors and reviewers must be accountable for their contributions [15][16]
- The policies are in line with ICLR's ethical guidelines, which emphasize the importance of acknowledging all research contributions [15][16]
- Violations of these policies will result in immediate rejection of submissions, reflecting the committee's commitment to maintaining ethical standards [17]

Group 2: Submission Details
- The submission deadlines for ICLR 2026 are set, with the abstract deadline on September 19, 2025 and the paper deadline on September 24, 2025 [9]
- Submissions to ICLR 2025 totaled 11,565, with a 32.08% acceptance rate, indicating a growing trend in submissions [3][5]

Group 3: Ethical Concerns
- There have been instances of authors using hidden prompts to manipulate reviewer feedback, which is considered a serious ethical violation [21][24]
- The committee has highlighted the potential risks of LLMs, including the possibility of generating false information or breaching confidentiality [20][24]

Group 4: AI in Review Process
- In a trial of LLMs in the review process, AI suggestions were adopted in 12,222 instances, and 26.6% of reviewers updated their evaluations based on AI feedback [29][32]
- Integrating LLMs has been shown to improve review quality and increase engagement during the rebuttal phase [32][34]