Large Language Model
Latest ScienceQA Leaderboard Released! New Models from Multiple Companies All Improve Their Scores | xbench Monthly Report
红杉汇· 2025-09-22 00:27
Core Insights
- The latest xbench Leaderboard has been released, showcasing six updated models that have entered the top 10, including GPT-5-high and Qwen3-235B-A22B-Thinking-2507, with scores improving by 3-5 points [1][9][10]
- The dual-track evaluation system continues to track advancements in AGI, and a new question bank for the xbench-DeepSearch set is expected to be released soon [1][2]

Model Performance Summary
- GPT-5-high from OpenAI shows a significant average score increase from 60.8 to 64.4, while maintaining a stable BoN (N=5) score (the avg/BoN aggregation is sketched in code below) [9][12]
- Qwen3-235B-A22B-Thinking-2507 improved its average score from 45.4 to 55, with BoN rising from 66 to 77, indicating substantial gains [9][35]
- Claude Opus 4.1-Extended Thinking increased its average score from 46.6 to 53.2, with a slight BoN increase from 69 to 72 [9]
- Kimi K2 0905 achieved an average score of 51.6, demonstrating a balance between model capability and response speed [9][28]
- GLM-4.5 from ZHIPU scored 48.8 with a BoN of 74, while Hunyuan-T1-20250711 scored 44.4 with a BoN of 63 [9]
- Grok-4 showed a remarkable improvement, reaching a score of 65 and marking it as a state-of-the-art model [9][10]

Evaluation Insights
- The distribution of model scores indicates a narrowing gap among the top performers, with the top five models scoring between 76 and 78 [10]
- Overall performance suggests that gains in model capability are plateauing, with smaller incremental improvements noted across most models [10][12]
- The xbench evaluation mechanism continues to provide real-time updates on model performance, with future rankings expected [2][8]
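For readers unfamiliar with the two numbers reported per model, here is a minimal sketch of how an average score and a Best-of-N (BoN, N=5) score might be aggregated from repeated evaluation runs. This is an illustrative assumption about the metric (best of five runs, taken per task), not xbench's published methodology, and the numbers are toy values.

```python
import statistics

def score_model(per_task_scores: list[list[float]], n: int = 5) -> tuple[float, float]:
    """Aggregate repeated evaluation runs into two headline numbers.

    per_task_scores[t] holds the scores (0-100) of n independent runs on task t.
    Returns (average score, Best-of-N score).
    """
    # Average: mean over tasks of the mean score across the n runs.
    avg = statistics.mean(statistics.mean(runs) for runs in per_task_scores)
    # Best-of-N: for each task, keep the best of the n runs, then average.
    bon = statistics.mean(max(runs) for runs in per_task_scores)
    return avg, bon

# Toy example: 3 tasks, 5 runs each (illustrative numbers only).
runs = [[60, 70, 65, 55, 72], [40, 45, 50, 48, 42], [80, 78, 85, 82, 79]]
avg, bon = score_model(runs)
print(f"avg={avg:.1f}, BoN(5)={bon:.1f}")  # BoN is always >= avg
```

Read this way, a model whose average rises while its BoN stays flat (as reported for GPT-5-high) is getting more consistent rather than extending its best-case ceiling.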
X @The Economist
The Economist· 2025-09-18 15:30
The Emiratis’ carefully calibrated large language model https://t.co/KYKJ4kgPct ...
DeepSeek-R1 on the Cover of Nature: A Welcome Step Toward AI Transparency
36Kr· 2025-09-18 02:02
The value of open-source artificial intelligence (AI) is gaining wider recognition. The DeepSeek-R1 paper has just appeared as the cover article of the authoritative scientific journal Nature, with DeepSeek founder and CEO Liang Wenfeng as the paper's corresponding author. Paper link: https://www.nature.com/articles/s41586-025-09422-z

The research team hypothesized that human-defined reasoning patterns may limit a model's exploration, and that unrestricted reinforcement learning (RL) training can better incentivize the emergence of new reasoning abilities in large language models (LLMs). They showed experimentally that an LLM's reasoning ability can be improved through pure RL, reducing the human input needed to boost performance, and that the resulting model outperforms LLMs trained by conventional methods on tasks such as mathematics, programming competitions, and graduate-level STEM problems.

Since its release, DeepSeek-R1 has been widely praised by developers worldwide; as of publication, it had reached 91.1k stars on GitHub.

In an accompanying perspective article, Carnegie Mellon University assistant professor Daphne Ippolito and her PhD student Yiming Zhang (now an LLM safety and alignment researcher at Anthropic) commented: "DeepSeek-R1 has gone from a powerful but opaque solution-finder ...
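To make the "pure RL" recipe above concrete, here is a minimal sketch of a rule-based reward combined with group-normalized advantages in the style of GRPO, the training algorithm the DeepSeek-R1 report describes. The specific tags, weights, and answer-matching rule below are simplified illustrative assumptions, not the paper's exact reward specification.

```python
import re
import statistics

def reward(completion: str, reference_answer: str) -> float:
    """Rule-based reward: correctness plus a format bonus (simplified).

    R1-Zero-style training scores outputs with verifiable rules rather
    than a learned reward model; the tags and weights here are assumptions.
    """
    r = 0.0
    # Format reward: reasoning should be wrapped in <think>...</think>.
    if re.search(r"<think>.*?</think>", completion, flags=re.DOTALL):
        r += 0.1
    # Accuracy reward: the final answer must match the verifiable reference.
    answer = completion.split("</think>")[-1].strip()
    if answer == reference_answer.strip():
        r += 1.0
    return r

def group_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantage: normalize each sample's reward within its group."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [(x - mu) / sigma for x in rewards]

# One prompt, a group of 4 sampled completions, scored and normalized.
rs = [reward(c, "42") for c in [
    "<think>2*21</think>42", "<think>6*7</think>42", "41", "<think>?</think>43",
]]
print(group_advantages(rs))  # correct, well-formatted samples get positive advantage
```

Because the reward is computed by rules rather than human labels, scaling up RL requires no additional annotation effort, which is the point the paragraph above is making.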
DeepSeek-R1 Makes History: Liang Wenfeng's Paper Lands on the Cover of Nature
Di Yi Cai Jing· 2025-09-17 23:09
Compared with the initial DeepSeek-R1 paper released in January this year, this version discloses more details of the model's training and directly addresses the distillation allegations raised when the model first launched.

DeepSeek-R1 is also the world's first mainstream large language model to undergo peer review. As Nature put it, almost no mainstream large models have yet gone through independent peer review, a gap "finally broken by DeepSeek".

The DeepSeek-R1 reasoning model research paper, completed jointly by the DeepSeek team with Liang Wenfeng as corresponding author, has appeared on the cover of the authoritative international journal Nature. ...
X @The Economist
The Economist· 2025-09-17 18:01
We analysed each speech using OpenAI’s large language model, requesting that it assess how controversial King Charles’s remarks had been in the past three decades. This is what the results showed https://t.co/vCu2vKkDdu ...
With 100 Rounds of Tool Calls, Even an 8B Model Can Do Complex Long-Horizon Search! New Open-Source Release from MiniMax & HKUST
量子位· 2025-09-12 08:46
Buyuan, from Aofei Temple
QbitAI | WeChat official account QbitAI

Your web-search agent underperforms, you force-feed it a pile of data, and it performs just the same. What's going on?

The HKUST & MiniMax team points to the core of the problem: it is not that the model has too few parameters, but that sufficiently challenging training data is lacking. In other words, stop rote memorization and practice on some "real exam questions" instead.

They propose WebExplorer, a method for constructing high-quality QA pairs. Training on datasets built with this method lets even smaller models beat larger ones on complex, long-horizon search tasks.

The trained 8B model supports long-horizon reasoning with up to 128K of context and 100 tool-call rounds (a schematic loop is sketched below), achieving top results among models under 10B parameters.

As one commenter put it: model-driven exploration really does make an agent's browsing behavior more flexible than traditional graph-based methods.

Both the model and the dataset are open source; links are at the end of the original article.

High-Quality Training Data Is Scarce

With the rapid development of large language models (LLMs), the capability boundary of agents keeps expanding. Web-search agents, a key part of this development, can autonomously retrieve information from a wide range of online resources; long-horizon web agents, in particular, must carry out complex reasoning and search across multiple websites.

Yet existing open-source web agents often perform poorly on complex search tasks, while the more capable commercial models lack transparent training details ...
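To make the "100 tool-call rounds" budget concrete, here is a minimal sketch of a capped long-horizon search loop. The llm() interface, tool names, and stopping convention are hypothetical stand-ins, not WebExplorer's actual API; only the round cap comes from the article.

```python
MAX_TOOL_ROUNDS = 100  # the reported cap on tool-call rounds

def long_horizon_search(question: str, llm, tools: dict) -> str:
    """Iterative search: the model alternates reasoning and tool calls
    until it emits a final answer or exhausts its round budget."""
    history = [{"role": "user", "content": question}]
    for _ in range(MAX_TOOL_ROUNDS):
        # Hypothetical: llm() returns {"tool": ..., "args": ...} or {"answer": ...}.
        step = llm(history)
        if "answer" in step:
            return step["answer"]
        # Execute the requested tool (e.g. search / browse) and feed the result back.
        result = tools[step["tool"]](**step["args"])
        history.append({"role": "tool", "content": result})
    return "No answer found within the round budget."
```

The 128K context window matters here because every tool result is appended to the history: a hundred rounds of search snippets only fit if the model can keep that much context in view.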
Alibaba's Tongyi Qianwen Releases Its Largest Model to Date: Qwen3-Max-Preview
Xin Lang Cai Jing· 2025-09-05 16:40
Core Insights
- Alibaba's Tongyi Qianwen has launched its largest model to date, Qwen3-Max-Preview, with a parameter count of 1 trillion [1]
- The new model shows significant improvements in understanding both Chinese and English, following complex instructions, and invoking tools [1]
- Qwen3-Max-Preview also significantly reduces instances of knowledge hallucination [1]
神州泰岳 (300002.SZ) Has Not Yet Deployed Grok 2.5 Privately
Ge Long Hui· 2025-09-03 09:00
Core Insights
- The company has integrated multiple product lines with general-purpose large models such as DeepSeek, through online API interfaces and private deployment of open-source models, to serve various customer application scenarios [1]

Group 1
- Multiple business lines and products have successfully connected to DeepSeek [1]
- The company has not yet carried out a private (on-premises) deployment of Grok 2.5 [1]
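As a sketch of what "connecting product lines to DeepSeek through an online API" typically looks like, assuming DeepSeek's OpenAI-compatible endpoint as described in its public documentation; this is illustrative and says nothing about the company's actual integration code.

```python
from openai import OpenAI

# DeepSeek exposes an OpenAI-compatible API; base URL per its public docs.
client = OpenAI(api_key="YOUR_API_KEY", base_url="https://api.deepseek.com")

response = client.chat.completions.create(
    model="deepseek-chat",
    messages=[{"role": "user", "content": "Summarize this support ticket ..."}],
)
print(response.choices[0].message.content)
```

A private deployment, by contrast, would replace the hosted base_url with a self-hosted inference endpoint serving the open-source weights.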
X @Avi Chawla
Avi Chawla· 2025-09-03 06:31
Core Technologies
- Tool Calling enables Large Language Models (LLMs) to determine which actions to take [1]
- MCP (Model Context Protocol) infrastructure makes tools reliable, discoverable, and executable [1]
- Tool Calling requests can be routed through the MCP (sketched below) [1]
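A minimal sketch of the flow described above: the LLM decides which tool to call, and the call is routed through an MCP-style layer that handles discovery and execution. The class and method names here are hypothetical stand-ins for illustration, not the actual MCP SDK.

```python
from typing import Any, Callable

class MCPRegistry:
    """Hypothetical stand-in for an MCP server: makes tools discoverable
    and executable behind one uniform interface."""

    def __init__(self) -> None:
        self._tools: dict[str, Callable[..., Any]] = {}

    def register(self, name: str, fn: Callable[..., Any]) -> None:
        self._tools[name] = fn

    def list_tools(self) -> list[str]:                 # discoverability
        return sorted(self._tools)

    def call(self, name: str, **kwargs: Any) -> Any:   # executability
        return self._tools[name](**kwargs)

mcp = MCPRegistry()
mcp.register("get_weather", lambda city: f"Sunny in {city}")

# The LLM decides *which* tool to call; the MCP layer routes and runs it.
llm_decision = {"tool": "get_weather", "args": {"city": "Abu Dhabi"}}
print(mcp.call(llm_decision["tool"], **llm_decision["args"]))
```

The division of labor is the point: the model only emits a structured decision, while reliability concerns (registration, lookup, execution) live in the protocol layer.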
The Design Philosophy of Claude Code: Keep Things Simple
Founder Park· 2025-08-31 02:06
Core Insights
- The article attributes Claude Code's effectiveness to its simplicity in design and functionality, contrasting it with other AI assistants that compete on feature count [2][6][33]

Group 1: Design Philosophy
- Claude Code adopts an extremely minimalist approach: a single main loop and a clear set of tools, which lets a low-cost small model handle 80% of tasks [2][4][14]
- The system manages its own task list and marks progress autonomously, which improves the user experience by reducing the need for manual input (a schematic loop is sketched at the end of this summary) [2][11][27]
- A context file (claude.md) is crucial for remembering user preferences and coding habits, significantly improving interaction quality [19][20]

Group 2: Model Utilization
- Over 50% of the important LLM calls in Claude Code go to the smaller Haiku model, which is cost-effective and sufficient for most tasks, cutting operational costs by 70-80% [17][18]
- The article suggests that using smaller models for the majority of tasks simplifies the system and improves performance [17][18]

Group 3: Prompt Engineering
- Claude Code's prompts are highly detailed, at around 2,800 tokens for the system prompt and 9,400 tokens for tool descriptions, serving as comprehensive guidelines for the model [18][22]
- The article highlights the use of XML tags and Markdown to organize prompts effectively, which enhances clarity and usability [21][22]

Group 4: Task Management
- Maintaining a to-do list autonomously helps prevent context decay over long sessions, keeping the model focused on its tasks [27]
- The article critiques the multi-agent approach, advocating a single-agent system that manages tasks efficiently without the added complexity [15][27]

Group 5: Tool Design
- Claude Code mixes low-level and high-level tools, allowing flexibility in task execution while keeping tool usage clear [24][25]
- Detailed tool descriptions and examples are important for guiding the model in its operations [25][26]

Group 6: Overall Takeaway
- The primary lesson from Claude Code's design is to keep things simple: complexity hurts performance and makes debugging harder [33]
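A minimal sketch of the single-main-loop pattern the article attributes to Claude Code, with a self-maintained to-do list. Everything here (the plan/act interfaces, the data shapes) is a reconstruction from the article's description for illustration, not Anthropic's implementation.

```python
def run_agent(goal: str, plan, act, tools: dict) -> None:
    """One flat main loop, no sub-agents: plan once, then work the
    to-do list, with the agent marking its own progress."""
    todos: list[str] = plan(goal)  # hypothetical LLM call -> ["step 1", "step 2", ...]
    while todos:
        task = todos.pop(0)
        # Hypothetical LLM call -> {"tool": ..., "args": ..., "new_todos": [...]}.
        # Per the article, most such calls could go to a cheap Haiku-class model.
        step = act(task, list(tools))
        if step.get("tool") in tools:
            tools[step["tool"]](**step["args"])
        todos.extend(step.get("new_todos", []))  # the list is self-maintained
```

The to-do list doubles as externalized memory: because pending work lives in a plain list rather than in ever-growing conversation context, the loop resists the context decay the article warns about.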