Doubao
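Escape rooms become AI's new exam hall: pass rates under 50% expose spatial-reasoning weaknesses | Tsinghua, ICCV25
QbitAI· 2025-07-12 04:57
Submitted by the Tsinghua University team. QbitAI | WeChat official account QbitAI. In recent years, multimodal large language models (MLLMs) have advanced rapidly; from image captioning to video understanding, they seem able to do anything. But have you ever wondered whether they have truly "seen" and "reasoned it through"? When facing complex, multi-step visual reasoning tasks, can models reason and make decisions the way humans do? To evaluate how well MLLMs reason through complex tasks in visual environments, a Tsinghua University team, inspired by escape-room games, proposed EscapeCraft: a 3D escape-room environment in which a large model freely explores a 3D room to find props and unlock the exit. The paper has been accepted to ICCV 2025. The EscapeCraft environment: an immersive interactive environment inspired by escape rooms. The research team built EscapeCraft, an automatically generated and flexibly configurable 3D scene in which models act freely: finding keys, opening boxes, cracking passwords, escaping the room ... Each step requires integrating visual, spatial, logical and other multimodal information. Extensible tasks, unlimited applications: EscapeCraft takes escaping the room as the final goal and focuses on evaluating the exploration and decision-making behavior, reasoning paths and so on during the escape. It supports different room styles, prop-chain lengths and difficulty combinations, and can be extended to tasks such as question answering, logical reasoning and narrative reconstruction. It is a highly flexible, continuously iterable general-purpose evaluation platform, and can also serve future agents, multimodal reasoning, reinforcement ...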
Nomura: Global AI Trend Tracker Special_ Broadcom's Tomahawk 6
Nomura· 2025-06-23 02:10
Investment Rating
- The report does not provide a specific investment rating for the companies mentioned, but it highlights potential beneficiaries of Broadcom's Tomahawk 6 launch, indicating a positive outlook for certain players in the AI networking value chain [1][20].
Core Insights
- Broadcom's Tomahawk 6 (TH6) switch chip, launched on June 3, 2025, uses 3nm technology and supports 200G SerDes, increasing network transmission bandwidth and reducing latency. The launch is expected to drive a new technology upgrade cycle in the global AI infrastructure and networking markets, benefiting companies with advanced technologies [1][36].
- The report identifies key players that may benefit from the TH6 upgrade cycle, including Zhongji InnoLight and Suzhou TFC in the optical transceiver market, Shennan Circuits and WUS PCB in PCB/CCL manufacturing, and Unisplendour in AI server/switch manufacturing [1].
- The report discusses the competitive landscape of AI networking, highlighting the shift from Ethernet to NVIDIA's InfiniBand in large-scale AI data centers, while also noting the emergence of Ultra Ethernet specifications aimed at improving communication efficiency for scaling out AI clusters [2][20].
Summary by Sections
AI Networking Overview
- The AI networking market is divided into scale-out and scale-up networks. Ethernet has historically led in traditional data centers but is losing ground to InfiniBand in AI infrastructure deployments [2][11].
- The Ultra Ethernet Consortium has developed specifications to enhance Ethernet's capabilities, aiming to regain momentum in the AI networking space [2][13].
Scale-Up Networking Technologies
- NVIDIA's NVLink and the UALink Consortium's newly developed Ultra Accelerator Link are key technologies for scale-up networking, enabling high-speed interconnections between GPUs and AI accelerators [3][24].
- Broadcom's Scale-Up Ethernet (SUE) aims to provide low-latency, high-bandwidth connectivity for XPU scale-up networks, competing with NVIDIA's NVLink [31][34].
Market Dynamics and Trends
- The global AI application landscape shows strong growth, with OpenAI's ChatGPT leading in daily active users (DAU), reaching 110 million in early June 2025 [4][74].
- In China, ByteDance's Doubao has surpassed DeepSeek in DAU, indicating a competitive generative AI application market [5][77].
Competitive Landscape
- Broadcom's TH6 switch chip offers a switching capacity of 102.4 Tbps, significantly higher than its predecessors, and is designed to support both scale-up and scale-out architectures [36][41].
- Competitors such as Cisco and NVIDIA are also advancing their switch technologies, with Cisco's SiliconOne G200 and NVIDIA's Spectrum series providing strong alternatives in the market [41][42].
Future Outlook
- The report anticipates rapid growth in demand for 1.6T optical modules and data center interconnects, driven by the adoption of Broadcom's TH6 [38].
- The overall switch market is projected to grow, with cloud service providers expected to account for a significant portion of data center switch sales by 2027 [52][53].
Goldman Sachs: China Top AI Apps Tracker - Steady Video-Generation AI Monetization; Solid May User Engagement Trends
Goldman Sachs· 2025-06-19 09:46
17 June 2025 | 11:17AM HKT Navigating China Internet: Top AI/apps tracker: steady video-generation AI monetization; solid May engagement trends China's top AI applications maintained healthy user engagement trends in May with key investor focuses around 1) Rising use cases scenario: increasing AI adoption across to-C/to-B use cases, demonstrated by continued ramp of AI app engagement led by DeepSeek and Bytedance's Doubao, improving penetration of AI functionalities into existing app portfolios of internet ...
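139 on the Gaokao math exam! Xiaomi's 7B model rivals Qwen3-235B and OpenAI o3
Synced (机器之心)· 2025-06-16 05:16
Report by Synced (机器之心), Synced editorial team. The 2025 Gaokao wrapped up the week before last! In the AI field, large models from various vendors took on the math paper. In Synced's test, seven large models scored as follows on the 2025 Mathematics New Curriculum Standard Paper I: Gemini 2.5 Pro scored 145, ranking first; Doubao and DeepSeek R1 followed closely with 144, tied for second; o3 and Qwen3 were just one point apart, ranking third and fourth respectively. Dragged down by the free-response questions, hunyuan-t1-latest and ERNIE X1 Turbo (文心 X1 Turbo) finished in the last two places. In fact, other large models also took on this year's math paper, for example Xiaomi MiMo-VL, a small model with only 7B parameters. The model attempted the same 2025 Mathematics New Curriculum Standard Paper I, and the results show a total score of 139, the same as Qwen3-235B and only one point below OpenAI o3. Moreover, compared with Qwen2.5-VL-7B, a multimodal large model that also has 7B parameters, MiMo-VL scored a full 56 points higher. MiMo-VL-7B and Qwen2.5-VL-7B were evaluated as multimodal models by uploading screenshots of the questions; the rest were all given text input lat ...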
UBS: Chinese Internet Data Centre Sector_Doubao token usage growth_the curve is getting steeper
UBS· 2025-06-16 03:16
Global Research, 12 June 2025. First Read: Chinese Internet Data Centre Sector. Doubao token usage growth: the curve is getting steeper. Doubao LLM daily token usage growth is steepening. According to Volcengine management during the FORCE conference, daily token usage of Doubao reached 16.4tn as of end-May, 137x the level in May 2024. The growth trajectory of daily token usage by Doubao is steepening (Figure 1: Doubao LLM number of average daily token usage (bn)). Doubao 1.6 is the first LLM to support 256k context window in Chi ...
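A knowledge-type perspective on comprehensively evaluating image-editing models' reasoning: every model performs poorly on "procedural reasoning"
QbitAI· 2025-06-13 05:07
Submitted by the KRIS-Bench team. QbitAI | WeChat official account QbitAI. When humans learn new knowledge, they follow a cognitive path from "memorizing facts" to "understanding concepts" to "mastering skills". Has AI also built this kind of knowledge structure: "first memorize the words, then understand the principles, and finally practice applying them"? An evaluation will tell! A research team from Southeast University, together with the Max Planck Institute for Informatics, Shanghai Jiao Tong University, StepFun (阶跃星辰), the University of California, Berkeley and the University of California, Merced, jointly proposed KRIS-Bench (Knowledge-based Reasoning in Image-editing Systems Benchmark). It is the first to evaluate the reasoning ability of image-editing models systematically and at fine granularity from the perspective of knowledge types. Drawing on Bloom's taxonomy and the tiered-teaching ideas of educational psychology, KRIS-Bench has AI take on progressively deeper and more complex editing challenges across three levels: Factual Knowledge, Conceptual Knowledge and Procedural Knowledge. Three knowledge categories based on cognitive layering: within these categories, KRIS-Bench further defines 7 major reasoning dimensions and 22 typical editing tasks, from "object count change" to "chemical reaction prediction ...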
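Gaokao math full-paper rematch! One question stumps every large model; newcomer Gemini takes the crown, Doubao and DeepSeek tie for second
Synced (机器之心)· 2025-06-10 17:56
Report by Synced (机器之心). Editors: Yang Wen (杨文), +0. AI takes on the full set of Gaokao math questions! Picking up where we left off: as soon as the Gaokao math exam ended, we worked through the night with six large-model products, asking questions via screenshots the way an ordinary user would, to attempt the 14 latest Gaokao objective questions. Some readers questioned whether the evaluation was rigorous enough, so this time we added the free-response questions and ran the test again. The contestants this time are Doubao-1.5-thinking-vision-pro, DeepSeek R1, Qwen3-235b, hunyuan-t1-latest, ERNIE X1 Turbo (文心 X1 Turbo) and o3, plus a new addition that readers were eagerly awaiting, Gemini 2.5 pro. Last time we tested through the web interfaces; this time every model except o3 was called through its API. For the questions, we again used the 2025 Mathematics New Curriculum Standard Paper I, which contains 14 objective questions worth 73 points in total and 5 free-response questions worth 77 points in total. Because question 6 involves an image, we set it aside and later evaluated it on the multimodal models by uploading a screenshot of the question. All other text questions were converted to latex format and fed to the models one at a time. As before, we gave no System Prompt guidance, kept web search off, and took the output directly. (Note: question 17 also involves an image, but the text description is clear enough that it does not affect answering, so ...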
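Multimodal models take on the Beijing and Hangzhou subway maps! o3 scores notably well but still trails humans
QbitAI· 2025-06-07 05:02
Submitted by the ReasonMap team. QbitAI | WeChat official account QbitAI. In recent years, large language models (LLMs) and multimodal large language models (MLLMs) have made breakthrough progress on a variety of scene-understanding and complex reasoning tasks. Yet a key question still deserves asking: can MLLMs really "read" an image? In particular, when facing structurally complex, detail-dense images, do they have fine-grained visual understanding and spatial reasoning abilities, for example when challenged with a high-resolution subway map? To find out, a team from Westlake University, the National University of Singapore, Zhejiang University and Huazhong University of Science and Technology proposed a new evaluation benchmark, ReasonMap. Evidently, the Beijing and Hangzhou subway maps stumped a large number of models. ReasonMap is the first multimodal reasoning benchmark focused on high-resolution transit maps (mainly subway maps), designed specifically to evaluate how well large models understand fine-grained, structured spatial information in images. The results show that current mainstream open-source multimodal models hit a clear performance bottleneck on ReasonMap, often suffering visual confusion or missed stations, especially in cross-line route planning. Closed-source reasoning models post-trained with reinforcement learning (such as GPT-o3) significantly outperform existing open-source models on multiple dimensions, but a clear gap to human level remains. On subway maps from different countries and regions, four representative MLLMs (Qwen2.5-VL-72B-I (blue), I ...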
How does DeepSeek's new R1 model actually perform? A third-party evaluation is in
Nan Fang Du Shi Bao· 2025-06-05 12:26
Core Insights
- DeepSeek has released an upgraded version of its R1 model, which shows improved performance compared to its predecessor and surpasses OpenAI's o3 model, although it still lags behind o4-mini(high) and Google's Gemini 2.5 Pro Preview 05-06 [1][2].
Model Performance
- The new R1 model achieved a total score of 63.55, an increase of 1.61 points from the previous version, placing it fourth in the rankings [2].
- The highest score was obtained by o4-mini(high) at 70.51, followed by Gemini 2.5 Pro Preview 05-06 at 66.48 [2].
Reasoning and Instruction Following
- The instruction-following capability of the new R1 model improved significantly, scoring 48.46, which is 17.09 points higher than the old version, but still falls short of international top models like o3 (66.95) and o4-mini(high) (68.07) [4].
- The reasoning task score declined by 1.7 points compared to the old R1 model; the main differences were observed in mathematical and scientific reasoning tasks, while the new model performed better in coding tasks [4].
Reduction in Hallucination Rate
- The updated R1 model has been optimized for "hallucination" issues, with hallucination rates reduced by approximately 45%-50% in tasks such as rewriting, summarization, and reading comprehension [4].
- The hallucination rate for the new R1 model is now 13.86%, a decrease of 7.16 percentage points, although a significant gap remains to the best-performing model, doubao-1.5-pro-32k, which has a hallucination rate of only 4.11% [5].
- The most notable improvements in hallucination rates were observed in text summarization and reading comprehension tasks, with reductions of 9.27% and 14.49%, respectively [5].
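The summary above gives only the new R1 figures and the size of each change, so the old-version values can be back-calculated with simple arithmetic. The sketch below does that, and also converts the 7.16 percentage-point drop in hallucination rate into a relative reduction. The variable names are illustrative assumptions, and the derived values are not reported in the source. The roughly 45%-50% figure in the summary is stated for specific task types, so it need not match the overall ratio computed here.

```python
# Figures quoted in the summary; variable names are illustrative assumptions.
new_total, total_gain = 63.55, 1.61          # total score, gain over old R1
new_instr, instr_gain = 48.46, 17.09         # instruction following, gain
new_halluc, halluc_drop_pp = 13.86, 7.16     # hallucination rate (%), drop in percentage points

# Back-calculate the old-version values (derived here, not stated in the source).
old_total = new_total - total_gain
old_instr = new_instr - instr_gain
old_halluc = new_halluc + halluc_drop_pp

# A percentage-point drop is an absolute difference; the relative
# reduction divides that difference by the old rate.
relative_drop = halluc_drop_pp / old_halluc * 100

print(f"old total score:           {old_total:.2f}")
print(f"old instruction following: {old_instr:.2f}")
print(f"old hallucination rate:    {old_halluc:.2f}%")
print(f"relative reduction:        {relative_drop:.1f}%")
```

Under these inputs the old hallucination rate works out to 21.02%, so a 7.16 percentage-point drop corresponds to a relative reduction of about 34%.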