量子位

 Zhipu's luck is just a little off: its visual-token research has collided with DeepSeek's again
 量子位· 2025-10-22 15:27
Core Viewpoint - The article discusses the competition between Zhipu and DeepSeek in the AI field, focusing on the release of Zhipu's visual token solution, Glyph, which aims to address the challenges of long context in large language models (LLMs) [1][2][6].

Group 1: Context Expansion Challenges
- Demand for long context in LLMs is increasing due to applications such as document analysis and multi-turn dialogue [8].
- Expanding context length significantly increases computational cost; for instance, increasing context from 50K to 100K tokens can quadruple the compute consumed [9][10].
- Merely adding more tokens does not guarantee better model performance, as excessive input can lead to noise interference and information overload [12][14].

Group 2: Existing Solutions
Three mainstream solutions to the long context problem are identified:
1. **Extended Position Encoding**: This method extends the existing position encoding range to accommodate longer inputs without retraining the model [15][16].
2. **Attention Mechanism Modification**: Techniques like sparse and linear attention aim to improve token processing efficiency, but do not reduce the total token count [20][21].
3. **Retrieval-Augmented Generation (RAG)**: This approach uses external retrieval to shorten inputs, but may slow down overall response time [22][23].

Group 3: Glyph Framework
- Glyph proposes a new paradigm: converting long texts into images, which gives higher information density and lets visual language models (VLMs) process them efficiently [25][26].
- By using visual tokens, Glyph can significantly reduce the number of tokens needed; for example, it can represent the entire text of "Jane Eyre" using only 80K visual tokens compared to 240K text tokens [32][36].
- The training process for Glyph involves three stages: continual pre-training, LLM-driven rendering search, and post-training, which collectively enhance the model's ability to interpret visual information [37][44].

Group 4: Performance and Results
- Glyph achieves a token compression rate of 3-4 times while maintaining accuracy comparable to mainstream models [49].
- Glyph delivers approximately four times faster prefill and decoding speeds, as well as two times faster supervised fine-tuning (SFT) training [51].
- Glyph demonstrates strong performance in multimodal tasks, indicating robust generalization capabilities [53].

Group 5: Contributors and Future Implications
- The primary author of the paper is Jiale Cheng, a PhD student at Tsinghua University, with contributions from Yusen Liu, Xinyu Zhang, and Yulin Fei [57][62].
- The article suggests that visual tokens may redefine how LLMs process information, potentially leading to pixels replacing text as the fundamental unit of AI input [76][78].
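The arithmetic behind the figures quoted for Glyph (the 50K to 100K cost increase and the "Jane Eyre" example) can be sketched in a few lines of Python. This is a minimal illustration, not code from the Glyph paper: the function names are hypothetical, the quadratic cost model is an assumption that only covers standard full self-attention, and the 128K context window in the last call is an assumed example value that does not come from the article.

```python
def attention_cost_ratio(old_len: int, new_len: int) -> float:
    """Relative self-attention cost, ASSUMING cost grows with length squared."""
    return (new_len / old_len) ** 2


def compression_ratio(text_tokens: int, visual_tokens: int) -> float:
    """Average number of text tokens represented by one visual token."""
    return text_tokens / visual_tokens


def effective_text_capacity(context_window: int, ratio: float) -> int:
    """Text tokens that fit in a window of visual tokens at a given ratio."""
    return int(context_window * ratio)


if __name__ == "__main__":
    # Doubling the context from 50K to 100K tokens under the quadratic assumption.
    print(attention_cost_ratio(50_000, 100_000))   # 4.0
    # "Jane Eyre": 240K text tokens versus 80K visual tokens (figures from the article).
    print(compression_ratio(240_000, 80_000))      # 3.0
    # Hypothetical 128K-token window filled with visual tokens at 3x compression.
    print(effective_text_capacity(128_000, 3.0))   # 384000
```

Under these assumptions, the "Jane Eyre" figures correspond to a 3x compression, which sits at the lower end of the 3-4x range reported in the summary.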
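 Tsinghua and NVIDIA build a new distillation paradigm for diffusion models! Video generation is 50x faster, with 4-step output and no clipping artifacts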
 量子位· 2025-10-22 09:12
Core Insights - The article discusses a new distillation paradigm called rCM that speeds up video generation by up to 50 times while maintaining high quality and diversity in the generated content [4][20][33].

Group 1: Introduction of rCM
- rCM is a novel large-scale diffusion model distillation paradigm developed by Tsinghua University and NVIDIA, which successfully extends continuous-time consistency distillation to billion-parameter models [5][9].
- The method addresses bottlenecks in existing approaches, particularly in real-world applications involving large-scale text-to-image and text-to-video models [3][9].

Group 2: Technical Innovations
- The rCM framework introduces a forward-reverse divergence joint optimization approach, which enhances inference speed while ensuring high-quality and diverse generation results [4][11].
- By utilizing self-developed FlashAttention-2 JVP CUDA operators and compatible distributed training strategies, rCM successfully applies continuous-time consistency distillation to leading models like Cosmos and Wan2.1 [13][18].

Group 3: Performance Metrics
- rCM performs strongly across various large-scale text-to-image and text-to-video tasks, compressing the sampling process from hundreds of steps to 1-4 steps and achieving a speedup of 15-50 times [20][21].
- In evaluations, the rCM model matches or even surpasses the performance of teacher models that require hundreds of sampling steps [21][25].

Group 4: Quality and Diversity
- The rCM model addresses the quality shortcomings of previous models by incorporating reverse divergence as a regularization term, allowing it to maintain high diversity while improving quality [19][22].
- Compared to previous state-of-the-art distillation methods, rCM exhibits significantly higher diversity in generated video content, effectively avoiding "mode collapse" issues [25][31].

Group 5: Future Applications
- rCM is expected to be widely applied in NVIDIA's Cosmos series of world models, indicating its potential for broader industry adoption [34].
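 KTransformers accepted at a top computer-systems conference and partnering with a mainstream framework: 趋境科技 (Approaching.AI) and Tsinghua make "heterogeneous" the new inference paradigm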
 量子位· 2025-10-22 09:12
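允中, from 凹非寺. 量子位 | Official account: QbitAI

Amid the rapid evolution of global AI infrastructure, an open-source project born in China is being noticed by the world.

That project is KTransformers, jointly developed by 趋境科技 (Approaching.AI) and Tsinghua University's KVCache.AI team, with a focus on system-level innovation for the inference stage of large models.

It is a high-performance heterogeneous inference framework that concentrates on making efficient use of diverse underlying compute such as GPUs, CPUs, and memory, so that large models run efficiently on less compute and on more flexible hardware architectures. The project paper, "KTransformers: Unleashing the Full Potential of CPU/GPU Hybrid Inference for MoE Models", was accepted at the recently concluded SOSP 2025, known as the "Oscars of computer systems".

SOSP is one of the most influential international conferences in computer systems. Over the past decades, countless milestone results, from virtualization to distributed file systems, made their first appearance there. Now KTransformers has also earned the highest endorsement of the global systems research community on that stage.

At almost the same time, KTransformers announced a collaboration with the mainstream inference framework SGLang, with the two architectures merged into the same branch. The collaboration means the convergence of full-GPU inference and heterogeneous inference, pushing large-model inference architecture to become more ...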
 Registration for the annual AI rankings is now open! Five awards, in search of the pioneers of the AI+ era
 量子位· 2025-10-22 09:12
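The organizing committee, from 凹非寺. 量子位 | Official account: QbitAI

To let more practitioners feel the leap of the intelligence wave, and to offer applause and encouragement to more peers and fellow travelers, we are officially opening registration for the "2025 Annual AI Rankings".

The selection covers three dimensions (companies, products, and people) and sets up five awards. Companies are welcome to apply! Let us witness the stars of the year together and light the way forward.

Company list:
- 2025 AI Leading Company of the Year
- 2025 AI Promising Startup of the Year

Product list:
- 2025 AI Outstanding Product of the Year
- 2025 AI Outstanding Solution of the Year

People list:
- 2025 AI Person in Focus of the Year

Detailed selection criteria and registration methods are as follows.

2025 AI Leading Company of the Year

Open to China's AI field, this award selects the companies with the strongest overall capabilities.

Eligibility:
1. Registered in China, or with a main business primarily serving the Chinese market;
2. Main business belongs to AI and related industries, or AI is already widely applied in the main business, with an industry-leading position in its segment;
3. Has mature products or services that have gained real customer adoption and market recognition;
4. In the past year, in technology ...

2025 AI Promising Startup of the Year

Focused on innovative and entrepreneurial forces in China's AI field, this award selects the AI startups with the greatest investment value and growth potential.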
 Tencent open-sources Hunyuan World Model 1.1: video turns into a 3D world in seconds, and single-GPU inference takes just 1 second
 量子位· 2025-10-22 09:12
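允中, from 凹非寺. 量子位 | Official account: QbitAI

Tencent's Hunyuan world model has another big release! The company has just published and open-sourced Hunyuan World Model 1.1 (WorldMirror), a truly unified, end-to-end foundation model for 3D reconstruction.

For the first time, it lets users generate a 3D world from multi-view images or video with one click, and it completes high-precision reconstruction with single-GPU inference in seconds.

Hunyuan World Model 1.1 is also the industry's first unified (any-to-any), feedforward large model for 3D reconstruction. It accepts additional multimodal priors such as camera and depth inputs, and it produces point clouds, depth, camera parameters, surface normals, and novel view synthesis as unified multi-task outputs, reaching a new SOTA.

In terms of results, whether for 3D point cloud reconstruction or end-to-end 3DGS reconstruction, Hunyuan World Model 1.1 shows geometric accuracy and detail recovery ahead of its peers, enabling more stable and more realistic scene reconstruction.

First, a look at what Hunyuan World Model 1.1 can generate:

Anime-style virtual scenes are "so easy" for it: in a flash you are standing on a street corner in a European town, and it feels as if you could start exploring a game map the next second (doge).

Chinese-style scenes are no problem either; the stone lanterns and roof beams in the background are reproduced faithfully.

Real aerial footage is also packed with detail, practically a ready-to-use promo vlog for a scenic area. Who can still tell whether it is AI-generated?

So, without further ado, let's ...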
 A world first! High-performance humanoid robots that can run and jump enter the under-10,000-yuan era
 量子位· 2025-10-22 09:12
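梦瑶, from 凹非寺. 量子位 | Official account: QbitAI

After all the competition among humanoid robots, there is finally one that is sold to us for home use. A price in the thousands of yuan, less than one meter tall, able to run, jump, and play along.

This is Bumi, the world's first high-performance humanoid robot priced under 10,000 yuan.

Look at that dancing posture: on the beat, relaxed, and confident! The size is just right too. At 12 kg, it is no burden at all to pick up, so easy!

△ The image shows a prototype of 小布米 Bumi; please look forward to the final product

Next, let's see how many hidden skills Bumi really has!

Under 10,000 yuan, take home a robot that can run and be trained

When it comes to humanoid robots, most people's first impression is probably still the kind of small show seen at exhibitions, stumbling a few steps and waving a hand, or a round of "fancy stunts" on some official accounts and video channels.

They always feel far from everyday life, and at 30,000 to 50,000 yuan apiece, they are simply not something ordinary people can casually afford...

But Bumi is different. It is the first home humanoid robot truly aimed at consumers, and it pushes the price straight to under 10,000 yuan, a price point that is a first for the industry.

The reaction to this kind of product used to be: "Can't afford it, can't afford it." (T^T)

Now it has become: "Wait, it seems... I could afford one too???" (rubs hands)

You can think of Bumi as a "programming teacher that walks + a play partner that dances". ...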
 A rundown of all the ICCV awards: congratulations to 朱俊彦 (Jun-Yan Zhu)'s team on winning Best Paper
 量子位· 2025-10-22 05:48
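时令, from 凹非寺. 量子位 | Official account: QbitAI

Just now, the much-anticipated ICCV 2025 officially announced its awards in Hawaii, USA!

Remarkably, among the authors who submitted papers, China accounts for half the field: exactly 50%.

The awards, the highlight of the event, brought a stream of good news, and the venue was packed. Fortunately, "詹姆斯邦迪" (a Xiaohongshu blogger, worth a follow), who attended in person, shared the latest updates right away.

Let's take a look at the top-conference honors and see who took them home this year.

(Image credit throughout: Xiaohongshu blogger @詹姆斯邦迪)

Best Paper Award (Marr Prize):
Generating Physically Stable and Buildable Brick Structures from Text.

Best Paper Honorable Mention:
Spatially-Varying Autofocus.

Best Student Paper Award:
FlowEdit: Inversion-Free Text-Based Editing Using Pre-Trained Flow Models.

Best Student Paper Honorable Mention ...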
 Qwen Deep Research upgraded overnight! It can generate web pages and audio podcasts, and the new model can read doctors' handwriting
 量子位· 2025-10-22 05:48
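梦晨, from 凹非寺. 量子位 | Official account: QbitAI

Qwen's deep research feature is evolving fast; overnight it gained audio and visual output: it can now generate web pages and audio.

The content compiled by AI deep research becomes a richly illustrated web page that can be deployed with one click and viewed by anyone with the link, which makes it easy to share externally.

Long-form text can also be turned into an audio podcast, convenient for absorbing in spare moments.

Compared with the previously popular NotebookLM, using deep research as the input also removes the step of supplying content to the AI.

While improving product features, the Qwen team keeps updating the models behind them. The latest vision-language model, Qwen3 VL, can even recognize hell-level-difficulty doctors' handwriting.

Hands-on with the new Qwen deep research

With OpenAI's newly released ChatGPT Atlas, the AI browser category now has quite a few products. So how should one choose? This job is a very good fit for a deep research product.

When the deep research feature is opened, it selects the strongest model, Qwen3-Max, by default.

It does not dive straight in; it first confirms the user's specific intent.

Once confirmed, the agent works step by step, taking 6 minutes in total.

When it finishes, the user gets a conventional AI text reply along with a downloadable PDF file.

| 特 Perplexity Comet | The Browser Company | OpenAI AI | ...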
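 Chinese mathematicians land another paper in the "big four" math journals, a first for Lanzhou University: breaking the "smoothness" restriction on the Stokes equations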
 量子位· 2025-10-22 05:48
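鱼羊 and 闻乐, from 凹非寺. 量子位 | Official account: QbitAI

Lanzhou University has just produced a paper in one of the "big four" math journals!

The authors are Professor 耿俊 (Jun Geng) of Lanzhou University and Professor 申仲伟 (Zhongwei Shen) of Westlake University, and the paper has been accepted by Inventiones mathematicae.

Annals of Mathematics, Acta Mathematica, Inventiones mathematicae, and the Journal of the American Mathematical Society are together known as the big four math journals. They are recognized by the international mathematics community as the top journals in the field, and Chinese research institutions often place no more than 10 papers in them per year.

The research concerns one of the foundations of fluid mechanics: the Stokes equations. Specifically, it studies L-infinity (sup-norm) resolvent estimates for the Stokes operator in nonsmooth domains.

Don't panic; here is a simplified translation. Roughly, the two mathematicians wanted to understand the range and regularity of solutions to the equations of fluid motion in spaces whose boundaries are not so regular, such as a natural riverbed rather than a smooth pipe. It can be understood as finding more general mathematical laws for the Stokes equations across a fairly wide class of domains.

This is also Lanzhou University's first paper in the big four.

Revealing more universal laws for the Stokes equations in nonsmooth domains

The two mathematicians targeted a key gap in the theory of fluid mechanics: for the Stokes equations, which describe the motion of viscous fluids, no reliable maximum-value bound had yet been found for the fluid's velocity and pressure in domains with nonsmooth boundaries.

$$\left\{\begin{array}{ll}-\D ...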
 OpenAI releases its first ChatGPT browser! Free to download and use right now
 量子位· 2025-10-21 23:50
Core Viewpoint - OpenAI has launched ChatGPT Atlas, an AI-native browser that integrates ChatGPT's capabilities directly into the browsing experience, aiming to redefine how users interact with the web and search for information [1][7][11].

Group 1: Features of ChatGPT Atlas
- Each tab in the Atlas browser integrates ChatGPT for direct conversation, allowing users to ask questions about the current webpage without switching tabs or copying and pasting [12][14].
- The browser includes a context-aware assistant that tailors its responses to the content being viewed, enhancing user interaction [14].
- A memory feature allows ChatGPT to remember key information from previous browsing sessions, so users can retrieve relevant data without re-explaining context [15][17].
- The "Cursor Chat" function lets users select text and have ChatGPT edit or rewrite it, improving efficiency in tasks like email replies and report organization [18].
- Agent Mode allows ChatGPT to perform a series of tasks on behalf of the user, such as research, form filling, and making reservations, streamlining the browsing experience [20][22].

Group 2: Strategic Intent and Market Positioning
- The launch of the Atlas browser is seen as a strategic move to compete directly with Google, especially with the anticipated release of Gemini 3, which may reshape browser functionality [32][33].
- OpenAI aims to establish a new traffic entry point and redefine search and advertising models, moving away from traditional keyword-based search to a conversational interface [34][35].
- The introduction of a subscription model for the Agent features indicates a shift towards a new business model centered on browser and agent integration, potentially aligning with existing app ecosystems [36][38].

Group 3: Industry Implications
- The development of the ChatGPT Atlas browser signals a transformation in browser functionality from simple web navigation to a platform for intelligent assistance and task automation [38][39].
- The evolution of AI capabilities from passive recommendations to active execution of tasks marks a significant trend, affecting sectors such as e-commerce, travel, and financial services [39][40].










