量子位

 Surpassing NVIDIA's Describe Anything! CAS and ByteDance jointly propose "GAR", building on DeepSeek-OCR
 量子位· 2025-10-28 05:12
Core Insights
- The article discusses "Vision as Context Compression," the innovative approach proposed by DeepSeek-OCR that uses OCR capability to compress documents through images [1]
- A collaboration between the Chinese Academy of Sciences and ByteDance introduces "Grasp Any Region" (GAR), which explores the potential of natural images as a medium for text compression [2]
- GAR's precise region-captioning capability is highlighted as a potential pathway for constructing dense captions for natural images [4]

Summary by Sections

GAR Capabilities
- GAR has three main abilities: accurately describing user-specified regions, modeling relationships among multiple regions, and performing complex compositional reasoning [5][7]
- The model lets users provide various visual prompts and instructions for precise understanding of specific regions [9][10]

Importance of Region MLLMs
- Region MLLMs differ from traditional MLLMs by enabling fine-grained, interactive understanding of image and video content [8]
- The article notes that full-image captions are hard to evaluate objectively, whereas region captions can be assessed against color, texture, shape, and material [12]

Trade-off Between Local and Global Information
- Region MLLMs face a dilemma in balancing local detail against global context [15]
- Examples illustrate how GAR outperforms models such as DAM in accurately identifying and describing specified regions [18][19]

Model Design and Mechanism
- GAR's design follows the principle of achieving fine-grained understanding while retaining global context [39]
- A lightweight prompt-encoding mechanism and RoI-Aligned Feature Replay enable high-fidelity feature extraction from specified regions [46][49]

Data Pipeline and Training
- Training proceeds in multiple stages to strengthen recognition and to support multi-region associative reasoning [57][59][61]
- GAR-Bench is introduced to systematically evaluate the region-level understanding capabilities of multimodal large language models (MLLMs) [64]

Performance Evaluation
- GAR models achieve superior results across benchmarks, scoring highly on both single-region and multi-region understanding tasks [71][74]
- The results show GAR generates rich, accurate, and detailed local descriptions, establishing it as a state-of-the-art solution [77]

Zero-shot Transfer to Video Tasks
- GAR's capabilities extend to video tasks, with strong zero-shot performance that even surpasses models trained specifically for video [79]
- The article closes with GAR's potential for training multimodal understanding models and improving adherence to complex text instructions [80][81]
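The RoI-Aligned Feature Replay mechanism mentioned above is described only at a high level. As a rough illustration of the underlying idea — re-extracting a fixed-size, high-fidelity feature grid for a user-specified region and appending it to the global image tokens — here is a minimal numpy sketch. Nearest-neighbour sampling stands in for true bilinear RoI-Align, and every name and shape is an illustrative assumption, not GAR's actual implementation:

```python
import numpy as np

def roi_feature_replay(feat, box, out_size=4):
    """Re-crop a fixed-size grid of features for a user-specified box.

    Simplified nearest-neighbour stand-in for RoI-Align: sample an
    out_size x out_size grid of cell centres inside `box` (normalised
    x0, y0, x1, y1) from the full-image feature map `feat` (C, H, W).
    """
    C, H, W = feat.shape
    x0, y0, x1, y1 = box
    xs = np.linspace(x0, x1, out_size * 2 + 1)[1::2] * (W - 1)  # cell centres
    ys = np.linspace(y0, y1, out_size * 2 + 1)[1::2] * (H - 1)
    cols = np.clip(np.round(xs).astype(int), 0, W - 1)
    rows = np.clip(np.round(ys).astype(int), 0, H - 1)
    return feat[:, rows][:, :, cols]                            # (C, out, out)

# Global context tokens plus replayed local tokens for one region.
rng = np.random.default_rng(0)
fmap = rng.standard_normal((8, 16, 16))            # (C, H, W) ViT-style feature map
local = roi_feature_replay(fmap, (0.25, 0.25, 0.75, 0.75))
tokens = np.concatenate([fmap.reshape(8, -1).T,    # 256 global tokens
                         local.reshape(8, -1).T])  # + 16 high-fidelity RoI tokens
print(tokens.shape)  # (272, 8)
```

The point of the design, per the summary, is that the region tokens are *added to* rather than *replacing* the global tokens, so local detail is gained without losing scene context.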
 Another blow to VAE! Tsinghua and Kuaishou unveil the SVG diffusion model: 6200% training efficiency gain, 3500% generation speedup
 量子位· 2025-10-28 05:12
Core Viewpoint
- The article discusses the shift away from Variational Autoencoders (VAE) toward new models such as SVG, developed by Tsinghua University and Kuaishou, highlighting large gains in training efficiency and generation speed and addressing VAE's semantic-entanglement limitation [1][4][10]

Group 1: VAE Limitations and New Approaches
- VAE is being abandoned because of semantic entanglement: adjusting one feature perturbs others, complicating generation [4][8]
- The SVG model achieves a 62-fold improvement in training efficiency and a 35-fold speedup in generation over traditional methods [3][10]
- The RAE approach focuses solely on generation performance by reusing pre-trained encoders, whereas SVG targets multi-task versatility by building a feature space that integrates semantics and details [11][12]

Group 2: SVG Model Details
- SVG uses the pre-trained DINOv3 model for semantic extraction, cleanly separating features of different categories such as cats and dogs and thereby resolving semantic entanglement [14]
- A lightweight residual encoder captures the high-frequency details DINOv3 may overlook, ensuring a complete feature representation [14]
- A distribution-alignment mechanism preserves the semantic structure while integrating detail features; removing it causes FID to rise sharply [15][16]

Group 3: Performance Metrics
- In experiments SVG outperformed traditional VAE models across metrics, reaching an FID of 6.57 on ImageNet after 80 epochs versus 22.58 for the VAE-based SiT-XL [18]
- FID drops to 1.92 after 1400 epochs, approaching top-tier generative models [18]
- SVG's feature space transfers directly to tasks such as image classification and semantic segmentation without fine-tuning, reaching 81.8% Top-1 accuracy on ImageNet-1K [22]
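The article does not publish SVG's architecture, so the following is only a toy sketch of the fusion idea described above — frozen semantic features plus a lightweight residual detail branch, with the residual's distribution aligned to the semantic branch before the two are concatenated into one latent. All names, shapes, and the moment-matching form of "alignment" here are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def align(residual, target):
    """Distribution alignment (toy version): match the residual branch's
    per-channel mean/std to the semantic branch, so high-frequency detail
    features do not distort the semantic structure of the latent space."""
    r = (residual - residual.mean(0)) / (residual.std(0) + 1e-6)
    return r * target.std(0) + target.mean(0)

sem = rng.standard_normal((256, 32)) * 3.0 + 1.0  # frozen semantic-encoder tokens
res = rng.standard_normal((256, 8)) * 10.0        # raw high-frequency residuals
latent = np.concatenate([sem, align(res, sem[:, :8])], axis=1)
print(latent.shape)  # (256, 40): semantics and aligned details in one latent
```

The ablation cited above (FID rising sharply when alignment is removed) is consistent with this picture: unaligned residuals with a very different scale would dominate the latent and break the semantic layout.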
 Huawei's world model is here! A single GPU generates a 272㎡ scene in 30 minutes
 量子位· 2025-10-28 05:12
Wen Le, from Aofeisi
QbitAI | Official account QbitAI

The AI mega-house is really here. Huawei, together with Shanghai Jiao Tong University and Huazhong University of Science and Technology, has released the world model WorldGrow, which can generate ultra-large indoor scenes of 1,800㎡ (19x39 blocks); a single GPU produced 272㎡ in just 30 minutes.

The scenes have coherent geometric topology and photorealistic appearance, and agents autonomously plan navigation paths through the complex spatial layouts. The virtual humans inside navigate smoothly without ever getting lost. (Quietly: a big open-plan flat really does need navigation.)

Some earlier methods can build at most a single room and stall when extending it into a full suite; WorldGrow, by contrast, keeps building as it goes. So how are the scenes assembled?

Coherent geometric topology and photorealistic appearance

Building a decent large 3D scene used to be full of pitfalls. Some techniques first have a 2D model paint an image and then force it into 3D; change the viewpoint and the sofa legs are crooked and the wall textures break. Worse, there is no layout logic: refrigerators end up in bedrooms and beds in kitchens.

Now WorldGrow takes on the renovation job (so to speak) with three core techniques. Step one is precise data preprocessing: extract high-quality samples from large-scale datasets such as 3D-FRONT, slice scenes with Blender, partition them into blocks via Boolean intersection, and use occupancy detection to ensure block content density (visible content ≥95%). ...
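The preprocessing step described above — partitioning a scene into blocks and keeping only those whose visible-content density clears the 95% threshold — can be sketched on a toy 2D occupancy grid. This is a hedged illustration only, not the paper's Blender pipeline; the grid, block size, and function names are all assumptions:

```python
import numpy as np

def block_occupancy(voxels):
    """Fraction of cells in a scene block that contain visible content."""
    return voxels.mean()

def filter_blocks(scene, block=8, min_occ=0.95):
    """Partition a binary occupancy grid into block x block tiles and keep
    only the tiles whose visible-content density meets the threshold."""
    H, W = scene.shape
    kept = []
    for i in range(0, H - block + 1, block):
        for j in range(0, W - block + 1, block):
            tile = scene[i:i + block, j:j + block]
            if block_occupancy(tile) >= min_occ:
                kept.append((i, j))
    return kept

scene = np.ones((16, 16))
scene[:8, :8] = 0                 # an empty quadrant fails the 95% test
print(filter_blocks(scene))       # only the three dense quadrants survive
```

Filtering out sparse blocks up front is what keeps the later block-by-block scene growth from propagating empty or degenerate content.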
 AI Annual Rankings registration is open! Five awards, seeking the pioneering forces of the AI+ era
 量子位· 2025-10-28 05:12
Organizing Committee, from Aofeisi
QbitAI | Official account QbitAI

To let more practitioners feel the leap of the intelligence wave, and to offer applause and encouragement to fellow travelers, we are officially opening registration for the "2025 Artificial Intelligence Annual Rankings". The selection spans three dimensions — companies, products, and people — with five award categories. Companies are warmly invited to apply! Let us witness the stars of the year together and light the way forward.

Company rankings:
- 2025 AI Leading Company of the Year
- 2025 AI Most Promising Startup of the Year
Product rankings:
- 2025 AI Outstanding Product of the Year
- 2025 AI Outstanding Solution of the Year
People rankings:
- 2025 AI Focus Figure of the Year

Detailed criteria and registration information follow.

The 2025 AI Leading Company of the Year award will select the most comprehensively capable companies in China's AI field. Eligibility:
1. Registered in China, or with main business primarily serving the Chinese market;
2. Main business in AI and related industries, or AI widely applied to the main business, with a leading position in its segment;
3. Mature products or services with real customer adoption and market recognition;
4. Significant breakthroughs in the past year in technological innovation, product deployment, market expansion, or business model.

Selection criteria: the 2025 AI Most Promising Startup of the Year award focuses on innovative entrepreneurial forces in China's AI field, selecting those with the greatest investment value and ...
 Two major math prizes awarded to Wang Hong on the same day! Three Peking University alumni sweep the "Chinese Fields Medal"
 量子位· 2025-10-28 05:12
Mengchen and Luyu, from Aofeisi
QbitAI | Official account QbitAI

Two heavyweight mathematics prizes were awarded on the same day, and Wang Hong is among the winners of both.

The international 2025 Salem Prize went to Wang Hong and Vesselin Dimitrov. The Salem Prize is seen as a bellwether for the Fields Medal: of the 56 Salem laureates from 1968 to 2024, 10 went on to win the Fields Medal. Terence Tao, for instance, won the Salem Prize in 2000 and the Fields Medal in 2006. After the announcement, Tao was among the first to post his congratulations.

The other award, the ICCM Mathematics Gold Medal of the International Congress of Chinese Mathematicians, went to Wang Hong, Deng Yu, and Yuan Xinyi — all alumni of Peking University's School of Mathematical Sciences. The congress, founded by Shing-Tung Yau and held every three years, restricts the award to mathematicians under 45, like the Fields Medal, and is known as the "Chinese Fields Medal".

From switching majors midway to tenured professor at a top mathematics institution

Wang Hong's résumé is exceptional even among top students. She originally studied in Peking University's School of Earth and Space Sciences, then transferred to mathematics out of love for the subject. After graduating from PKU in 2011, she went to École Polytechnique in France for further study and earned a master's degree at Université Paris-Sud (Paris XI). In 2019 she completed her PhD at MIT under the noted mathematician Larry Guth, then did postdoctoral research at the Institute for Advanced Study in Princeton, and joined UCLA as an assistant professor in 2021. ...
 Hangzhou's reign atop the global open-source model rankings is over: Shanghai's Minimax M2 launches to a flood of orders, at just 8 RMB per million tokens
 量子位· 2025-10-28 01:18
Core Insights
- The open-source model throne has shifted to Minimax M2: the previous leaders, Hangzhou-based DeepSeek and Qwen, have been displaced by Shanghai-based Minimax [1]

Performance and Features
- Minimax M2 scored 61 in the Artificial Analysis test, ranking it the top open-source model, just behind Claude 4.5 Sonnet [2]
- The model is designed specifically for agents and programming, showing exceptional coding and agent performance [4]
- Minimax M2 is economical: its reasoning speed is twice that of Claude 3.5 Sonnet, while its API pricing is only 8% of Claude's [5][9]
- The model has 230 billion total parameters with only 10 billion active, enabling rapid execution [9][10]
- It uses an interleaved thinking format, crucial for planning and verifying operations across multi-turn dialogues and strengthening agent reasoning [11]

Comparative Analysis
- In overall rankings, M2 placed fifth in the Artificial Analysis test, first among open-source models [14]
- The test drew on ten popular datasets, including MMLU Pro and LiveCodeBench [15]
- M2 is priced at $0.3 per million input tokens and $1.2 per million output tokens, only 8% of Claude 3.5 Sonnet's cost [16]

Agent Capabilities
- Minimax has deployed M2 on an agent platform for limited free use, showcasing a range of projects built with the model [32][35]
- The platform lets users create diverse web applications and even replicate classic games in the browser [36][38]
- Users have built projects such as an online Go platform, demonstrating M2's programming ability [40][43]

Technical Insights
- M2 ultimately uses full attention; an initial plan for a hybrid mechanism combining full attention with sliding-window attention was abandoned due to performance concerns [45][46]
- The choice of attention mechanism reflects Minimax's strategy of optimizing for long-range-dependency tasks [49][54]
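The per-million-token prices quoted above make cost estimates straightforward. A small sketch — only the $0.3 input and $1.2 output prices come from the article; the session sizes are made-up inputs for illustration:

```python
# Token-cost arithmetic using the per-million-token prices quoted in the
# article for Minimax M2. The example session sizes are invented.
M2_INPUT, M2_OUTPUT = 0.30, 1.20  # USD per million tokens

def api_cost(in_tokens, out_tokens, in_price, out_price):
    """Cost in USD for one call, given per-million-token prices."""
    return in_tokens / 1e6 * in_price + out_tokens / 1e6 * out_price

# A long agent session: 2M input tokens, 0.5M output tokens on M2.
cost = api_cost(2_000_000, 500_000, M2_INPUT, M2_OUTPUT)
print(f"${cost:.2f}")  # $1.20
```

Agent workloads are input-heavy (long tool transcripts re-sent every turn), which is why low input pricing and a small active-parameter count matter so much for this class of model.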
 Thinking Machine's new research goes viral! Combining the strengths of RL and fine-tuning, small-model training gets more cost-effective
 量子位· 2025-10-28 01:18
Core Insights
- The article discusses Thinking Machine's research on On-Policy Distillation, a new training method for small language models that deepens their grasp of specialized fields [1][4]

Summary by Sections

Methodology
- On-Policy Distillation combines the strengths of two traditional training approaches — reinforcement learning (self-exploration) and supervised fine-tuning (direct answers) — into a more efficient framework [3][8]
- The method lets the model learn by solving problems itself while receiving immediate guidance when it gets stuck, improving training efficiency by 50-100x [4][5]

Training Phases
- Training comprises three phases: pre-training (general capabilities), mid-training (domain knowledge), and post-training (target-behavior guidance) [9]
- The research focuses on the post-training phase, where the model learns to perform specific tasks effectively [6][9]

Evaluation Metrics
- The method uses negative reverse KL divergence as its key signal: the student learns by minimizing its divergence from the teacher on the student's own sampled outputs [12][15]

Experimental Results
- Experiment 1: with On-Policy Distillation, a smaller 8B model reached 70% on a math benchmark at far lower computational cost than traditional methods [19][22]
- Experiment 2: the method mitigates "catastrophic forgetting," letting models retain general capabilities while acquiring new knowledge [23][25]

Implications
- The research suggests On-Policy Distillation can let resource-constrained individuals and small companies train effective specialized models, broadening access to AI development [5][19]
- The findings point to a promising route toward lifelong learning in AI systems, balancing new knowledge acquisition against retention of existing skills [26]
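When the trajectory is sampled from the student, the reverse KL, KL(student ‖ teacher), reduces per token to the student-teacher log-probability gap on the tokens the student actually produced, and its negation serves as the reward. A minimal numpy sketch — the log-prob values are invented for illustration:

```python
import numpy as np

def reverse_kl(student_logp, teacher_logp):
    """Per-token reverse KL, KL(student || teacher), estimated on a
    trajectory sampled FROM THE STUDENT: simply the log-prob gap on
    the tokens the student produced."""
    return student_logp - teacher_logp

# Toy 4-token trajectory: per-token log-probs under each model.
student_logp = np.array([-0.2, -0.1, -2.5, -0.3])
teacher_logp = np.array([-0.3, -0.1, -6.0, -0.2])

per_token = reverse_kl(student_logp, teacher_logp)
reward = -per_token       # the distillation reward is the negative reverse KL
print(per_token)          # ≈ [0.1, 0.0, 3.5, -0.1]
```

Because the gap is computed on the student's own mistakes (here, token 2, where the teacher strongly disagrees), every sampled token carries a dense learning signal — which is the intuition behind the 50-100x efficiency claim over sparse RL rewards.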
 Fine-tuning is dead! A "consensus mechanism" lets prompts evolve themselves, sending performance soaring
 量子位· 2025-10-28 01:18
Contributed by Westlake University's MAPLE Lab
QbitAI | Official account QbitAI

The AI field is undergoing a paradigm shift from "model fine-tuning" to "context engineering". By putting more explicit instructions and richer, more detailed knowledge into the input, context engineering avoids costly training, does not depend on access to open model weights, and offers users and developers stronger interpretability. It is becoming the core paradigm for building high-performance, scalable, self-improving AI systems. Hence "fine-tuning is dead" has lately become a widely echoed slogan in the field.

A single prompt, however well optimized, may still fail on particular inputs. Cooperation among multiple prompts is a natural remedy for this flaw: where one prompt fails, others can compensate. If an AI system can extract the "consensus" among the answers generated from multiple prompts, it is more likely to output the correct answer.

Building on this idea, Professor Guojun Qi's team at Westlake University's MAPLE Lab proposed C-Evolve, a prompt-group evolution algorithm based on a "consensus mechanism". Unlike prior work that optimizes a single prompt, C-Evolve uses an evolutionary algorithm to produce a group of prompts; each prompt processes the input independently, and the consensus of all outputs is extracted to achieve the best task performance. To this end, the team introduced a novel evolutionary metric, the "consensus voting score", which evaluates how well an individual prompt performs when working in a gro ...
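The consensus-extraction step can be illustrated with the simplest possible stand-in: a majority vote over the answers a prompt group produces. C-Evolve's actual consensus mechanism and voting score are more involved, so treat this as a hedged sketch with invented data:

```python
from collections import Counter

def consensus_answer(answers):
    """Extract the consensus among answers produced by a group of prompts.
    A plain majority vote stands in for C-Evolve's consensus step."""
    return Counter(answers).most_common(1)[0][0]

# Three prompts answer the same question independently; one of them
# fails, but the group's consensus is still correct.
answers = ["42", "42", "41"]
print(consensus_answer(answers))  # 42
```

This also shows why a prompt is scored by its contribution *in a group* rather than alone: a prompt that is individually mediocre can still be valuable if its errors are uncorrelated with the rest of the group's.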
 Bill Gates' daughter goes into AI entrepreneurship too! Her fashion e-commerce startup was just handed $8 million
 量子位· 2025-10-27 08:26
Core Viewpoint
- Phoebe Gates and Sophia Kianni's startup, Phia, has raised $8 million in seed funding to reinvent online shopping with AI, attracting notable investors from the entertainment industry [6][7][8]

Company Overview
- Phia is an AI-driven shopping assistant, launched in April 2025, that helps users compare prices of new and second-hand items in real time [12][14]
- The application gained over 600,000 users within six months of launch [13]
- Phia's database connects to top resale platforms, covering more than 250 million items [20]

Funding and Growth
- The $8 million will be used to build a world-class team across engineering, AI research, product development, and marketing [7]
- The company already operates across more than 40,000 shopping websites and has partnered with over 5,000 brands [22]

Market Context
- Global e-commerce sales are projected to grow roughly tenfold, from about $0.6 trillion in 2010 to about $6.4 trillion by 2025 [32]
- Despite that growth, shopping technology and user experience have stagnated, creating demand for more efficient shopping tools [30][35]

Founders' Background
- Phoebe Gates and Sophia Kianni met as roommates at Stanford University and set out to tackle the common problem of shopping anxiety [41][47]
- Sophia Kianni has a notable background in climate activism and was appointed a UN youth advisor at age 18 [63][66]
- Phoebe Gates, Bill Gates' youngest daughter, aims to build an identity and success of her own outside the family legacy [75][81]
 01.AI unveils a new executive lineup as Kai-Fu Lee doubles down on ToB 2.0
 量子位· 2025-10-27 08:26
Core Viewpoint
- The company is accelerating its ToB strategy, shifting from a product-oriented approach to a systematic operating model [1][14]

Leadership Changes
- The company announced a new round of executive appointments — co-founder Shen Pengfei, VP of AI Models and Professional User Products Zhao Binqiang, and VP of International Business and AI Consulting Ning Ning — forming a three-way synergy across market and sales, model and technology, and international consulting [2][4][13]
- Shen Pengfei will oversee domestic ToB and ToG business expansion, drawing on 26 years of IT and internet experience to drive AI solution delivery [5][6]
- Zhao Binqiang, with 17 years in internet algorithms and AI, will lead core algorithm development and the professional-user product lines, supporting the strategic ToB business [8][13]
- Ning Ning will focus on global business expansion and AI consulting, landing AI strategies in key projects across multiple countries [10][11]

Strategic Framework
- The "One Leader Project" is framed as essential to AI transformation, requiring the CEO's direct involvement to embed AI into core processes [3][15]
- The company's self-developed "Wanzhi" enterprise model platform has been upgraded to version 2.0, supporting customized enterprise-level agents and multi-industry applications [17][21]
- The platform has been deployed across five major industries with over 30 types of "super employee" AI agents, aiming to become a new foundation for enterprise AI operations [18][20]

Market Positioning
- The strategic goal is to make AI capabilities replicable and scalable, closing the loop on enterprise-level AI delivery [20][21]
- The company has built lighthouse projects with leading clients in China and launched an ecosystem partner program to create multi-scenario solutions [22]
- Internationally, the AlemLLM language-model collaboration with Kazakhstan exemplifies the company's commitment to AI cooperation along the Belt and Road Initiative [23]

Future Outlook
- The company aims to use AI agents as a breakthrough point, promoting AI as a driver of enterprise transformation and extending its innovation to more countries and regions [24][25]
