A multimodal large model achieves pixel-level reasoning for the first time! 3B parameters surpass 72B traditional models, accepted at NeurIPS 2025
量子位· 2025-10-16 06:11
Core Insights
- The article discusses the introduction of UniPixel, a unified pixel-level multimodal model developed by a research team from Hong Kong Polytechnic University and Tencent ARC Lab, which aims to enhance visual reasoning capabilities in AI systems [2][4].

Group 1: Model Overview
- UniPixel is designed to perform three major tasks (referring, pixel-level segmentation, and reasoning) within a single model, showcasing flexibility, precision, and scalability [4][8].
- The model has been accepted for presentation at NeurIPS 2025, and its code, data, and demo are fully open-sourced [5].

Group 2: Technical Innovations
- UniPixel redefines visual reasoning by addressing a limitation of traditional visual question-answering systems, which often lack precise perception of specific regions or targets within images [8][9].
- The model incorporates an "Object Memory Bank" and supports three types of visual prompts (point, box, mask), enabling a comprehensive "perception-memory-reasoning" process [9][12].

Group 3: Architecture and Functionality
- The architecture of UniPixel is based on the Qwen2.5-VL model, allowing it to process various inputs, including images, videos, and text prompts, and to generate natural-language responses along with spatio-temporal masks [12][14].
- Key components include a Prompt Encoder for unified encoding of visual prompts, an Object Memory Bank for storing user-specified targets, and a Mask Decoder for generating precise temporal masks [19][21].

Group 4: Training and Evaluation
- Training followed a modular, phased strategy over approximately 1 million samples from various datasets to enhance adaptability to different tasks [28][29].
- Extensive experiments on 10 public benchmark datasets covering 9 major vision-language understanding tasks demonstrated superior performance on complex reasoning and segmentation tasks [31][33].

Group 5: Performance Metrics
- On the ReVOS reasoning-segmentation benchmark, UniPixel-3B achieved a score of 62.1 J&F, surpassing all existing models and indicating a strong capability to associate complex text prompts with pixel-level mask generation [33].
- The model also excelled on datasets such as MeViS, Ref-YouTube-VOS, and RefCOCO, showing leading performance across various visual-understanding tasks [33][34].

Group 6: Future Implications
- The introduction of UniPixel marks a significant milestone in multimodal AI's transition from "modal alignment" to "fine-grained understanding," effectively merging object referring and segmentation with language reasoning [47][48].
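The "perception-memory-reasoning" flow described above can be sketched in miniature. This is a hypothetical illustration: `ObjectMemoryBank`, `answer_with_masks`, and the prompt tuple format are invented for this sketch and are not UniPixel's actual API; only the three prompt types (point, box, mask) and the memory-bank idea come from the article.

```python
# Hypothetical sketch of the "perception-memory-reasoning" flow.
# Class and method names are illustrative, not UniPixel's real API.

class ObjectMemoryBank:
    """Stores user-referred targets so later reasoning turns can reuse them."""
    def __init__(self):
        self._objects = {}

    def store(self, name, payload):
        self._objects[name] = payload

    def recall(self, name):
        return self._objects.get(name)

def answer_with_masks(question, prompts, memory):
    """Toy pipeline: encode visual prompts, memorize targets, then 'reason'.

    Each prompt is (name, kind, payload), where kind is one of the three
    prompt types the article lists: 'point', 'box', or 'mask'.
    """
    for name, kind, payload in prompts:
        assert kind in {"point", "box", "mask"}
        memory.store(name, payload)           # perception -> memory
    referred = [memory.recall(n) for n, _, _ in prompts]
    return f"Answer about {len(referred)} referred object(s)", referred

memory = ObjectMemoryBank()
text, masks = answer_with_masks(
    "What is the left object doing?",
    [("obj1", "point", (120, 80)), ("obj2", "box", (10, 10, 50, 50))],
    memory,
)
```

The point of the sketch is the data flow only: prompts of all three kinds land in one shared memory, which both the answer and the mask outputs draw from.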
Registration for the AI Annual Awards is in full swing! Five awards seeking the pioneering forces of the AI+ era
量子位· 2025-10-16 06:11
Organizing Committee, from Aofeisi | QbitAI WeChat official account QbitAI

To let more practitioners feel the leap of the intelligence wave, and to give applause and encouragement to more fellow travelers, we are officially opening registration for the "2025 Artificial Intelligence Annual Awards".

This is the 8th year of QbitAI's annual AI awards. Over eight years, we have witnessed technological breakthroughs and real-world deployment, the integration and reshaping of industries, and wave after wave of companies, people, and products pushing the era forward.

In an age where artificial intelligence is redefining everything, intelligent technology is no longer a single tool but a driving force in the co-evolution of industry and society. Through this annual selection, we hope to discover and honor the explorers and practitioners who truly lead change and push boundaries.

The selection spans three dimensions (companies, products, and people) with five award categories. Companies are warmly invited to apply! Let us witness the stars of the year together and light the way forward. Detailed criteria and registration details follow.

Company list:
- 2025 AI Annual Leading Enterprise: selecting the most comprehensively capable companies in China's AI field (entry requirements and evaluation criteria below)
- 2025 AI Annual Promising Startup

Product list:
- 2025 AI Annual Outstanding Product
- 2025 AI Annual Outstanding Solution

People list:
- 2025 AI Annual Person in Focus: focusing on Chinese ...
Cook sells iPhones on Douyin while the M5 chip quietly slips into the MacBook Pro; netizens: no Pro/Max, how dare you?
量子位· 2025-10-16 06:11
henry, from Aofeisi | QbitAI WeChat official account QbitAI

No sooner had Tim Cook finished livestream-selling the new iPhone on Douyin than Apple headquarters across the ocean announced the all-new M5 chip.

The new chip debuts in the next-generation MacBook Pro, iPad Pro, and Apple Vision Pro (priced in mainland China at 12,999 yuan, 8,999 yuan, and 29,999 yuan respectively).

The first puzzle: the base M5 chip going into a MacBook Pro?

According to Apple, the M5's 10-core GPU is equipped with Neural Accelerators, sharply speeding up GPU-based AI workloads. Peak GPU performance is more than 4x the M4's, and overall graphics performance is up to 45% faster. The M5's unified memory bandwidth also rises nearly 30%, from the M4's 120GB/s to 153GB/s.

These gains look impressive, but sharp-eyed netizens cut to the chase: no Pro or Max variants, and you put the base M5 straight into a MacBook Pro? Then what goes into the Air?

In fact, netizens' complaints about the new chip go well beyond this...

"Only Apple can do": only Apple could pull this off. Although Apple's senior vice president of hardware technologies Johny Srouji confidently stated that the M5 chip represents A ...
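The bandwidth figure above is easy to sanity-check with one line of arithmetic:

```python
# Quick check of the bandwidth claim: M4 unified memory bandwidth
# 120 GB/s -> M5 at 153 GB/s.
m4_gbps, m5_gbps = 120, 153
uplift = (m5_gbps - m4_gbps) / m4_gbps
print(f"{uplift:.1%}")  # 27.5%, i.e. "nearly 30%" as stated
```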
New Doubao model has Guo Degang shouting going-crazy literature: "(This job) I quit! I quit! I quit!!!"
量子位· 2025-10-16 06:11
Core Viewpoint
- The article discusses the advancements in AI voice technology by Huoshan Engine, particularly the upgrades to the Doubao voice synthesis and voice replication models, which enhance emotional expression and contextual understanding in AI-generated speech [5][11][41].

Group 1: AI Voice Technology Upgrades
- Huoshan Engine has upgraded its Doubao voice synthesis model to version 2.0, which allows for better emotional expression and understanding of dialogue [7][11].
- The upgrade includes two main models, the Doubao voice synthesis model 2.0 and the Doubao voice replication model 2.0, enabling AI to replicate voices and understand emotional nuance [7][8].
- The new models can interpret user instructions regarding emotions, dialects, tones, and speech rates, significantly improving the quality of AI-generated speech [12][21].

Group 2: Contextual Understanding and Emotional Expression
- The models can now incorporate context from previous dialogue, enhancing the coherence and emotional depth of the generated speech [12][23].
- Accuracy in reading complex formulas has improved: the Doubao model reaches around 90% accuracy on complex school-subject formulas, compared with less than 50% for similar models [24][25].
- These advances allow for more human-like interaction, moving from merely sounding human to truly understanding human emotion and context [11][41].

Group 3: Technological Innovations and Applications
- The Doubao large model 1.6 has been upgraded to support adjustable thinking lengths, allowing users to balance effectiveness, latency, and cost [30][33].
- Huoshan Engine has introduced a Smart Model Router, which matches user tasks with the most suitable models and cuts costs by up to 71% in cost-prioritized mode [39][41].
- The technology has been applied in various commercial scenarios, enhancing user experience in products from companies like Xiaomi and OPPO and improving complex-demand responses in platforms like Dongchedi [45][46].

Group 4: Growth and Infrastructure
- Daily token usage of the Doubao large model has surged from 120 billion to over 30 trillion, a 253-fold increase in just over a year [47][48].
- This growth is supported by Huoshan Engine's robust AI cloud infrastructure, which provides the computational power and high-quality data needed for model training and inference [48].
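A cost-aware model router of the kind described above can be sketched as a simple selection rule: pick the cheapest model that meets a task's quality bar, or fall back to the best available. The model names, quality scores, and prices below are invented for illustration; this is not Huoshan Engine's actual router logic.

```python
# Toy sketch of a "smart model router". All numbers are made up.
MODELS = [
    {"name": "small",  "quality": 0.70, "cost_per_1k": 0.2},
    {"name": "medium", "quality": 0.85, "cost_per_1k": 1.0},
    {"name": "large",  "quality": 0.95, "cost_per_1k": 4.0},
]

def route(min_quality, mode="cost"):
    """Pick the cheapest model meeting the bar (cost mode) or the best one."""
    candidates = [m for m in MODELS if m["quality"] >= min_quality]
    if not candidates:                      # nothing qualifies: best effort
        return max(MODELS, key=lambda m: m["quality"])
    if mode == "cost":
        return min(candidates, key=lambda m: m["cost_per_1k"])
    return max(candidates, key=lambda m: m["quality"])
```

In a cost-prioritized mode, easy tasks never touch the expensive model, which is the mechanism behind the kind of cost reduction the article reports.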
With Cook in Beijing, an Android "AiPhone" goes on sale in his face at 4,499 yuan!
量子位· 2025-10-16 01:33
Core Viewpoint
- The article discusses the launch of the Honor Magic8 series, highlighting its advanced AI capabilities and hardware improvements and positioning it as a competitive alternative to Apple's iPhone, particularly in the AI smartphone market [3][52].

Group 1: Product Features
- The Honor Magic8 series was officially launched with a starting price of 4,499 yuan [3].
- It features a new YOYO AI system that can learn and evolve, enhancing user interaction and functionality [4][24].
- The series includes a standard version and a Pro version, maintaining a familiar design while introducing new color options inspired by Song Dynasty ceramics [9][12].

Group 2: Battery and Performance
- The Magic8 series boasts a battery capacity exceeding 7000mAh, the largest in Honor's history, along with 120W fast charging [15].
- It is powered by a 3nm processor and runs MagicOS 10.0, with the Pro version achieving an AnTuTu score of over 4.28 million, a record in the smartphone industry [16][19][20].

Group 3: AI Capabilities
- The YOYO AI system is enhanced with a self-evolving model, allowing it to learn from user interactions and improve over time [24][25].
- YOYO can assist with online shopping, making it easier for users to find the best deals [26][28].

Group 4: Camera System
- The Magic8 Pro features a 200MP camera with advanced low-light capabilities and a new stabilization system to reduce motion blur [33][42].
- The camera system is designed to perform well in various lighting conditions, competing effectively with high-end models like the iPhone 17 Pro [38][40].

Group 5: Future Developments
- Honor also introduced the MagicPad3 Pro tablet, which shares the same high-performance processor as the Magic8 series [45][46].
- A future AI terminal, the ROBOT PHONE, is expected to be unveiled in 2026, showcasing Honor's commitment to innovation in AI technology [50].
AI digs up a potential new cancer therapy! Google and Yale join forces to crack the immune system's "cold tumor" problem
量子位· 2025-10-16 01:33
Core Viewpoint
- The article discusses a significant advancement in cancer treatment from a collaboration between Google and Yale: a new AI model called Cell2Sentence-Scale 27B, which aims to enhance immune signals in cold tumors, a challenging area in cancer immunotherapy [1][2][4].

Group 1: AI and Cancer Treatment
- The Cell2Sentence-Scale 27B model was developed to identify drugs that can enhance immune signals in specific immune environments, addressing the problem of cold tumors that evade immune detection [4][12].
- The model has been made available to the research community, promoting collaboration and further research in the field [5].

Group 2: Cold Tumors Explained
- Cold tumors are characterized by a lack of immune signals, making them difficult for the immune system to recognize and attack [7][10].
- Unlike hot tumors, which attract immune cells, cold tumors can suppress immune activity and disguise their presence [8][9].

Group 3: Model Testing and Findings
- The model simulated two immune environments, one with low levels of interferon and another completely devoid of immune signals, and tested over 4,000 drugs [14][16].
- The most promising candidate identified was the CK2 inhibitor silmitasertib, which showed potential when combined with low-dose interferon to enhance antigen presentation, a critical step for immune recognition of tumors [16][17].
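The screening setup described above, scoring thousands of compounds in simulated immune contexts and ranking them, can be sketched as a toy loop. The scoring function here is a deterministic random stand-in for the model's predicted immune-signal change; no real drug or biology data is involved.

```python
# Toy in-silico screening loop: rank 4,000 placeholder compounds by the
# *worst-case* simulated boost across two virtual immune contexts
# (mirroring the low-interferon and no-signal environments above).
import random

def simulated_boost(drug, context):
    """Stand-in for the model's predicted immune-signal change (0..1)."""
    return random.Random(f"{drug}|{context}").random()

drugs = [f"drug_{i}" for i in range(4000)]

def screen(drugs, contexts=("low_interferon", "no_signal")):
    scores = {d: min(simulated_boost(d, c) for c in contexts) for d in drugs}
    return sorted(scores, key=scores.get, reverse=True)

ranked = screen(drugs)
top10 = ranked[:10]   # candidates worth follow-up in this toy setting
```

Taking the minimum across contexts rewards compounds that help in *both* environments, which is the spirit of looking for drugs that warm up cold tumors regardless of baseline interferon level.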
Is "importance sampling" not so "important"? Kuaishou and Tsinghua's ASPO tackles importance-sampling weight mismatch
量子位· 2025-10-15 10:20
Core Insights
- Reinforcement Learning (RL) has become a crucial component of the post-training phase of Large Language Models (LLMs) like ChatGPT and DeepSeek [1].
- A significant issue has emerged with increasing model scale: the importance sampling (IS) mechanism may not be as beneficial as previously thought [2][5].
- The research team from Kuaishou and Tsinghua University identified a deep-rooted "weight mismatch" phenomenon in existing supervised RL paradigms, leading to overconfident models and problems such as entropy collapse and premature convergence [2][6].

Importance Sampling Issues
- Importance sampling is intended to correct the distribution differences between old and new policies, allowing models to reuse old data without deviating from the target distribution [5].
- In small-scale RL, IS is effective; however, it fails in the context of supervised RL for large language models [6].
- Experiments showed that in GRPO algorithms, IS did not provide the expected benefits and instead contributed to training instability [7].

Weight Mismatch and Self-Reinforcing Loops
- The research revealed that advantage values in supervised RL are inaccurate, as different tokens contribute differently to the final answer [8].
- The average IS weight for positive-advantage tokens is higher than for negative ones, driving entropy down [9].
- In supervised RL algorithms, IS has shifted from a correction term to a token-level weight, causing a self-reinforcing loop that reinforces high-scoring tokens while neglecting low-probability ones [11][12].

ASPO Algorithm Introduction
- The proposed ASPO (Asymmetric Importance Sampling Policy Optimization) algorithm addresses these issues by inverting the IS weights for positive-advantage tokens, allowing low-probability tokens to receive stronger updates [3][18].
- ASPO incorporates a Dual-Clipping mechanism to manage the extreme values resulting from the inverted weights, ensuring stability while maintaining effective gradient flow [20].

Experimental Results
- ASPO demonstrated significant advantages on various benchmarks, including mathematical reasoning and code generation tasks, outperforming traditional methods [24].
- Average performance improved by 12.5% on mathematical tasks and 17.0% on code generation tasks, with smoother training curves and reduced entropy collapse [26].
- ASPO achieved notable results on the LiveCodeBench v5 benchmark, indicating its superiority over mainstream RL methods [26][27].
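The asymmetric-weighting idea can be illustrated in a few lines. This is a sketch of the mechanism as the summary describes it (invert the IS ratio for positive-advantage tokens so low-probability tokens get larger weights, then clip extremes); the paper's exact objective, clipping bounds, and batch-level details will differ.

```python
# Sketch of ASPO-style asymmetric importance weighting. Clip bounds
# (0.5, 2.0) are illustrative placeholders, not the paper's values.

def aspo_weight(pi_new, pi_old, advantage, low=0.5, high=2.0):
    ratio = pi_new / pi_old            # standard IS ratio
    if advantage > 0:
        ratio = 1.0 / ratio            # invert: rare tokens get upweighted
    return min(max(ratio, low), high)  # dual clipping bounds the extremes

# A low-probability token with positive advantage now gets a weight > 1
# (it would get < 1 under vanilla IS):
w = aspo_weight(pi_new=0.1, pi_old=0.4, advantage=+1.0)
```

Under vanilla IS the same token's weight would be 0.1/0.4 = 0.25, shrinking its update; inversion flips that to 4.0, and the clip caps it at 2.0, which is exactly the "stronger updates for low-probability positive tokens, bounded for stability" behavior described above.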
Sora2 isn't so hot anymore! This homegrown AI video model can already generate as you watch, with fast generation and good interactivity
量子位· 2025-10-15 10:20
Core Viewpoint
- The article emphasizes that Baidu's Steam Engine has achieved a significant leap in AI video generation technology, moving from traditional short-video creation to real-time, interactive, long-form video production, thus redefining the creative process in AI video generation [5][9][44].

Group 1: Technological Advancements
- Baidu's Steam Engine is the first to achieve integrated audio and video generation in Chinese, marking a milestone in the AI video generation field [5][61].
- The model supports real-time interaction, allowing users to pause and modify video generation on the fly, in contrast with existing models that require lengthy waits for output [6][15][42].
- The introduction of autoregressive diffusion models enables low-cost, real-time generation and interaction, significantly enhancing the efficiency and quality of video output [45][47].

Group 2: User Experience and Accessibility
- Users can generate long videos simply by uploading a single image and providing a prompt, drastically lowering the barrier to entry for video creation [18][56].
- The platform allows real-time previews and modifications, enabling a more engaging and participatory creative process [49][56].
- The system's design caters to non-professionals, making it accessible to a broad audience without extensive video-editing skills [55][58].

Group 3: Market Position and Future Implications
- Baidu's Steam Engine has positioned itself as a leader in the AI video generation market, achieving the highest score on the VBench-I2V global ranking for video generation models [61][62].
- These advances signify a shift from fragmented video generation to continuous storytelling, indicating a new era in AI content creation that emphasizes collaboration and interactivity [63][64].
- The technology is expected to extend its applications across sectors including e-commerce, live streaming, education, and film production, enhancing the overall utility of AI-generated content [58][59].
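The chunk-by-chunk autoregressive generation with mid-stream interaction described above can be sketched abstractly. Everything here is a toy stand-in: `generate_chunk` replaces the actual autoregressive-diffusion step, and strings stand in for video clips; the point is only the control flow that makes pause-and-modify possible.

```python
# Toy sketch of chunkwise autoregressive video generation: each chunk
# conditions on all chunks generated so far, and the prompt may change
# between chunks (the "pause and modify" interaction).

def generate_chunk(history, prompt):
    """Stand-in for one autoregressive-diffusion step over a short clip."""
    return f"clip{len(history)}:{prompt}"

def generate_video(prompts_by_chunk):
    history = []
    for prompt in prompts_by_chunk:   # user may swap the prompt mid-stream
        history.append(generate_chunk(history, prompt))
    return history

video = generate_video(["a cat walks", "a cat walks", "the cat jumps"])
```

Because each step conditions only on what exists so far, a prompt change at chunk 3 never requires regenerating chunks 1 and 2, which is why this layout supports real-time previews where one-shot generation cannot.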
AI playing jigsaw puzzles sharply boosts visual understanding: a label-free post-training paradigm for multimodal large models that moves beyond text-centric training
量子位· 2025-10-15 10:20
Contributed by the VisualJigsaw team | QbitAI WeChat official account QbitAI

In the post-training wave for multimodal large models, reinforcement-learning-driven paradigms have become a key route to improving reasoning and general capability.

However, most existing methods remain text-centric, with the visual side passively serving as an auxiliary input signal. By contrast, we argue that revisiting the potential of visual self-supervised learning at the post-training stage, and designing vision-centric post-training, is equally critical for deepening a multimodal large model's fine-grained understanding of visual information itself.

To this end, the latest paper from MMLab at Nanyang Technological University, "Visual Jigsaw Post-Training Improves MLLMs", proposes a brand-new post-training task for multimodal large models: Visual Jigsaw.

It recasts the classic self-supervised jigsaw task as a core objective of the post-training stage, letting the model explicitly strengthen its own visual perception and understanding without extra annotations or any visual generation module. Its effectiveness is verified across three visual modalities: images, video, and 3D.

Visual Jigsaw: Method Overview

The concrete Visual Jigsaw task for each visual modality is designed as follows.

Image Jigsaw: an image is divided in 2D space into equal-sized sub-images; after shuffling, the model must restore the correct spatial ...
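The Image Jigsaw task lends itself to a minimal sketch: shuffle equal-sized patches, ask the model for the permutation that restores order, and score the fraction of correctly placed patches. The fraction-correct reward here is an assumption for illustration; the paper defines its own objective, and the random shuffler below stands in for the data pipeline, not the MLLM.

```python
# Minimal sketch of an Image Jigsaw objective: shuffle patches, predict
# the restoring permutation, reward the fraction placed correctly.
import random

def make_jigsaw(patches, rng):
    order = list(range(len(patches)))
    rng.shuffle(order)
    shuffled = [patches[i] for i in order]
    return shuffled, order                 # `order` is the ground truth

def jigsaw_reward(predicted_order, true_order):
    hits = sum(p == t for p, t in zip(predicted_order, true_order))
    return hits / len(true_order)

rng = random.Random(0)
patches = ["p0", "p1", "p2", "p3"]         # e.g. a 2x2 split of one image
shuffled, truth = make_jigsaw(patches, rng)
perfect = jigsaw_reward(truth, truth)      # 1.0 when fully restored
```

Note that the supervision signal comes entirely from the shuffle itself, which is why the task needs no human annotations and no visual generation module.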
Boston Dynamics' robot dog is back in action! "Five legs" working in concert
量子位· 2025-10-15 10:20
Core Insights
- The article discusses advancements in Boston Dynamics' Spot robot, which can lift and manipulate a 15 kg tire in just 3.7 seconds, showcasing its dynamic whole-body manipulation capabilities [3][31].

Group 1: Dynamic Whole-Body Manipulation
- The method combines sampling and learning for dynamic whole-body manipulation, utilizing reinforcement learning and sampling-based control to enable coordinated tasks involving arms, legs, and torso [11][12].
- A hierarchical control approach divides the control problem into two complementary layers: a low layer for direct motor torque control and a high layer for task-specific strategies [12][13].

Group 2: Task Execution and Control Strategies
- For tasks like tire alignment and stacking, the system uses sampling-based control to simulate potential future scenarios and discover optimal strategies [14].
- Reinforcement learning is applied to maintain stability during rolling tasks, capturing the necessary dynamic features and reactive control mechanisms [15][26].

Group 3: Performance and Efficiency
- Spot's tire-manipulation performance exceeds what traditional static assumptions allow, handling weights beyond its 11 kg peak static lifting capacity [35].
- Dynamic coordination of movements allows the robot to efficiently perform tasks that were previously limited to slower, static methods [36][33].

Group 4: Simplification of Control Problems
- Separating high-level and low-level control significantly simplifies the control challenge, allowing the high-level controller to focus on task completion without reasoning about joint torques or stability constraints [37][38].
- Learned motion abstractions enable the high-level controller to operate in a simplified action space, enhancing computational feasibility and task-execution efficiency [38].
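The high-level sampling-based control described above can be sketched as a random-shooting loop: sample candidate action sequences, roll each out in a cheap simulated model, and execute the lowest-cost plan. The 1-D dynamics and cost function below are stand-ins for illustration, not Spot's actual model or Boston Dynamics' planner.

```python
# Toy random-shooting sketch of sampling-based control: evaluate many
# sampled plans in a cheap model, keep the best. Dynamics are a 1-D
# integrator; cost = distance to target + a small effort penalty.
import random

def rollout_cost(actions, state=0.0, target=10.0):
    """Simulate the plan in a trivial model and return its cost."""
    for a in actions:
        state += a
    return abs(target - state) + 0.01 * sum(a * a for a in actions)

def sample_best_plan(n_samples=256, horizon=5, rng=None):
    rng = rng or random.Random(0)
    best, best_cost = None, float("inf")
    for _ in range(n_samples):
        plan = [rng.uniform(-3.0, 3.0) for _ in range(horizon)]
        cost = rollout_cost(plan)
        if cost < best_cost:
            best, best_cost = plan, cost
    return best, best_cost

plan, cost = sample_best_plan()
```

The split the article describes maps onto this sketch: the high level only searches over abstract actions and costs, while torque-level stability is assumed to be handled by a separate low-level controller underneath.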