机器之心
TPAMI | DC-SAM: Breaking SAM's Interaction Constraints with Cycle-Consistency-Based In-Context Segmentation for Images and Videos
机器之心· 2026-01-20 04:51
In-context segmentation aims to automate the segmentation of specified targets by following reference examples. SAM's excellent zero-shot generalization provides a strong foundation for this task, yet applying it here still depends on constructing prompts (such as points or boxes). That requirement not only limits the automation efficiency of batch inference, but also makes it hard for the model to maintain spatio-temporal consistency on complex, continuous videos.

In the IEEE TPAMI paper "DC-SAM: In-Context Segment Anything in Images and Videos via Dual Consistency," researchers from Beijing University of Posts and Telecommunications, Nanyang Technological University, and other institutions not only establish DC-SAM, a unified and efficient framework for in-context segmentation of both images and videos, but also construct IC-VOS, the first benchmark for in-context video segmentation. The team proposes a prompt-tuning-based "cycle consistency" mechanism: positive and negative dual branches cooperate with cycle-consistent attention, combined with a Mask-Tube strategy, to adapt SAM and SAM2 to image and video in-context segmentation in a unified, efficient way.

Experiments show that DC-SAM achieves SOTA performance on multiple benchmarks: 55.5 mIoU on COCO-20, and on Pascal-5 reaching ...
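As a rough intuition for the cycle-consistency idea (this is an illustrative toy, not the paper's exact formulation; all shapes and variable names here are assumptions), attention can be run from query features to support features and back, and tokens whose round trip returns to themselves can be treated as reliable prompt cues:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Illustrative toy features: N support tokens and M query tokens.
support = rng.standard_normal((5, 8))   # (N, d)
query = rng.standard_normal((7, 8))     # (M, d)

# Forward attention (query -> support) and backward attention (support -> query).
fwd = softmax(query @ support.T)        # (M, N), rows sum to 1
bwd = softmax(support @ query.T)        # (N, M), rows sum to 1

# Cycle map query -> support -> query; a query token whose round trip
# lands back on itself is "cycle-consistent".
cycle = fwd @ bwd                       # (M, M), still row-stochastic
consistent = cycle.argmax(axis=1) == np.arange(len(query))
```

Since `fwd` and `bwd` are both row-stochastic, their product is too, so `cycle` can be read as a round-trip transition matrix over query tokens.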
25 Million Matches in One Week: These "AI Bots" Have Human Gamers Rattled
机器之心· 2026-01-20 04:51
Core Insights
- The article discusses the innovative integration of AI in the gaming industry, specifically highlighting the success of Giant Network's game "Supernatural Action Team," which has achieved over 1 million concurrent users and dominated app store rankings [1][4][8]
- The game features AI-controlled characters that interact with players in real time, creating a dynamic and unpredictable gaming experience, which contrasts with traditional scripted NPCs [3][12][14]
- The implementation of AI in core gameplay has proven to enhance player engagement and satisfaction, challenging the notion that AI could negatively impact gaming experiences [23][24]

AI Integration in Gaming
- "Supernatural Action Team" is the first major game in China to deeply integrate AI models into its core gameplay, allowing AI to act as both allies and adversaries [8][12]
- The AI characters are designed to adapt their behavior based on player actions, creating a more immersive and strategic gaming environment [13][16]
- The game has successfully conducted nearly 25 million matches within a week of the new AI feature's launch, demonstrating the system's stability and scalability [7][21]

Industry Trends and Challenges
- The gaming industry is increasingly welcoming AI, with applications ranging from asset generation to real-time matchmaking, but most implementations remain in a "safe zone" where AI does not directly affect core gameplay [10][11]
- Concerns about AI's impact on player experience have led many developers to hesitate in integrating AI into core gameplay, with a significant percentage of players expressing skepticism about AI's role in gaming [22][23]
- "Supernatural Action Team" provides a counterexample, showing that when AI is effectively integrated into gameplay, it can enhance the overall experience and be well received by players [24]

Future Implications
- The successful integration of AI in "Supernatural Action Team" may pave the way for new content-generation methods in gaming, where AI becomes a fundamental part of gameplay rather than just a tool [24]
- The article suggests that the future of gaming may involve more frequent and interactive AI-driven experiences, opening new avenues for player engagement and industry growth [24]
Robots Finally "Understand" Housework! Berkeley's MomaGraph Lets Robots Do Chores Like Humans
机器之心· 2026-01-19 08:54
Picture an everyday scene: you ask your home robot to "boil a kettle of water," and it freezes on the spot. Where is the kettle? Should it use tap water or filtered water? Plug it in first, or press the switch first? And how does it tell when the water has boiled? Chores that come as naturally to humans as breathing have been a real struggle for past robots: they forget to plug things in, fail to find the kettle, or even mistake a cabinet handle for a switch and press it at random.

MomaGraph, a new technique jointly developed by UC Berkeley and the University of Maryland, aims to end these "artificially unintelligent at housework" moments for good. The algorithm not only lets robots truly understand the order in which steps must be done, but has also successfully completed real chores on the 星动纪元 (Robot Era) Xingdong Q5: opening cabinets, opening a microwave, turning on a TV, and switching off lights.

1. Background: the three sticking points that keep home robots from doing chores well

Home mobile-manipulation robots (for example, robots that open windows or heat milk for you) must simultaneously "watch the road" (navigation) and "use their hands" (manipulation), but past techniques suffered three key sticking points that kept robots from doing chores well:

Sticking point 1: knowing "where," not "how to use." To open a window, a traditional technique may know only that "the window is to the right of the desk" (a spatial relation) but not that "the window handle controls opening and closing" (a functional relation): like knowing your phone is in your pocket but not that the power button turns it on, so you cannot actually use it.

Sticking point 2: recognizing "pictures," not "changes." Traditional models treat the scene as ...
Authors Decide Whether Reviewers Can Use AI? ICML 2026 Unveils a Brand-New Review Policy
机器之心· 2026-01-19 08:54
Core Viewpoint
- ICML 2026 has introduced a new review type selection mechanism allowing authors to decide whether to permit the use of large language models (LLMs) in the review process [3][9]

Group 1: Review Policy Changes
- Two policies have been established: Policy A strictly prohibits the use of any LLMs during the review process, while Policy B allows their use with specific restrictions [4]
- Allowed actions under Policy B include using LLMs to assist in understanding the paper, language polishing of review comments, and querying LLMs for strengths or weaknesses of the paper [7][9]
- The choice of whether to allow LLMs in the review process is now in the hands of the authors, marking a significant shift from previous practices where the decision was primarily up to reviewers [9]

Group 2: Implementation Challenges
- There are concerns regarding the enforcement of the new regulations on LLM usage, as past experiences have shown a prevalence of AI-generated reviews [11][13]
- A study on ICLR 2026 revealed that 21% of review comments were entirely generated by AI, indicating a widespread reliance on AI tools in the review process [11]
- The effectiveness of ICML's new rules may be limited, as compliance by reviewers cannot be guaranteed, raising questions about the integrity of the review process [14][15]

Group 3: Author Control and Options
- Authors now have the option to refuse LLM-assisted reviews entirely, providing a blanket opt-out that may address concerns about trust in the review process [16]
He Rented 8 H100s and Reproduced DeepSeek's mHC, With Results Even More Striking Than the Official Report
机器之心· 2026-01-19 08:54
Core Insights
- DeepSeek's mHC architecture addresses numerical instability and signal explosion issues in large-scale training by extending traditional Transformer residual connections into a multi-stream parallel architecture [1][5]
- The mHC model has garnered significant attention in the AI community, with successful reproductions yielding better results than the original DeepSeek paper [5][6]

Group 1: mHC Architecture
- The mHC model utilizes the Sinkhorn-Knopp algorithm to constrain the connection matrix to a doubly stochastic matrix manifold, ensuring stability during training [1][25]
- Traditional residual connections in Transformers have remained unchanged since 2016, relying on a single information flow, while mHC introduces multiple parallel streams for enhanced expressiveness [9][14]
- The mHC architecture maintains stability by preventing signal amplification, which can lead to catastrophic failures in large models [20][28]

Group 2: Experimental Results
- In experiments with 10M parameters, the original hyper-connection (HC) model exhibited a signal amplification of 9.2 times, while mHC maintained stability with an amplification of 1.0 [36][61]
- Scaling up to 1.7B parameters, the HC model showed an alarming amplification of 10,924 times, highlighting the instability associated with larger models [54][66]
- The experiments demonstrated that while HC models accumulate instability, mHC models consistently maintain structural integrity across different training conditions [70][71]

Group 3: Implications and Future Directions
- The findings suggest that while traditional residual connections are stable, they may not be optimal for larger models, as mHC offers a balance between expressiveness and stability [57][58]
- Future research aims to explore scaling laws further, particularly at the 10B parameter scale, where significant amplification trends are anticipated [101]
- The mHC approach not only mitigates instability but also eliminates the risk of catastrophic failures in large-scale training scenarios [93][96]
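The Sinkhorn-Knopp projection mentioned above is simple to sketch: alternately normalize the rows and columns of a positive matrix until both sum to one. This is a minimal illustration of the algorithm itself, assuming a small dense matrix; it is not DeepSeek's implementation, and the function name and iteration count are choices made here:

```python
import numpy as np

def sinkhorn_knopp(M, n_iters=100, eps=1e-12):
    """Project a positive matrix toward the doubly stochastic manifold by
    alternately normalizing its rows and columns (Sinkhorn-Knopp)."""
    M = np.asarray(M, dtype=np.float64)
    for _ in range(n_iters):
        M = M / (M.sum(axis=1, keepdims=True) + eps)  # rows sum to 1
        M = M / (M.sum(axis=0, keepdims=True) + eps)  # columns sum to 1
    return M

# A doubly stochastic mixing matrix preserves total signal mass, which is
# the property that prevents the residual streams from being amplified.
A = sinkhorn_knopp(np.random.default_rng(0).random((4, 4)) + 0.1)
print(A.sum(axis=0))  # each column sums to ~1
print(A.sum(axis=1))  # each row sums to ~1
```

Because every row and column of `A` sums to one, mixing the parallel residual streams through `A` keeps the summed signal constant across layers instead of letting it grow geometrically with depth, which is the amplification failure mode measured in the HC experiments.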
Robots Learn Manipulation Skills by Watching Videos: The Newly Released CLAP Framework from Tsinghua and Collaborators Pulls It Off
机器之心· 2026-01-19 03:51
Core Insights
- The article discusses the introduction of the Contrastive Latent Action Pretraining (CLAP) framework, developed by Tsinghua University in collaboration with Stardust Intelligence, HKU, and MIT, which enables robots to learn skills directly from videos [2][3]

Group 1: Challenges in Robot Learning
- The article highlights a long-standing issue in robot learning known as "data scarcity," where there is an abundance of human behavior videos online but a lack of data specifically for training robots [3]
- The root cause of this data asymmetry is the high cost and inefficiency associated with collecting robot operation data, which requires expensive hardware, specialized environments, and extensive manual labeling [3]
- Traditional latent action models face the "visual entanglement" problem, where models learn irrelevant visual noise instead of actual manipulation skills [3]

Group 2: Innovations of the CLAP Framework
- The CLAP framework addresses the technical bottleneck of aligning the motion space extracted from videos with the robot's action space, effectively avoiding the visual entanglement issue [3]
- By utilizing contrastive learning, CLAP maps state transitions in videos to a quantifiable, physically executable action codebook [3]
- The framework allows robots to learn skills from vast amounts of video data available on platforms like YouTube and Douyin, significantly expanding the scale of usable training data [4]

Group 3: Training Methodology
- The research team trained the CLAP framework using two modeling paradigms: CLAP-NTP, an autoregressive model excelling in instruction following and object generalization, and CLAP-RF, a policy based on Rectified Flow aimed at high-frequency, precise control [4][10]
- The framework employs a knowledge matching (KM) regularization strategy to mitigate catastrophic forgetting during the fine-tuning process, ensuring that robots retain previously learned skills while acquiring new ones [4][10]

Group 4: Practical Implications
- The long-term value of the CLAP framework lies not only in its technical innovation but also in its potential to accelerate the industrialization of robotics by reducing the cost and time required for deploying robots in various sectors such as services and manufacturing [6]
- The unified vision-language-action (VLA) framework allows for the effective integration of the precision of robot data with the semantic diversity of large-scale unannotated human video demonstrations [8]

Group 5: Experimental Results
- Extensive experiments demonstrate that CLAP significantly outperforms strong baseline methods, enabling effective skill transfer from human videos to robot execution [12]
- Performance comparisons in real-world tasks show that CLAP-NTP and CLAP-RF achieve higher success rates in various tasks compared to baseline methods, indicating the framework's robustness and effectiveness [14][15]
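As a rough sketch of the contrastive mapping described above, and under the assumption (made here, not taken from the paper) that transitions and codes share a normalized embedding space, an InfoNCE-style objective can score each video state transition against a discrete action codebook. All dimensions, the temperature, and the nearest-code assignment rule are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
batch, dim, codebook_size, temp = 8, 16, 32, 0.07

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def log_softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

# Embeddings of video state transitions (features of frame_t -> frame_{t+1}).
transitions = normalize(rng.standard_normal((batch, dim)))
# A codebook of discrete, physically executable latent actions.
codebook = normalize(rng.standard_normal((codebook_size, dim)))

# InfoNCE-style loss: pull each transition toward its assigned code and
# push it away from every other code.
logits = transitions @ codebook.T / temp   # cosine similarities / temperature
assign = logits.argmax(axis=-1)            # nearest-code assignment
loss = -log_softmax(logits)[np.arange(batch), assign].mean()
```

In a real pipeline `assign` would come from the learned quantizer and `loss` would be backpropagated into both the encoder and the codebook; here the snippet only shows the shape of the objective.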
Does Your Paper Have Novelty? Fudan Built a Novelty-Check System for Top-Conference Papers
机器之心· 2026-01-19 03:51
Core Viewpoint
- The article discusses the development of OpenNovelty, an automated novelty analysis system designed to enhance the academic review process by providing verifiable evidence for claims of novelty in research papers [4][25]

Group 1: System Overview
- OpenNovelty is a collaboration between Fudan University's NLP research team and the academic search platform WisPaper, aimed at addressing the challenges of assessing novelty in academic submissions [4]
- The system emphasizes the need for verifiable evidence when judging the novelty of a paper, requiring that any claim of insufficient novelty be supported by traceable evidence from published literature [7][25]

Group 2: Analysis Process
- The analysis process consists of four steps:
  1. Core information extraction from the paper's title, abstract, and introduction [9]
  2. Literature retrieval and filtering to generate a candidate set of relevant papers [11]
  3. Hierarchical analysis and evidence comparison to assess the novelty claims [14]
  4. Generation of a novelty investigation report that consolidates findings and provides traceable evidence [20][21]

Group 3: System Functionality
- The system utilizes a query expansion mechanism to generate multiple semantically equivalent variations of extracted information, ensuring comprehensive literature retrieval [7]
- It categorizes the comparison results into three outcomes: can refute, cannot refute, and unclear, based on the evidence found [15][17][19]

Group 4: Impact and Utility
- OpenNovelty serves as an auxiliary tool for reviewers, helping them navigate the literature landscape and focus on critical aspects of the review process [26]
- For authors, it acts as a self-check tool to verify the novelty of their research and identify any overlooked relevant literature [27]
- The system aims to provide a verifiable path for novelty assessment, enhancing accountability in academic publishing [27]

Group 5: Limitations and Future Directions
- The team acknowledges the system's limitations, emphasizing that it is a supportive tool rather than a decision-making entity, with final judgments still resting with human reviewers [29][30]
- OpenNovelty is positioned as a third-party auditing system, intended to clarify evidence during the final decision-making phase of the review process [31]
Double Breakthrough in Effectiveness and Performance: Kuaishou's End-to-End Generative OneSug Framework Accepted to AAAI 2026
机器之心· 2026-01-19 01:27
When you search for "apple" on an e-commerce platform, should the system suggest "fruit" or "phone," or jump straight to a brand's flagship store? A single word can carry completely different purchase intents, and how accurately the system resolves them directly affects both the user's search experience and the platform's conversion efficiency.

Query suggestion is a key feature of modern e-commerce search: by recommending related queries in real time as the user types, it helps users quickly clarify their intent, improving search experience and conversion efficiency. Traditional approaches use a multi-stage cascading architecture (MCA), which strikes some balance between efficiency and effectiveness, but inconsistent objectives across stages and poor recall on long-tail queries limit further gains in system performance.

To address these problems, Kuaishou proposes OneSug, the industry's first end-to-end generative unified query suggestion framework, which unifies the retrieval, pre-ranking, and ranking stages in a single generative model, significantly improving suggestion quality and system efficiency and delivering gains in both business metrics and user experience in Kuaishou's e-commerce scenario.

The paper, "OneSug: The Unified End-to-End Generative Framework for E-commerce Query Suggestion," has been accepted to AAAI 2026, a top AI conference.

Paper link: https://arxiv.org/abs ...
CES 2026 Trends Become Reality: The RK182X Compute Engine Reshapes Industries as Rockchip's AI Ecosystem Conference Builds a Deployment Ecosystem
机器之心· 2026-01-19 01:27
The following article is from 瑞芯微电子 (Rockchip Electronics); author: 瑞芯微电子. Published by 机器之心.

CES 2026, the annual "tech Spring Festival Gala," wrapped up last week. Its theme this year, "Physical AI," shows a global trend of pushing AI from the virtual world into real applications, making intelligent living tangible through diverse physical forms such as consumer electronics, robots, and smart vehicles.

As a leading Chinese AIoT chip company, Rockchip plays an important role in this wave. The technical breakthrough of the RK182X series, the world's first 3D-architecture co-processor, provides strong hardware and compute support for the global development of "Physical AI" and advances the AIoT 2.0 era of redoing every industry with AI. Rockchip will also host an AI software ecosystem conference, leveraging its ecosystem of more than 5,000 global customers across AIoT industries to build a bridge between AI software and the market.

Rockchip RK182X: the compute engine of AIoT 2.0, an essential evolution from "feature phone" to "intelligent agent." Measured data show that the RK182X running the Qwen2.5-3B model exceeds an output speed of 100 tokens per second, three times that of comparable products on the market; on multimodal vision-language model tasks, Rockchip is also the first to support Qwen3-VL- ...
AAAI 2026 | Meeting in Singapore to Tackle the Core Challenges of the AI Era
机器之心· 2026-01-18 06:48
Group 1
- The core theme of the events is the exploration of human agency in the context of AI, focusing on how to preserve meaningful human decision-making rights amidst the evolving landscape of artificial intelligence [2][4]
- The first seminar, titled "The Right to Work, Learn, Own & Choose," aims to integrate the technical AI community with AI governance to promote respect for human agency and protect rights related to work, learning, ownership, and choice [2][4]
- The event features prominent speakers from various institutions, including Ashok Goel from Georgia Tech and Jungpil Hahn from the National University of Singapore [4]

Group 2
- The second seminar, "Agentic AI meets Autonomous Agents and Multiagent Systems," focuses on advancements in intelligent agents based on large language models (LLMs) and the lessons learned in building and deploying these systems [11][13]
- This seminar emphasizes the transition of modern "Agentic AI" systems from demonstrations to practical deployment, requiring capabilities in long-term planning, reliable tool usage, and robust interaction with humans and environments [13][14]
- Notable speakers include Leslie Kaelbling from MIT and Bo Li from the University of Illinois at Urbana-Champaign, contributing to discussions on the long-term challenges in robotics and multi-agent systems [17]