机器之心
AAAI 2026 Oral | Kuaishou Proposes CroPS, a New "Retrieval Data Engine" That Breaks the Search Filter Bubble
机器之心· 2026-01-12 05:01
Core Insights
- The article introduces CroPS (Cross-Perspective Positive Samples), a new retrieval data engine from Kuaishou's search team, aimed at improving short-video search by addressing the limitations of traditional self-reinforcing training paradigms that rely heavily on historical click data [2][10].

Group 1: Problem Identification
- Current vector retrieval models in the industry often depend on historical user interaction data, creating a self-reinforcing cycle that narrows search results and limits exposure to diverse content [6].
- This mechanism produces significant sample bias: high-quality long-tail content is systematically excluded from the positive samples, so the model's retrieval scope becomes conservative and repetitive [6][7].
- Users experience a lack of novelty in search results, making it difficult to satisfy exploratory needs [7].

Group 2: CroPS Framework
- CroPS introduces a multi-dimensional positive-sample enhancement engine that draws on user query behavior, recommendation-system feedback, and knowledge from large language models (LLMs) to enrich the semantic space [11].
- The framework captures the continuity of user intent by analyzing query rewrites, allowing the system to correct semantic biases by incorporating successful interactions from related queries [12].
- It breaks down the barrier between search and recommendation systems, enabling the retrieval model to leverage diverse content that users may not have actively searched for [15].
- CroPS employs LLMs to generate high-quality synthetic samples when existing content does not cover certain queries, effectively expanding the model's knowledge base [16][17].

Group 3: Hierarchical Labeling and Loss Function
- The Hierarchical Label Assignment (HLA) strategy accounts for the differing reliability of positive samples from different sources, allowing the model to prioritize more reliable samples during training [19].
- The H-InfoNCE loss function strengthens the model's ability to distinguish high-priority from low-priority samples, aligning the learning objective with HLA's hierarchical logic (a hedged sketch of such a loss appears after this summary) [23][28].

Group 4: Experimental Results
- Offline experiments showed that CroPS improved recall by 9.5% on user-click datasets and 7.1% on user query-change datasets compared with the strongest baseline [30].
- In large-scale A/B testing, CroPS delivered significant business gains, with a 40.9% increase in ratio rank and a 44.3% increase in ratio show for dense models [31].
- The click-through rate (CTR) increased by 0.869%, and the long playback rate (LPR) rose by 0.483%, indicating improved content relevance and quality [36].

Group 5: Future Directions
- The Kuaishou search team plans to explore integrating CroPS with generative retrieval methods to further tap the potential of large language models in search [34].
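The digest does not reproduce the paper's H-InfoNCE formula, so the snippet below is only a minimal Python sketch of one way a hierarchy-aware contrastive loss could down-weight less reliable positive tiers on top of a standard in-batch InfoNCE setup. The tier encoding, the temperature, and the geometric weighting are illustrative assumptions, not Kuaishou's actual implementation.

```python
import torch
import torch.nn.functional as F

def hierarchical_info_nce(query_emb, doc_emb, tier, temperature=0.07):
    """Sketch of a hierarchy-aware InfoNCE loss.

    query_emb: (B, D) query embeddings
    doc_emb:   (B, D) positive document embeddings, one per query
    tier:      (B,) integer reliability tier per positive
               (assumed: 0 = clicked result, 1 = rewritten-query click,
                2 = recommendation feedback, 3 = LLM-synthesized)
    In-batch documents act as negatives, as in standard InfoNCE;
    higher (less reliable) tiers are down-weighted.
    """
    q = F.normalize(query_emb, dim=-1)
    d = F.normalize(doc_emb, dim=-1)
    logits = q @ d.t() / temperature                    # (B, B) similarity matrix
    targets = torch.arange(q.size(0), device=q.device)  # diagonal entries are positives
    per_sample = F.cross_entropy(logits, targets, reduction="none")
    weights = torch.pow(0.5, tier.float())              # assumed geometric trust decay
    return (weights * per_sample).sum() / weights.sum()
```

Under this assumed scheme, positives from more trusted sources dominate the gradient while cross-system and synthetic positives still contribute, which matches the stated intent of prioritizing reliable samples without discarding the rest.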
Top AI Models Actually Lose to Three-Year-Olds: The BabyVision Test Exposes a Hard Flaw in Multimodal Models
机器之心· 2026-01-12 05:01
Core Viewpoint
- The article examines the limitations of current large models in visual understanding, arguing that while they excel at language and text reasoning, their visual capabilities remain underdeveloped, lagging behind even a three-year-old child [3][4][49].

Group 1: BabyVision Overview
- UniPat AI, in collaboration with Sequoia China and several research teams, has launched a new multimodal understanding evaluation set called BabyVision to assess the visual capabilities of AI models [3][4].
- BabyVision aims to establish a new paradigm for AI training, evaluation, and real-world application, focusing on producing measurable, iterable visual capabilities [4][49].

Group 2: Evaluation Methodology
- BabyVision includes a direct comparison experiment in which 20 vision-centric tasks are given both to children of different ages (3, 6, 10, and 12 years) and to top multimodal models [7].
- The evaluation strictly controls for language dependency, requiring answers to be derived solely from visual information [8].

Group 3: Results and Findings
- Most models score significantly below the average performance of three-year-old children; the best model, Gemini3-Pro-Preview, reaches only 49.7%, still about 20 percentage points below the performance of six-year-olds [15][21].
- Human participants achieved 94.1% accuracy on the BabyVision-Full test, underscoring the substantial gap between human and model performance [20][21].

Group 4: Challenges Identified
- The study identifies four core challenges in visual reasoning for AI models: observing non-verbal details, maintaining visual tracking, lacking spatial imagination, and difficulty with visual pattern induction [27][30][36][39].
- These challenges point to a systemic lack of foundational visual capability in current models rather than isolated deficiencies [23].

Group 5: Future Directions
- The article suggests that shifting visual reasoning tasks toward visual operations, as demonstrated in BabyVision-Gen, may help close the gap in visual understanding [42].
- The ongoing development of BabyVision aims to guide the evolution of multimodal large models by breaking visual understanding down into 22 measurable atomic capabilities [49].
Praised by Jim Fan! The World-No. 1 Qianxun Intelligent Spirit v1.5 Is Officially Open-Sourced!
机器之心· 2026-01-12 01:20
Core Insights
- The article covers significant advances in embodied intelligence, highlighting the emergence of the Spirit v1.5 model from Qianxun Intelligent, which has surpassed previous models such as Pi0.5 in performance [3][4][15].

Group 1: Key Developments in Embodied Intelligence
- 2025 marked a breakthrough year for embodied intelligence, with hardware advances and foundational models defining the intelligence ceiling of the technology [3].
- Spirit v1.5 was open-sourced on January 12, 2026, and took the top position in RoboChallenge's Table30 ranking, outperforming Pi0.5 [4][8][15].
- The open-source release of Spirit v1.5 includes model weights, inference code, and usage examples, enabling public verification and community innovation [6][33].

Group 2: Performance Metrics and Evaluation
- Spirit v1.5 scored 66.09 on the RoboChallenge leaderboard versus 61.84 for Pi0.5, a significant performance improvement [11][14].
- The RoboChallenge platform focuses on real-world physical testing, with a task set designed to stress multiple dimensions of model capability, including precise 3D positioning and multi-stage tasks [15].

Group 3: Data Utilization and Training Paradigms
- Spirit v1.5's success is attributed to a fundamental restructuring of the robot pre-training data paradigm, moving away from "clean data" toward a more diverse, open-ended data collection strategy [18][20].
- Training involved continuous skill flows and internalized error-correction capabilities, allowing the model to adapt dynamically to unexpected challenges [21][22].

Group 4: Implications for the Industry
- The open-source nature of Spirit v1.5 marks a significant shift for the industry, providing a competitive alternative to closed-source models and broadening access to high-performance robotics technology [35][39].
- The model's development and open-source release are seen as a pivotal moment for Chinese teams in the global AI landscape, transitioning from followers to leaders in defining core technological paths [41][42].
Sakana Sets AIs "Hunting" Each Other, and They Begin Convergent Evolution
机器之心· 2026-01-11 10:03
Core Insights
- The article covers a collaboration between Sakana AI and MIT on a research project called Digital Red Queen (DRQ), which explores self-evolving assembly code through a competitive programming environment [2][3].
- The research uses the classic programming game "Core War" to create a dynamic adversarial environment in which AI-written programs, referred to as "warriors," evolve by competing against one another [3][7].

Group 1: Research Methodology
- The DRQ method lets programs evolve by continuously adapting to changing opponents rather than to static benchmarks, producing robust and versatile "warriors" [3][8] (a toy sketch of such a loop appears after this summary).
- The study positions "Core War" as a sandbox for examining the dynamics of artificial systems in competitive settings such as cybersecurity [7][8].

Group 2: Evolutionary Dynamics
- The research shows that the dynamic adversarial process pushes models toward increasingly general strategies, demonstrating convergent evolution: different programs exhibit similar high-performance behaviors [8][26].
- As DRQ runs accumulate, the warriors become more robust and generalizable, showing a trend toward phenotypic convergence, where behaviors become similar despite differing underlying implementations [29][30].

Group 3: Implications and Future Work
- The findings suggest that the DRQ algorithm, combined with the "Core War" environment, could offer valuable insight into the nature of adversarial competition and the evolution of AI systems in real-world scenarios [34].
- Future research may explore richer settings in which many agents co-evolve simultaneously, better simulating real-world dynamics where large populations adapt in parallel [35].
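The summary above does not spell out DRQ's actual update rule, so the following is only a toy Python sketch of the general pattern it describes: candidates are scored against a shifting pool of recent champions rather than a fixed benchmark. The `mutate` and `battle` callables, the pool size, and the replacement rule are placeholders; the real work evolves Core War assembly "warriors" rather than abstract programs.

```python
import random

def evolve_against_moving_target(seed_programs, mutate, battle,
                                 rounds=50, pool_size=8):
    """Toy co-evolutionary loop with a moving target (a "Red Queen" dynamic).

    mutate(program) -> a new candidate program
    battle(a, b)    -> score in [0, 1] for program a against program b
    Each round, a mutant is scored against the current champion pool,
    and the pool itself turns over, so the selection pressure keeps moving.
    """
    champions = list(seed_programs)[:pool_size]

    def pool_fitness(program):
        # Fitness is the mean result against the current champion pool.
        return sum(battle(program, c) for c in champions) / len(champions)

    for _ in range(rounds):
        candidate = mutate(random.choice(champions))
        weakest = min(champions, key=pool_fitness)
        if pool_fitness(candidate) > pool_fitness(weakest):
            champions.remove(weakest)
            champions.append(candidate)   # the opponent pool itself shifts
    return champions
```

Because the champion pool keeps turning over, a candidate that only exploits one specific opponent tends to be displaced later, which is the intuition behind the reported drift toward general strategies and the convergent behavior described above.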
Challenging GRPO, NVIDIA Proposes GDPO, Specializing in Multi-Reward Optimization
机器之心· 2026-01-11 04:00
But as language models grow more capable, users' expectations are changing as well: models should not only answer correctly, but also behave in ways that match diverse human preferences across many different scenarios. To this end, reinforcement learning training pipelines have begun to introduce multiple reward signals, each corresponding to a different preference, which jointly steer the model toward the desired behavior.

A new NVIDIA paper, however, argues that GRPO may not be the best choice for multi-reward optimization.

Specifically, in multi-reward settings, GRPO normalizes different combinations of rewards into the same advantage value. This weakens the training signal and lowers the achievable reward.

To address this, the authors propose a new policy optimization method: Group reward-Decoupled Normalization Policy Optimization (GDPO). By normalizing each reward signal separately, GDPO avoids having different rewards mixed together and "flattened out," better preserving their relative differences. This makes multi-reward optimization more accurate while significantly improving training stability (a toy numerical sketch of this contrast follows below).

机器之心 Editorial Team

GRPO is one of the foundational techniques behind DeepSeek-R1's success. Over the past year or two, GRPO and its variants have become widely adopted reinforcement learning algorithms across the industry thanks to their efficiency and simplicity.

Paper title: GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-re ...
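The excerpt describes the GRPO/GDPO contrast only in words, so here is a minimal NumPy sketch of that contrast under stated assumptions: the first function normalizes the summed reward within a group of rollouts (GRPO-style), while the second normalizes each reward channel separately and then combines the per-channel advantages (the decoupled idea attributed to GDPO). The equal-weight combination, the epsilon, and the tiny example are illustrative choices, not the paper's exact formulation.

```python
import numpy as np

def grouped_advantages_summed(rewards):
    """GRPO-style baseline: collapse the reward channels per rollout first,
    then normalize the combined reward within the group.

    rewards: (G, K) array, G rollouts for one prompt, K reward signals.
    Returns: (G,) advantages.
    """
    combined = rewards.sum(axis=1)
    return (combined - combined.mean()) / (combined.std() + 1e-8)

def grouped_advantages_decoupled(rewards):
    """Decoupled variant: normalize each reward channel within the group
    separately, then average the per-channel advantages, so a channel with
    small variance is not drowned out by the others."""
    mean = rewards.mean(axis=0, keepdims=True)          # (1, K)
    std = rewards.std(axis=0, keepdims=True) + 1e-8     # (1, K)
    per_channel = (rewards - mean) / std                # (G, K)
    return per_channel.mean(axis=1)                     # assumed equal weighting

# Tiny worked example: two reward channels on very different scales.
rewards = np.array([[1.0, 0.10],
                    [0.0, 0.12],
                    [1.0, 0.08]])
print(grouped_advantages_summed(rewards))     # second channel barely matters
print(grouped_advantages_decoupled(rewards))  # second channel reorders rollouts 2 and 3
```

In the summed variant the low-variance second channel has almost no effect once the channels are collapsed, whereas in the decoupled variant it changes the relative ranking of the rollouts; this is one way to picture the "flattening" effect the article says GDPO is designed to avoid.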
Is Federated Learning No Longer Safe? New HKU Work in TPAMI Digs into the Inner Workings of Gradient Inversion Attacks
机器之心· 2026-01-11 04:00
Federated Learning (FL) was supposed to be a "savior" for privacy protection, yet its defenses can be breached by Gradient Inversion Attacks (GIA) (a background sketch of the basic attack recipe follows after this excerpt).

Recently, a research team from the University of Hong Kong, the Hong Kong University of Science and Technology (Guangzhou), the Southern University of Science and Technology, Stanford University, and the University of California, Santa Cruz published a major work in the top AI journal IEEE TPAMI, providing a comprehensive taxonomy, theoretical analysis, and experimental evaluation of GIA, together with practical defense guidelines.

Paper title: Exploring the Vulnerabilities of Federated Learning: A Deep Dive into Gradient Inversion Attacks

First author Guo Pengxin is a PhD student at the University of Hong Kong working on federated learning and large-model fine-tuning. Co-first author Wang Runxi is a master's student at the University of Hong Kong working on federated learning and privacy protection. Corresponding author Qu Liangqiong is an assistant professor at the University of Hong Kong whose research spans AI for Healthcare, AI for Science, and federated learning (homepage: https://liangqiong.github.io/).

02 Method Taxonomy: GIA's Three Major ...
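The method taxonomy is cut off in this excerpt, so purely as background on what a gradient inversion attack does, below is a minimal sketch of the classic optimization-based recipe (in the spirit of "Deep Leakage from Gradients"): optimize dummy inputs and labels until their gradients match the gradients a client shared. The tiny linear model, step count, learning rate, and soft-label trick are placeholders; this is not the specific method or taxonomy analyzed in the TPAMI paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def gradient_inversion_sketch(model, shared_grads, input_shape, num_classes,
                              steps=200, lr=0.1):
    """DLG-style reconstruction sketch: recover a client's input by matching
    gradients. The attacker knows the global model (as in standard FL) and
    observes shared_grads, the gradients the client uploaded."""
    dummy_x = torch.randn(input_shape, requires_grad=True)
    dummy_y = torch.randn(1, num_classes, requires_grad=True)  # optimizable soft labels
    optimizer = torch.optim.Adam([dummy_x, dummy_y], lr=lr)

    for _ in range(steps):
        optimizer.zero_grad()
        pred = model(dummy_x)
        loss = (F.softmax(dummy_y, dim=-1) * -F.log_softmax(pred, dim=-1)).sum()
        grads = torch.autograd.grad(loss, model.parameters(), create_graph=True)
        # Gradient-matching objective: squared distance to the shared gradients.
        match = sum(((g - s) ** 2).sum() for g, s in zip(grads, shared_grads))
        match.backward()
        optimizer.step()
    return dummy_x.detach(), dummy_y.detach()

# Hypothetical demo: recover a single 8x8 single-channel "image".
model = nn.Sequential(nn.Flatten(), nn.Linear(64, 10))
x_true, y_true = torch.randn(1, 1, 8, 8), torch.tensor([3])
true_loss = F.cross_entropy(model(x_true), y_true)
shared_grads = [g.detach() for g in torch.autograd.grad(true_loss, model.parameters())]
x_rec, y_rec = gradient_inversion_sketch(model, shared_grads, (1, 1, 8, 8), 10)
```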
No Humanoids, No Dancing: How Has This Company's Embodied AI Sold 4 Million Cups of Coffee in 100+ Cities?
机器之心· 2026-01-11 04:00
Editor | Wu Xin

The new year has barely begun, and the AI industry is already running at full intensity.

At CES, the global bellwether for technology, robots × AI became the real protagonist. Under the neon lights of Las Vegas, China's robot contingent moved to center stage, not by piling up concepts, but by arriving with orders and the speed of large-scale deployment.

CES Innovation Award judge Chris Pereira noted that Chinese manufacturers are rapidly turning emerging technologies into mature products that can be mass-produced, delivered, and sold in global markets.

At the same time, AI is receding into the background to become an underlying product capability; the real competition now comes down to practicality, design, and reliable execution.

On the show floor, the biggest draw was still the "humanoid." Boston Dynamics (now owned by South Korea's Hyundai Motor Group) showed off its new Atlas.

But in the same space, another route was unfolding in parallel. In front of 影智 XBOT's transparent display case, the crowd gathered layer upon layer. This is the world's first embodied robot that can serve a cold and a hot cup at the same time, and currently one of the most concretely deployed forms of embodied intelligence. Some people held up phones to record; others were already discussing what pattern to have printed on their coffee.

The 影智 XBOT Lite series latte-art coffee robot: the world's first embodied robot that serves a cold and a hot cup simultaneously.

Behind the glass, two robotic arms divided the work between them, frothing milk, printing the latte art, and handing over the cups, with movements as fluid as a choreography polished over and over. After 110 seconds, an iced Americano and a hot latte were finished at the same time, and on the surface of the cups ...
After 14 Years at Google, a Chinese Researcher Founds a Visual AI Company, Planning to Raise $50 Million
机器之心· 2026-01-11 02:17
Core Insights
- Two former Google researchers are founding a new visual AI company named Elorian, aiming to develop advanced AI models that can understand and process text, images, videos, and audio simultaneously [1][8].
- The company is in discussions to raise approximately $50 million in seed funding, with Striker Venture Partners potentially leading the round [1].

Group 1: Founders' Background
- Andrew Dai, a former senior AI researcher at Google DeepMind, has 14 years of experience in AI research and management and contributed to the development of the Gemini large model [3].
- Yinfei Yang, a former AI researcher at Apple, has extensive experience with multimodal models and previously worked at Google Research, Amazon, and Redfin, focusing on visual-language representation and multimodal learning [5].

Group 2: Company Vision and Goals
- Elorian's primary goal is to build a multimodal AI model capable of visual understanding and analysis of the real world by processing images, videos, and audio [8].
- While robotics is a potential application area, the company envisions a broader range of applications that it has not yet disclosed [8].
No Manual Annotation Needed: Lightweight Models Rival a 72B Model at Motion Understanding as NVIDIA, MIT, and Others Jointly Release FoundationMotion
机器之心· 2026-01-11 02:17
Core Insights
- The rapid development of video models is hampered by difficulty understanding complex physical motion and spatial dynamics, leading to inaccurate interpretation of object motion [2][6].
- A significant bottleneck is the lack of high-quality motion data: existing datasets are either too small or heavily reliant on expensive manual annotation [3][12].
- FoundationMotion, developed by researchers from MIT, NVIDIA, and UC Berkeley, offers an automated data pipeline that requires no manual labeling and significantly improves motion understanding in video models [4][13].

Data Generation Process
- FoundationMotion operates a four-step automated data-generation pipeline, starting with precise extraction of motion from videos using advanced detection and tracking models [16].
- The system then translates these trajectories into a format that language models can understand, enhancing the model's ability to comprehend object movements (a hedged sketch of this step follows below) [17].
- Finally, it uses GPT-4o-mini to automatically generate high-quality annotations and questions, resulting in a motion-understanding dataset of approximately 500,000 entries [18].

Model Performance
- Data generated by FoundationMotion was used to fine-tune various open-source video models, including NVILA-Video-15B and Qwen2.5-7B, leading to significant performance improvements [21].
- The fine-tuned models surpassed larger models such as Gemini-2.5 Flash and Qwen2.5-VL-72B on multiple motion-understanding benchmarks, demonstrating the impact of high-quality data [26].

Broader Implications
- FoundationMotion's contribution extends beyond benchmark numbers: understanding object motion is crucial for safety and decision-making in autonomous driving and robotics [24].
- The system provides a cost-effective, scalable way for AI to develop an intuitive understanding of the physical world through large-scale video analysis [25].
- This advance is seen as foundational for building true embodied intelligence, enhancing both physical perception and general video-understanding capabilities [26][27].
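The pipeline's second step, translating extracted trajectories into something a language model can read, is only named in the summary, so the snippet below is a hedged sketch of what such a conversion might look like: one tracked bounding-box trajectory is turned into a short textual motion statement. The track format, speed threshold, and sentence template are assumptions, not FoundationMotion's actual annotation format.

```python
def trajectory_to_text(track, fps=30.0, min_speed=5.0):
    """Convert one object track into a short motion description.

    track: dict with
      "label": object class name, e.g. "car"
      "boxes": list of (frame_idx, x_center, y_center, width, height)
               in pixel coordinates, image origin at the top-left.
    Returns a sentence usable as a motion annotation for a language model.
    """
    (f0, x0, y0, *_), (f1, x1, y1, *_) = track["boxes"][0], track["boxes"][-1]
    dt = max((f1 - f0) / fps, 1e-6)
    dx, dy = x1 - x0, y1 - y0
    speed = (dx ** 2 + dy ** 2) ** 0.5 / dt          # pixels per second

    if speed < min_speed:
        return f"The {track['label']} stays roughly still."
    horiz = "to the right" if dx > 0 else "to the left"
    vert = "downward" if dy > 0 else "upward"        # image y grows downward
    direction = horiz if abs(dx) >= abs(dy) else vert
    return (f"The {track['label']} moves {direction} at about "
            f"{speed:.0f} pixels per second over {dt:.1f} seconds.")

# Example with a hypothetical track produced by a detector plus tracker:
track = {"label": "car", "boxes": [(0, 100, 200, 40, 20), (60, 400, 210, 40, 20)]}
print(trajectory_to_text(track))   # "The car moves to the right at about 150 ..."
```

Descriptions of this kind could then be paired with automatically generated questions (the GPT-4o-mini step the summary describes) to build the kind of large motion-understanding dataset reported above.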
Where Is the Future Headed? A Snapshot of How Agent Startups Are Surviving in 2025
机器之心· 2026-01-11 01:30
Core Insights
- The article discusses the rising prominence of Agent companies in the AI sector, highlighting the challenges and opportunities they face as they navigate the market landscape heading into 2025 [6].

Group 1: Agent Companies and Market Trends
- The acquisition of Manus by Meta for over $2 billion is seen as a milestone event in the Agent space, sparking diverse interpretations within the industry [7].
- The concept of "Situated Agency" is introduced, emphasizing that an Agent's capabilities are deeply intertwined with its environment, tools, and memory [7].
- Market acceptance of Agents has surged, with 52% of companies using generative AI deploying Agents in production environments [8].

Group 2: Investment and Capital Flow
- Over 20 U.S. Agent startups each raised more than $100 million in the past year, spanning sectors such as programming, B2B customer service, healthcare, and legal [11].
- Harvey, a legal-focused Agent company, completed a $160 million Series E funding round, achieving a valuation of $48 billion [11].
- ElevenLabs raised $100 million to transition toward a conversational Agent platform, indicating a shift in focus toward dialogue-driven applications [12].

Group 3: Sector-Specific Developments
- In the legal sector, EvenUp raised $150 million to automate routine legal tasks, showcasing the growing interest in legal technology [11].
- In search, Parallel and You.com each secured over $100 million in funding, reflecting demand for Agent capabilities in search infrastructure [12].
- The healthcare sector is also seeing significant investment, with companies like Abridge raising $300 million to develop clinical dialogue Agents [15].