机器之心
Apple flexes a traditional strength: three visual modalities finally unified
机器之心· 2025-09-22 10:27
Core Insights
- The article discusses the recent release of Apple's new products and the ongoing conversation about the hardware advancements of the new phones [1]
- It highlights that Apple has not yet introduced any groundbreaking AI applications, with Apple Intelligence still lagging in the domestic market [2]
- The article notes a concerning trend of talent loss within Apple's AI and hardware teams, suggesting a less optimistic outlook for the company [3]

AI Research and Development
- Despite challenges in the large model domain, Apple has a strong background in computer vision research [4]
- The article emphasizes a significant pain point in building vision-related large models: visual modalities (images, videos, and 3D) require separate handling due to their different data dimensions and representation methods [4][5]
- Apple's research team has proposed ATOKEN, a unified tokenizer for vision, which addresses the core limitation of existing models by enabling unified processing across all major visual modalities while maintaining reconstruction quality and semantic understanding [5][6][8]

ATOKEN Architecture
- ATOKEN introduces a shared sparse 4D latent space that allows all visual modalities to be represented as feature-coordinate pairs [11]
- The architecture uses a pure Transformer framework rather than traditional convolutional methods, and incorporates a four-stage progressive training curriculum to enhance multimodal learning without degrading single-modality performance [15][16][19]
- The training phases comprise image-based pre-training, video dynamic modeling, integration of 3D geometry, and discrete tokenization through finite scalar quantization [19][20]

Performance Metrics
- ATOKEN demonstrates industry-leading performance across various evaluation metrics, achieving high-quality image reconstruction and semantic understanding [21][23]
- In image tokenization, ATOKEN achieved a reconstruction performance of 0.21 rFID at 16×16 compression on ImageNet, outperforming the UniTok method [23]
- For video processing, it achieved 3.01 rFVD and 33.11 PSNR on the DAVIS dataset, indicating competitive performance with specialized video models [24]
- In 3D asset handling, ATOKEN achieved 28.28 PSNR on the Toys4k dataset, surpassing dedicated 3D tokenizers [29]

Conclusion
- The results indicate that the next generation of multimodal AI systems based on unified visual tokenization is becoming a reality, showcasing ATOKEN's capabilities in both generative and understanding tasks [26][27]
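ATOKEN's final training stage produces discrete tokens via finite scalar quantization (FSQ). As a rough illustration of the general FSQ idea only (not Apple's implementation; the tanh bounding and per-channel level counts below are assumptions), each latent channel is squashed into a bounded range and snapped to a small uniform grid, so the discrete code is simply the tuple of grid positions:

```python
import numpy as np

def fsq_quantize(z, levels):
    """Finite scalar quantization sketch: bound each latent channel to
    (-1, 1), then round it onto a uniform grid with `levels[i]` values
    per channel. The implicit codebook size is the product of `levels`."""
    z = np.tanh(np.asarray(z, dtype=float))  # bound each channel
    out = []
    for zi, L in zip(z, levels):
        half = (L - 1) / 2.0
        out.append(np.round(zi * half) / half)  # snap to the grid
    return np.array(out)
```

Unlike vector quantization, this needs no learned codebook or commitment losses, which is the usual appeal of FSQ as a tokenization stage.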
Nearly 5× faster! Peking University and ByteDance propose BranchGRPO, reshaping diffusion model alignment with tree-style branching + pruning
机器之心· 2025-09-22 07:26
Core Insights
- The article introduces BranchGRPO, a novel tree-structured reinforcement learning method developed by Peking University and ByteDance, which addresses the challenges of efficient sampling and stable optimization in human preference alignment for diffusion and flow matching models [2][9].

Group 1: Research Background and Challenges
- Diffusion and flow matching models have become mainstream in visual generation due to their high fidelity, diversity, and controllability, but they often fail to align with human intentions, producing results that deviate in aesthetic, semantic, or temporal consistency [5].
- Reinforcement learning from human feedback (RLHF) has been introduced to directly optimize generative models so that outputs better align with human preferences [6].
- The existing Group Relative Policy Optimization (GRPO) method shows good stability and scalability in image and video generation but faces two fundamental bottlenecks: inefficiency due to sequential rollout, and sparse rewards that ignore critical signals in intermediate states [8].

Group 2: BranchGRPO Methodology
- BranchGRPO restructures the sampling process from a single path into a tree structure, allowing efficient exploration and reducing redundancy in sampling [11][14].
- The method incorporates branching, reward fusion, and pruning mechanisms to enhance both speed and stability, achieving significant improvements in training efficiency and reward attribution [13][14].
- In image alignment tests, BranchGRPO demonstrated a speedup of up to 4.7× over DanceGRPO, with iteration times dropping from 698 seconds to as low as 148 seconds [15].

Group 3: Performance Metrics
- In image alignment (HPDv2.1), BranchGRPO achieved a score of 0.369, surpassing DanceGRPO's 0.360, while also achieving the highest image reward of 1.319 [15][17].
- For video generation (WanX-1.3B), BranchGRPO produced clearer and more stable video frames than previous models, with iteration times reduced from approximately 20 minutes to about 8 minutes, effectively doubling training efficiency [18][19].

Group 4: Experimental Findings
- Ablation studies indicate that moderate branching correlation and early dense splits accelerate reward improvement, while path-weighted reward fusion stabilizes training [23].
- Sample diversity remains intact at MMD² ≈ 0.019, nearly identical to sequential sampling [24].
- BranchGRPO's efficiency allows branch sizes to scale easily without performance degradation, with iteration times significantly reduced even at larger sample sizes [27].

Group 5: Conclusion and Future Outlook
- BranchGRPO innovatively combines efficiency and stability, transforming reward signals from a single endpoint into a continuous feedback mechanism, leading to comprehensive improvements in speed, stability, and alignment effectiveness [30].
- Future developments may include adaptive splitting and pruning strategies, potentially establishing BranchGRPO as a core method for RLHF in diffusion and flow models, enhancing human preference alignment [30].
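The tree-structured rollout idea can be caricatured in a few lines. This is a toy sketch only, not the paper's implementation: the step, scoring, and pruning rules below are stand-ins, but they show how forking at chosen depths lets branches share earlier prefixes while pruning discards low-reward partial paths:

```python
def tree_rollout(step_fn, score_fn, depth, branch_depths, width, keep):
    """Toy tree-structured rollout: at each depth in `branch_depths`, fork
    every path `width` ways, then keep only the `keep` highest-scoring
    partial paths; other depths advance each path by a single step."""
    paths = [[]]
    for d in range(depth):
        if d in branch_depths:
            # fork: each path spawns `width` children sharing its prefix
            paths = [p + [step_fn(p, b)] for p in paths for b in range(width)]
            paths.sort(key=score_fn, reverse=True)
            paths = paths[:keep]  # prune low-reward branches
        else:
            paths = [p + [step_fn(p, 0)] for p in paths]
    return paths
```

Because children share their prefix computation up to the fork point, a tree of N leaves costs far less than N independent sequential rollouts, which is where the reported speedups come from.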
LeCun-endorsed JEPA enters the LLM arena: training LLMs with a computer-vision approach yields gains in both performance and robustness
机器之心· 2025-09-22 07:26
Core Viewpoint
- The article discusses the introduction of LLM-JEPA, a new architecture that extends the Joint Embedding Predictive Architecture (JEPA) concept from the visual domain to large language models (LLMs), enhancing their performance and robustness across various tasks [8][10][12].

Group 1: Introduction of LLM-JEPA
- LLM-JEPA is based on the JEPA concept, which aims to learn world knowledge efficiently by predicting future or missing features in an abstract representation space [7][8].
- The architecture successfully applies the JEPA objective to LLMs by treating data pairs (text, code) as different views of the same underlying knowledge [8][10].

Group 2: Performance and Validation
- Experimental results show that LLM-JEPA significantly outperforms standard LLM training objectives and demonstrates strong robustness against overfitting [10][11].
- The method has been validated across mainstream model families and diverse datasets, including Llama3, OpenELM, and Rotten Tomatoes [11][21].

Group 3: LLM-JEPA Objective Function Design
- The LLM-JEPA objective function retains the generative capabilities of LLMs while enhancing their abstraction capabilities through joint embedding predictive tasks [15][16].
- The design combines the traditional LLM loss with the JEPA objective in a single loss function, allowing a unified treatment of different types of views [15][16].

Group 4: Empirical Results
- LLM-JEPA has been shown to improve fine-tuning outcomes across multiple pre-trained LLMs and datasets, with performance gains observed in various configurations [21][23].
- The architecture also improves pre-training effectiveness, yielding higher-quality representations than traditional methods [32][34].

Group 5: Future Directions and Limitations
- The research team plans larger-scale tests to further explore LLM-JEPA's potential, despite current limitations such as increased computational costs from the need for multi-view representations [35][36].
- Concerns have been raised about the method's reliance on paired data, which may limit its generalizability and practical application [36].
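At a high level, an objective of this shape adds a predictive embedding-distance term to the usual next-token loss. The sketch below is a generic illustration, not the paper's exact formulation: the identity predictor, cosine distance, and weighting `lam` are all assumptions made for brevity:

```python
import numpy as np

def llm_jepa_style_loss(nll_loss, emb_view1, emb_view2, lam=1.0):
    """Sketch: combine the standard LLM negative log-likelihood with a
    JEPA-style term asking one view's embedding (e.g. the text) to predict
    the other's (e.g. the code). Here the predictor is the identity map and
    the distance is 1 - cosine similarity -- stand-ins for illustration."""
    v1 = np.asarray(emb_view1, dtype=float)
    v2 = np.asarray(emb_view2, dtype=float)
    cos = v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return nll_loss + lam * (1.0 - cos)
```

Keeping the generative NLL term is what preserves the model's text-generation ability while the second term shapes the representation space.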
Breaking the post-training bottleneck? Another work from Meta Superintelligence Labs: CaT tackles the RL supervision problem
机器之心· 2025-09-22 02:05
Reported by 机器之心 | 机器之心 Editorial Department

In the AI field, post-training is the usual way to give models specialized skills. However, post-training generally relies on supervised fine-tuning with annotated references, or on rewards provided by verifiable programmatic checkers.

This raises a problem: many valuable tasks currently lack both resources. In non-verifiable scenarios (clinical settings, free-form dialogue, and creative writing), multiple valid answers may exist, making deterministic rule checks hard to implement.

In such cases, practitioners are often left relying on (i) laborious annotation pipelines, or (ii) coarse rewards assigned to free-form outputs by another LLM.

But when post-training lacks ground-truth annotations, where does the learning signal come from?

To answer this question, researchers from the University of Oxford, Meta Superintelligence Labs, and other institutions asked: can inference-time compute substitute for the missing supervision?

The paper argues yes, proposing a method called CaT (Compute as Teacher). The core idea is to treat extra inference-time compute as a teacher signal, providing supervision for large models even in the absence of human annotations or verifiable answers.

Results show that applying CaT directly at inference time significantly improved the performance of Gemma 3 4B, Qwen 3 4B, and Llama 3.1 8B, even in non-verifiable domains (up to 27% improvement on MATH-500; on HealthBench, an improvement of ...
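The "compute as teacher" idea can be caricatured in a few lines: spend extra inference compute to draw several candidate answers, then have the model itself synthesize a single reference from them, and treat that reference as the supervision target. The prompt wording and the `model` interface below are hypothetical stand-ins, not the paper's actual protocol:

```python
def compute_as_teacher(model, prompt, n=8):
    """CaT-style sketch: extra inference compute (n samples plus one
    synthesis pass) stands in for a missing human reference.
    `model` is any callable mapping a prompt string to a text answer."""
    candidates = [model(prompt) for _ in range(n)]
    synthesis = (prompt
                 + "\n\nCandidate answers:\n"
                 + "\n".join(f"{i + 1}. {c}" for i, c in enumerate(candidates))
                 + "\n\nSynthesize the single best answer from the candidates.")
    reference = model(synthesis)  # the synthesized answer acts as the teacher
    return reference, candidates
```

The synthesized reference can then score or fine-tune the policy, which is what lets the approach operate in domains where no verifier or annotator exists.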
EMNLP 2025 | Combining SFT and RL: vivo AI Lab proposes a new post-training method
机器之心· 2025-09-22 02:05
The first author, Zeng Min, is from vivo AI Lab; his main research directions are large language models, reinforcement learning, and agents.

Supervised fine-tuning (SFT) and reinforcement learning (RL) fine-tuning are the two common post-training approaches for large models. RL fine-tuning has made good progress in many NLP scenarios, but in text classification it has not advanced much and often underperforms supervised learning.

SFT and RL each have distinct characteristics during training. SFT memorizes the answers by rote: it is simple and effective and converges quickly, but generalizes poorly. RL obtains answers through exploration and generalizes well, but because it only explores without directly learning the answers, training is slow, may fail to converge for long stretches, and can even become unstable late in training.

To address these problems, the vivo AI Lab algorithm team recently proposed a new post-training framework for large models, GTA, which combines the strengths of SFT and RL and successfully solves the slow-convergence problem of RL in text classification. The paper has been accepted by EMNLP 2025, one of the top academic conferences in AI.

Paper title: GTA: Supervised-Guided Reinforcement Learning for Text Classification with Lar ...
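The complementary strengths described above are often combined by mixing the two objectives. As a generic illustration only (not vivo's exact GTA formulation), a weighted sum of an SFT negative log-likelihood term and a REINFORCE-style advantage-weighted term looks like:

```python
import numpy as np

def mixed_sft_rl_loss(gold_logprobs, sample_logprobs, advantages, alpha=0.5):
    """Generic SFT+RL mixture (an illustration, not the GTA paper's loss).
    SFT term: negative log-likelihood of gold labels (fast convergence).
    RL term: REINFORCE-style advantage-weighted negative log-prob of
    sampled outputs (exploration / generalization)."""
    sft = -np.mean(np.asarray(gold_logprobs))
    rl = -np.mean(np.asarray(advantages) * np.asarray(sample_logprobs))
    return alpha * sft + (1 - alpha) * rl
```

The supervised term anchors training while the RL term keeps exploration alive, which is the general intuition behind supervised-guided RL schemes.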
One of the researchers behind Google Gemini's IMO and ICPC gold medals is poached by xAI; Musk exclaims: taking off
机器之心· 2025-09-21 05:26
Core Insights
- The article discusses the competitive landscape in the AI industry, highlighting talent poaching among major companies like Tesla, Meta, Google, and xAI [1][2].

Group 1: Talent Movement
- Ashish Kumar, head of Tesla's Optimus AI team, was recruited by Meta, while Dustin Tran, a senior researcher from Google's DeepMind, was hired by xAI [2][5].
- Dustin Tran had a significant impact at Google, contributing to the development of the Gemini models, including Gemini-0801, which topped the LMSYS leaderboard [5][9].

Group 2: Achievements and Contributions
- Tran's work at Google included leading the post-training evaluation of Gemini, achieving top rankings in various benchmarks, and contributing to foundational papers in AI [7][9].
- The Gemini project underwent a transformative journey, evolving from a simple chatbot to a model capable of complex reasoning and deep thinking, despite initial skepticism from the public [9][10].

Group 3: xAI's Strategy and Developments
- At xAI, Tran emphasized the company's belief in the power of computing resources and data, claiming that the team has access to an unprecedented number of chips [12].
- xAI recently launched Grok 4 Fast, a model that performs comparably to Grok 4 at a significantly reduced cost, showcasing the company's rapid innovation capabilities [12].
Industrial-grade 3D world building 90× faster! New framework LatticeWorld makes a virtual world "come true with one sentence"
机器之心· 2025-09-21 05:26
The authors of this paper are from NetEase, City University of Hong Kong, Beihang University, Tsinghua University, and other institutions. The co-first authors are Duan Yinglin (NetEase Games), Zou Zhengxia (Beihang University), and Gu Tongwei (NetEase Games). The corresponding authors are Qiu Shuang (City University of Hong Kong) and Chen Kang (NetEase Games).

Paper title: LatticeWorld: A Multimodal Large Language Model-Empowered Framework for Interactive Complex World Generation
Paper link: https://arxiv.org/pdf/2509.05263

How much time and manpower does it take to build an industrial-grade, highly realistic 3D virtual world? What if AI could generate one quickly and automatically from just a text description and a sketch — would you believe it?

This is not science fiction. The LatticeWorld framework proposed in this new paper turns instructions directly into scenes. The method seamlessly integrates large language models with the industrial-grade 3D rendering engine Unreal Engine 5 (UE5), connecting to an industrial procedural content generation (PCG) pipeline so that a virtual world "comes true with one sentence." It improves creation efficiency by 90×, a revolutionary breakthrough for 3D world building.

In embodied intelligence, autonomous driving, game development, and film and television production, high-quality 3D world building ...
SOTA on two global leaderboards! Minglue Technology's proprietary large model Mano opens a new era of intelligent GUI operation
机器之心· 2025-09-21 05:26
Core Viewpoint
- Minglue Technology's proprietary GUI model, Mano, has achieved record-breaking SOTA results on the recognized Mind2Web and OSWorld benchmarks, establishing a new paradigm for GUI intelligent agents through innovations in online reinforcement learning and automatic data collection [1][14][23].

Group 1: Performance Achievements
- Mano achieved a success rate of 40.1% on the OSWorld-Verified benchmark, surpassing other models such as Qwen and GUI-Owl [10][19].
- On the Mind2Web benchmark, Mano demonstrated superior performance across metrics including element accuracy and step success rate, significantly outperforming all other SOTA methods [18][15].
- The model's success rate on OSWorld-Verified reached 41.6±0.7%, an improvement of roughly 7 percentage points over competitors [21][19].

Group 2: Innovations and Methodology
- Mano introduces online reinforcement learning as a novel training paradigm in the GUI interaction field, enhancing its performance in dynamic environments [22][23].
- The model's architecture consists of three main components: an exploration module, a processing flow, and an optimization process, which collectively improve its reasoning and adaptability [25][26].
- The automatic data collection method developed by the technical team significantly improves the efficiency and accuracy of data acquisition, allowing the generation of high-quality interaction trajectory data [48][49].

Group 3: Market Context and Future Directions
- Demand for AI agents is expected to surge by 2025, positioning Mano for differentiated competition through access to data sources that other agents cannot reach [59][63].
- Minglue Technology plans to continue exploring data collection, training integration, and CAPTCHA handling to further optimize Mano for real-world applications [66].
Will Tool-Integrated RL be the key for agent applications to break through "base-model capability limits"?
机器之心· 2025-09-21 01:30
Core Insights
- The article discusses the evolution of AI agents, emphasizing the need for enhanced reasoning capabilities through Tool-Integrated Reasoning (TIR) and Reinforcement Learning (RL) to overcome limitations in current AI models [7][8][10].

Group 1: AI Agent Development
- The term "Agent" has evolved, with a consensus that stronger agents must interact with the external world and take actions, moving beyond reliance on pre-trained knowledge [8][9].
- AI systems are categorized into LLM, AI Assistant, and AI Agent, with the latter gaining proactive execution capabilities [9][10].
- The shift from simple tool use to TIR is crucial for agents to handle complex tasks that require multi-step reasoning and real-time interaction [10][12].

Group 2: Tool-Integrated Reasoning (TIR)
- TIR is identified as a significant research direction, allowing agents to understand goals, plan autonomously, and utilize tools effectively [10][12].
- The transition from supervised fine-tuning (SFT) to RL in TIR is driven by the need for agents to actively learn when and how to use external APIs [12][14].
- TIR enhances the capabilities of LLMs by integrating external tools, enabling them to perform tasks that were previously impossible, such as complex calculations [12][13].

Group 3: Practical Implications of TIR
- TIR allows for empirical support expansion, enabling LLMs to generate previously unattainable problem-solving trajectories [12][14].
- Feasible support expansion through TIR makes complex strategies practically executable within token limits, transforming theoretical solutions into efficient strategies [14][15].
- Integrating tool usage into the reasoning process elevates the agent's ability to optimize multi-step decision-making through feedback from tool outcomes [15].
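The shift from static knowledge to active tool use typically takes the form of an act-observe loop. A minimal sketch of such a loop (the action tuple format and the `llm` interface are assumptions made for illustration, not any particular framework's API):

```python
def tir_loop(llm, tools, prompt, max_steps=5):
    """Tool-integrated reasoning sketch: at each step the model either calls
    a tool or gives a final answer; tool results are appended to the context
    so the next step can reason over them. `llm` is a callable returning
    ("tool", name, arg) or ("answer", text)."""
    context = prompt
    for _ in range(max_steps):
        action = llm(context)
        if action[0] == "answer":
            return action[1]
        _, name, arg = action
        result = tools[name](arg)  # execute the requested tool
        context += f"\n[observation] {name}({arg}) -> {result}"
    return None  # step budget exhausted without a final answer
```

RL over such loops rewards the final outcome, letting the model learn when a tool call pays off rather than being told via supervised traces.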
Collective communication library VCCL unleashes extreme GPU compute; major open-source release from 创智, 基流, Zhipu AI, China Unicom, Beihang, Tsinghua, and Southeast University
机器之心· 2025-09-21 00:30
Released by 机器之心 | 机器之心 Editorial Department

The dual challenges of computation speed and system stability are driving AI infrastructure toward a new generation of collective communication technology.

As artificial intelligence develops rapidly, ultra-large-scale intelligent computing clusters have become the core infrastructure driving technological breakthroughs.

Overseas tech giants are all making moves: OpenAI, together with Oracle and SoftBank, is advancing the "Stargate" project, planned to include millions of GPUs at a cost exceeding one hundred billion US dollars; Microsoft, Google, and xAI have each brought 100,000-GPU clusters into service.

In China, telecom operators are also accelerating their transformation into AI infrastructure providers, with cumulative investment exceeding ten billion RMB, four 10,000-GPU-scale intelligent computing centers built, and intelligent computing capacity more than doubled.

Ultra-large-scale intelligent computing clusters face many challenges: heavy hardware investment and high operation and maintenance costs. More importantly, simply piling up hardware does not solve everything; the core challenge is designing software systems that organize tens of thousands of compute units. At the scale of ten thousand or even a million GPUs, device failures become the norm rather than the exception, and the failure of any single component can interrupt an entire training job, making compute utilization and system stability more critical metrics than raw compute.

AI infrastructure consists of computation plus communication, and the collective communication library, the "nervous system" of an intelligent computing cluster, is increasingly important. A collective communication library sits at the intersection of GPU compute chips and high-performance networking and is a foundational component of the GPU software stack. For example, NVIDIA's ...
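To make concrete what a collective communication library provides, here is a toy single-process simulation of ring all-reduce, the classic collective behind data-parallel training (a pedagogical sketch of the general algorithm, unrelated to VCCL's actual implementation):

```python
def ring_allreduce(data):
    """Simulate ring all-reduce for n ranks, each holding n chunks (one
    value per chunk here). After a reduce-scatter phase and an all-gather
    phase, every rank holds the elementwise sum across all ranks."""
    n = len(data)
    data = [list(row) for row in data]
    # Phase 1: reduce-scatter -- partial sums accumulate around the ring.
    for step in range(n - 1):
        snap = [row[:] for row in data]  # values before this step's sends
        for r in range(n):
            idx = (r - step) % n
            data[(r + 1) % n][idx] += snap[r][idx]
    # Phase 2: all-gather -- each completed chunk circulates around the ring.
    for step in range(n - 1):
        snap = [row[:] for row in data]
        for r in range(n):
            idx = (r + 1 - step) % n
            data[(r + 1) % n][idx] = snap[r][idx]
    return data
```

Each rank sends and receives only 2(n−1)/n of the data volume per all-reduce, which is why the ring schedule scales to very large clusters and why the efficiency of its implementation dominates training throughput.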