Is SFT Doing More Harm Than Good? New Study: Going Straight to Reinforcement Learning Gives Models a Higher Multimodal Reasoning Ceiling
机器之心· 2025-06-01 03:30
Core Insights
- The article discusses the limitations of the "Supervised Fine-Tuning (SFT) + Reinforcement Learning (RL)" paradigm for developing large vision-language models (LVLMs), suggesting that SFT may hinder learning and lead to superficial reasoning paths, while RL promotes genuine multimodal reasoning [3][11][21]

Group 1: Research Findings
- A study from the University of California, Santa Cruz, and the University of Texas at Dallas reveals that SFT can obstruct learning, often producing "pseudo-reasoning paths" that lack depth [3][11]
- The research team created the VLAA-Thinking dataset to systematically investigate the respective roles of SFT and RL in multimodal reasoning, highlighting the distinct contribution of each method [4][8]
- While SFT improves performance on standard tasks, it falls short on complex reasoning, leading to a 47% relative performance decline in a 7B model [11][13]

Group 2: Data and Methodology
- The VLAA-Thinking dataset comprises 203,182 samples, with 126,413 allocated to SFT and 25,195 to RL, designed to provide high-quality reasoning chains [5][6]
- A six-stage data-processing workflow was used to transfer reasoning capabilities from text-only models to LVLMs [6][8]
- A mixed reward function was designed within the GRPO framework to optimize RL in visual contexts, combining different reward types for different problem categories (a schematic sketch follows this summary) [8][19]

Group 3: Performance Analysis
- SFT's imitative reasoning patterns can restrict the exploration space during the RL phase, suggesting that learning directly from reward signals is more effective [15][26]
- Models trained with GRPO alone outperformed those that first underwent SFT; the VLAA-Thinker-Qwen2.5-VL-3B model ranked first among 4B-scale models on the Open LMM reasoning leaderboard, beating the previous record by 1.8% [15][31]
- Response length and reward scores do not correlate significantly with performance, challenging previous assumptions about their relationship [24][26]

Group 4: Implications for Future Research
- The findings suggest that SFT is currently incompatible with GRPO for multimodal reasoning, potentially degrading the performance of both base and instruction-tuned LVLMs [21][22]
- Higher-quality instruction tuning leads to stronger reasoning after RL training, underscoring the need for high-quality instruction data in RL settings [31]
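The summary names two mechanisms worth making concrete: GRPO's group-normalized advantages and a reward that blends a format term with a per-category answer term. Below is a minimal, self-contained sketch of both ideas; the tag layout, weights, and category names are illustrative assumptions, not the paper's actual reward design.

```python
import re

def format_reward(response: str) -> float:
    """1.0 if the output is well-formed: reasoning in <think> tags followed
    by an <answer> tag (an assumed format, common in GRPO-style recipes)."""
    ok = re.search(r"<think>.*?</think>\s*<answer>.*?</answer>", response, re.S)
    return 1.0 if ok else 0.0

def accuracy_reward(response: str, gold: str) -> float:
    """Exact match on the extracted final answer (for verifiable tasks)."""
    m = re.search(r"<answer>(.*?)</answer>", response, re.S)
    return 1.0 if m and m.group(1).strip() == gold.strip() else 0.0

def mixed_reward(response: str, gold: str, task_type: str) -> float:
    """Dispatch a task-specific answer reward, then blend in a format term."""
    if task_type == "verifiable":   # e.g. math or counting: exact-match accuracy
        answer_term = accuracy_reward(response, gold)
    else:                           # open-ended: placeholder for a learned judge score
        answer_term = 0.0
    return 0.9 * answer_term + 0.1 * format_reward(response)

def group_advantages(rewards: list[float]) -> list[float]:
    """GRPO's core trick: normalize rewards within a group of responses
    sampled for the same prompt, avoiding a separate value network."""
    mu = sum(rewards) / len(rewards)
    std = (sum((r - mu) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mu) / (std + 1e-6) for r in rewards]
```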
Reproducing GPT-4o's Image Stylization Consistency at Extremely Low Cost: NUS Releases OmniConsistency
机器之心· 2025-06-01 03:30
This work was led by ShowLab at the National University of Singapore (NUS). First author Yiren Song is a PhD student at ShowLab@NUS working on visual generation and multimodality, with papers at top international venues including CVPR, SIGGRAPH, and NeurIPS. Co-first author Cheng Liu is a fourth-year undergraduate at the NUS Chongqing Research Institute working on visual generation. The project lead is Zheng Shou, a Presidential Young Professor at NUS.

Not long ago, GPT-4o's latest image stylization and editing capabilities arrived with a bang: its Ghibli-style generations were stunning, and they laid bare the wide gap between the open-source community and commercial APIs in stylization consistency.

Open-source diffusion models currently face a seesaw dilemma in image-to-image style transfer: strengthening stylization tends to sacrifice detail, structure, and semantic consistency, while preserving consistency visibly weakens style expression.

To resolve this dilemma, we propose OmniConsistency, which uses paired data to reproduce GPT-4o's excellent stylization consistency and bring near-commercial-grade capability to the open-source ecosystem.

Paper title: OmniConsistency: Learning Style-Agnostic Consistency from Paired Stylization Data

Our solution: Omni ...
CVPR 2025 Highlight | Boosting Autoregressive Models' Example-Learning Ability: A New Few-Shot Image Editing Paradigm, Now Open Source
机器之心· 2025-06-01 03:30
Core Viewpoint
- The article discusses the development of a new autoregressive model called InstaManip, which strengthens in-context learning to better tackle the challenges of few-shot image editing [26]

Summary by Sections

Introduction
- Recent advances in diffusion models have substantially improved text-guided image editing algorithms, but performance declines when user requests are hard to describe or deviate from the training data distribution [1][2]

Problem Statement
- The difficulty arises when users want edits that are poorly represented in the training set, such as transforming an ordinary car into a Lamborghini, which is hard to specify accurately with words alone [1]

Proposed Solution
- The article proposes supplying image examples alongside text instructions, letting the model learn the desired transformation through few-shot image editing [2]

Model Structure and Methodology
- InstaManip employs a novel group self-attention mechanism to learn image-transformation features from both text and image examples, enabling it to edit new input images accordingly [6][15]

Learning Mechanism
- Learning is split into two stages: a learning phase that abstracts transferable knowledge from the examples, and an application phase that applies this knowledge to new scenarios [10][11]

Group Self-Attention Mechanism
- The model stacks multiple group self-attention layers that process text instructions and example images separately, strengthening both the learning and application phases (a schematic masking sketch follows this summary) [16]

Relation Regularization
- To suppress noise from example images that could mislead the model, a relation regularization technique aligns the learned image similarities with those derived from the text instructions [17]

Experimental Results
- InstaManip outperforms previous models in both in-distribution and out-of-distribution settings, establishing a new state of the art for few-shot image editing [19][20]

Ablation Studies
- Ablations show that both the group self-attention mechanism and relation regularization significantly improve performance, confirming that each component is necessary [21][22]

Conclusion
- InstaManip achieves superior results across multiple metrics and improves further as the number and diversity of example images grow [26]
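To make the group self-attention idea concrete, here is a minimal sketch of a grouped attention mask that only lets tokens attend within their own group. The three-way split (text instruction, exemplar images, query image) and the masking scheme are assumptions for illustration; InstaManip's actual layer design may differ.

```python
import torch

def group_attention_mask(group_ids: torch.Tensor) -> torch.Tensor:
    """Boolean (N, N) mask, True where attention is allowed: token i may
    attend to token j only if the two carry the same group id."""
    return group_ids.unsqueeze(0) == group_ids.unsqueeze(1)

# Assumed token layout: group 0 = text-instruction tokens,
# group 1 = exemplar-image tokens, group 2 = query-image tokens.
ids = torch.tensor([0, 0, 0, 0, 1, 1, 1, 2, 2, 2])
mask = group_attention_mask(ids)

scores = torch.randn(len(ids), len(ids))           # raw attention logits
scores = scores.masked_fill(~mask, float("-inf"))  # block cross-group attention
attn = scores.softmax(dim=-1)                      # rows normalize within each group
```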
High-Performance Models at Low Cost: Paradox or Possibility?
机器之心· 2025-05-31 17:15
Core Viewpoint
- The article discusses the paradox of achieving high performance in AI models at low cost, asking whether the perceived decline in model performance is intentional on the part of AI companies and exploring how cost-saving measures affect model quality [2][3]

Group 1: Low-Cost High-Performance Models
- The performance-versus-cost dilemma of large language models (LLMs) has been a focal point of public and industry concern, with ongoing debate over whether top model companies sacrifice precision or service stability to save on inference costs [2][3]
- Since ChatGPT's rise in popularity, users have complained of perceived performance declines, citing weakened logic, more frequent errors, and difficulty following instructions [2][3]
- Public concern that companies trade model performance for cost savings is supported by technical and market evidence, most visibly in the controversy around the DeepSeek-R1 model [3][4]
- The true "full version" of DeepSeek-R1 requires substantial hardware investment, with initial costs reaching hundreds of thousands of yuan, leading some platforms to deploy distilled versions that compromise inference capability and stability [3][4]

Group 2: Cost Management Strategies
- To balance cost and performance, high-end "full version" models are not widely offered, especially in a market flooded with free or low-cost services whose performance is often insufficient [6]
- AI companies increasingly adopt model distillation or simplified models to cut inference costs and contain financial investment [6]
- Common responses to cost pressure include lowering model precision through techniques such as model quantization, pruning, and knowledge distillation, which have become standard industry practice (a minimal quantization sketch follows this summary) [6]
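Of the cost-cutting techniques named above, quantization is the easiest to illustrate. The sketch below shows symmetric per-tensor int8 weight quantization, a deliberately minimal version of what production stacks do with per-channel scales and activation-aware calibration.

```python
import torch

def quantize_int8(w: torch.Tensor):
    """Map float weights to int8 plus one scale factor (symmetric, per-tensor)."""
    scale = (w.abs().max() / 127.0).item()
    q = torch.clamp((w / scale).round(), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: float) -> torch.Tensor:
    """Recover approximate float weights at inference time."""
    return q.float() * scale

w = torch.randn(4096, 4096)                    # a toy weight matrix
q, s = quantize_int8(w)
err = (dequantize(q, s) - w).abs().mean().item()
# int8 stores 1 byte per weight vs 4 for fp32: a 4x memory cut for a small error.
print(f"mean abs error: {err:.5f}, storage: {q.numel()} B vs {w.numel() * 4} B")
```

The trade-off shown here in miniature is the one the article describes: every bit shaved off the weights reduces serving cost but adds approximation error that can surface to users as degraded reasoning.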
Xiaohongshu and Xi'an Jiaotong University Take a Shot at Implementing OpenAI's Undisclosed o3 "Thinking with Images" Technique
机器之心· 2025-05-31 06:30
OpenAI's o3 reasoning model broke through the boundary of the traditional textual chain of thought: it is the first multimodal model to weave images directly into the reasoning process. It does not just "look at" images, it "thinks with" them, enabling a problem-solving style that deeply fuses visual and textual reasoning. Faced with a photographed physics exam, o3 can automatically focus on the formula region, analyze the relationships between variables, and derive the answer with the help of its knowledge base; when parsing architectural drawings, it can rotate or crop local structures mid-reasoning to judge whether a load-bearing design is sound. This "Thinking with Images" capability lifted o3's accuracy on the visual reasoning benchmark V* Bench to 95.7%, setting a new ceiling for multimodal reasoning.

How OpenAI endowed o3 with this ability, however, remains unknown to academia and industry alike. To close that gap, the Xiaohongshu team, together with Xi'an Jiaotong University, used end-to-end reinforcement learning, with no reliance on supervised fine-tuning (SFT), to elicit large models' capacity to "think deeply with images", building the multimodal deep-thinking model DeepEyes. It is the first to achieve o3-like thinking with images, and the technical details have been open-sourced, so "thinking with images" is no longer exclusive to OpenAI (a schematic interaction loop is sketched after this entry).

Paper: https://arxiv.org/abs/2505.14362
Project page: https://visu ...
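A minimal sketch of what an o3-style "thinking with images" interaction loop could look like: the model emits a crop action, the runtime executes it and feeds the zoomed region back into the context. The tag format and the model.generate interface are illustrative assumptions, not DeepEyes' released API.

```python
import re
from PIL import Image

CROP_TAG = re.compile(r"<crop>(\d+),\s*(\d+),\s*(\d+),\s*(\d+)</crop>")

def thinking_with_images(model, image: Image.Image, question: str,
                         max_turns: int = 4) -> str:
    """Alternate between text reasoning and image operations until the model
    stops requesting crops or the turn budget runs out."""
    context = [image, question]
    for _ in range(max_turns):
        output = model.generate(context)   # assumed: multimodal text generation
        m = CROP_TAG.search(output)
        if m is None:
            return output                  # no tool call: treat as the final answer
        left, top, right, bottom = map(int, m.groups())
        patch = image.crop((left, top, right, bottom))
        context += [output, patch]         # re-inject the zoomed-in region
    return model.generate(context + ["Answer now without further crops."])
```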
From Benchmarks to the Real World: What Makes a Reliable Agent Product?
机器之心· 2025-05-31 06:30
Group 1
- The core idea of the article is the introduction of Xbench, an AI benchmarking tool developed by Sequoia China, which emphasizes evaluating AI systems by their practical utility in real-world scenarios rather than by the difficulty of the test questions alone [1][5][6]
- Xbench began in late 2022 as an internal tool for tracking and evaluating the capabilities of foundational models, evolving through three major updates, with its first public release planned for May 2025 [5][6]
- Xbench's dual-track evaluation system pairs AGI Tracking, which probes the upper limits of agent capabilities, with Profession Aligned, which quantifies the practical value of AI systems in real-world applications [6][8]

Group 2
- The Evergreen Evaluation Mechanism is a dynamically updated evaluation system designed to avoid the pitfalls of static benchmarks, which invite overfitting and become obsolete quickly [10]
- The mechanism aims to regularly evaluate mainstream agent products across sectors such as human resources, marketing, finance, law, and sales, keeping pace with the rapid evolution of agent applications [10]

Group 3
- In the first round of testing, models differed sharply on recruitment and marketing tasks: OpenAI's o3 ranked first, while GPT-4o scored lowest because of its tendency to give shorter answers [9]
- The evaluation showed that model size is not the sole determinant of task performance, as the comparable results of Google DeepMind's models demonstrate [9]
- Despite DeepSeek R1's strong showing on mathematical and coding benchmarks, its adaptability issues on search-centric tasks led to lower scores in this evaluation [9]
SSM + Diffusion Models Combine into a Brand-New "Video World Model"
机器之心· 2025-05-31 04:00
Report from 机器之心 (editor: Panda)

When state-space models meet diffusion models, what does it mean for world models?

In this era of exploding AI technology and applications, buzzwords are the last thing we lack: from autoregression to diffusion models, from attention mechanisms to state-space models, from chains of thought to reasoning models... Sometimes several of these buzzwords converge and open new possibilities for the AI world.

The study we introduce today is exactly such a case. It brings together long context, state-space models (SSMs), diffusion models, and world models to create a brand-new "video world model". The work comes from Stanford University, Princeton University, and Adobe Research, and it has drawn considerable attention on social media.

Paper title: Long-Context State-Space Video World Models
Paper: https://arxiv.org/pdf/2505.20171

To understand the contribution, we first need to pin down the relevant concepts. In this paper, a world model refers to a model used to predict how the state of the world evolves as ...

The reason is easy to understand: the frames of the original environment are no longer inside the model's attention window. In theory, memory could be extended with a longer context window, but that approach has two major problems: the computational cost of training grows quadratically with context length, making it prohibitively expensive; ... (a minimal recurrence sketch follows this entry)
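The quadratic-cost problem raised above is exactly what state-space models sidestep: an SSM carries a fixed-size hidden state across frames, so a rollout costs time linear in sequence length rather than attention's quadratic. A minimal diagonal-SSM recurrence, purely illustrative and unrelated to the paper's exact architecture:

```python
import numpy as np

d = 8                              # state size: constant, independent of rollout length
A = np.full(d, 0.95)               # diagonal state transition (decay of the past)
B = np.random.randn(d)             # input projection
C = np.random.randn(d)             # readout

def ssm_rollout(inputs: np.ndarray) -> list[float]:
    """One left-to-right scan: O(T * d) total work and O(d) memory, versus
    O(T^2) pairwise work for full self-attention over T frames."""
    h = np.zeros(d)                # fixed-size summary of everything seen so far
    outputs = []
    for u in inputs:
        h = A * h + B * u          # fold the new frame's signal into the state
        outputs.append(float(C @ h))
    return outputs

ys = ssm_rollout(np.random.randn(10_000))   # 10k steps, constant memory footprint
```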
From Scorer to Thinker: RM-R1 Reshapes Model Value Judgment with Reasoning
机器之心· 2025-05-31 04:00
"To know that it is so, and also to know why it is so."

This Confucian dictum holds that true understanding lies not only in the result but in the reasoning behind it. Today, in the post-training stage of large language models, reward models carry the crucial duty of bridging model behavior and human values; yet existing models usually emit only a score and can hardly explain its basis. A reward without reasoning is "knowing that without knowing why": it is hard to trust and hard to use as guidance for better learning.

A research team at the University of Illinois Urbana-Champaign proposed the RM-R1 framework, which recasts reward modeling as a reasoning task and introduces Reasoning Reward Models (ReasRMs). RM-R1 focuses on integrating reasoning ability into the reward model so that it can evaluate and score model outputs more accurately and align better with human preferences. By generating structured evaluation rubrics and explicit reasoning traces, RM-R1 improves both the interpretability and the performance of reward models (a schematic judging sketch follows this entry).

The paper validates three core findings:

1. Scale brings gains: as models grow and compute increases, RM-R1's reasoning-chain training works better, with performance improving almost linearly;
2. Naively reusing old RL recipes fails: making a model "reason well" requires precisely categorizing problem types and performing targeted distillation of the reasoning process to obtain genuinely generalizable gains;
3. Reasoning is more general than directly emitting answers: compared with traditional direct supervision, RM-R1's reasoning ability is more stable ...
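The reasoning-reward-model idea can be made concrete with a judging routine that asks the model to write rubrics and a rationale before its verdict, then parses only the verdict for training signal. The prompt wording and tags below are illustrative assumptions, not RM-R1's released templates.

```python
import re

RM_PROMPT = """Compare the two candidate answers to the question below.
First write evaluation rubrics inside <rubric>...</rubric>,
then reason step by step inside <eval>...</eval>,
then give your verdict inside <answer>A</answer> or <answer>B</answer>.

Question: {question}
Answer A: {a}
Answer B: {b}"""

def judge(rm_generate, question: str, a: str, b: str) -> str:
    """rm_generate: any text-in/text-out callable wrapping the reward model.
    Returns 'A' or 'B'; the rubric and eval sections make the verdict auditable."""
    out = rm_generate(RM_PROMPT.format(question=question, a=a, b=b))
    m = re.search(r"<answer>\s*([AB])\s*</answer>", out)
    return m.group(1) if m else "A"   # degrade gracefully on malformed output
```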
Starting at $250, and Open Source: Hugging Face Releases Its Most Affordable Humanoid Robots Yet
机器之心· 2025-05-31 04:00
Core Viewpoint
- Hugging Face has officially open-sourced two humanoid robots, HopeJR and Reachy Mini, a step toward Elon Musk's prediction of 10 billion humanoid robots by 2040 [1][31]

Group 1: Robot Specifications
- HopeJR is a full-sized humanoid robot with 66 degrees of freedom, capable of walking and moving its arms [3]
- Reachy Mini is a desktop robot that can move its head, speak, and listen, designed for testing AI applications [5][20]

Group 2: Pricing and Availability
- HopeJR is priced at approximately $3,000, while Reachy Mini costs between $250 and $300, depending on tariffs [7]
- The company plans to ship the first batch of robots by the end of the year, and a waiting list is already open [7]

Group 3: Open Source and Community Impact
- Open-sourcing the robots lets anyone assemble them and understand how they work, democratizing access to robotics technology [7][28]
- Hugging Face aims to build an open-source robotics ecosystem that breaks down barriers to knowledge and technology, making robotics accessible to a far wider audience [28][30]

Group 4: Development and Features
- HopeJR requires developers to control it manually and record actions for training through imitation-learning algorithms [10][12]
- Reachy Mini is designed to help develop AI applications, allowing testing before deployment in real-world scenarios [20]

Group 5: Previous Initiatives
- This is not Hugging Face's first venture into robotics: the company previously launched the LeRobot project and the SO-100 robotic arm design [26][28]
Embodied Evolution, Boundless Future: A Forum Leading the New Wave of the Embodied-Intelligence Model Revolution
机器之心· 2025-05-30 09:33
Report from 机器之心 (机器之心 editorial team)

As embodied intelligence keeps evolving, "embodied AI models + humanoid robots" open up more possibilities for AGI to enter the physical world. The rise of multimodal large models is injecting strong momentum into embodied AI, and the emergence of world models offers a new paradigm for training and testing embodied intelligence. How to make machine intelligence not merely "see" the physical world but understand, plan, and act in it the way humans do is a challenge and an opportunity facing academia and industry alike.

On May 29, the 2025 Zhangjiang Embodied Intelligence Developer Conference and International Humanoid Robot Skills Competition was held at the Zhangjiang Science Hall in Pudong, Shanghai. As a key component of the conference, the forum "Embodied and Boundless: Paradigm Innovation and Architectural Revolution in Intelligent Models" (hereafter "the forum") was held under the guidance of the Shanghai Municipal Commission of Economy and Informatization and the Shanghai Pudong New Area People's Government, hosted by Shanghai Zhangjiang (Group) Co., Ltd., organized by Shanghai Zhangjiang Digital-Intelligence Economy Development Co., Ltd. and 机器之心, and co-organized by the Zhangjiang AI Chamber of Commerce of the Pudong New Area Federation of Industry and Commerce.

The forum gathered more than ten heavyweight guests, including top technical experts, scholars from renowned universities, and representatives of leading embodied-intelligence companies. Industry leaders shared deep insights and technical experts debated on stage, digging into hot topics such as embodied AI and world models, hierarchical decision-making versus end-to-end approaches, and scaling laws for embodied intelligence, delivering five keynote talks and one high-quality roundtable. The forum was moderated by 机器之心 deputy editor-in-chief Xie Wenfei ...