Multimodal Reasoning

Is SFT Doing More Harm Than Good? New Research: Going Straight to Reinforcement Learning Gives Models a Higher Multimodal Reasoning Ceiling
机器之心· 2025-06-01 03:30
Core Insights
- The article discusses the limitations of the "Supervised Fine-Tuning (SFT) + Reinforcement Learning (RL)" paradigm for developing large vision-language models (LVLMs), suggesting that SFT may hinder learning and lead to superficial reasoning paths, while RL promotes genuine multimodal reasoning [3][11][21]

Group 1: Research Findings
- A study from the University of California, Santa Cruz, and the University of Texas at Dallas finds that SFT can obstruct learning, often producing "pseudo-reasoning paths" that lack depth [3][11]
- The research team built the VLAA-Thinking dataset to systematically investigate the roles of SFT and RL in multimodal reasoning, highlighting each method's distinct contribution [4][8]
- The findings indicate that while SFT improves performance on standard tasks, it falls short on complex reasoning, causing a 47% relative performance decline in a 7B model [11][13]

Group 2: Data and Methodology
- The VLAA-Thinking dataset comprises 203,182 samples, with 126,413 for SFT and 25,195 for RL, designed to provide high-quality reasoning chains [5][6]
- The research employed a six-stage data-processing workflow to transfer reasoning capabilities from text-only models to LVLMs [6][8]
- A mixed reward function was designed within the GRPO framework to optimize RL in visual contexts, incorporating different reward types for different problem categories [8][19]

Group 3: Performance Analysis
- The study found that SFT's imitative reasoning patterns can limit the exploration space during the RL phase, suggesting that learning directly from reward signals is more effective [15][26]
- Models trained solely with GRPO outperformed those that underwent SFT, with VLAA-Thinker-Qwen2.5-VL-3B ranking first on the Open LMM reasoning leaderboard for 4B-scale models, setting a new record by a 1.8% margin [15][31]
- The analysis revealed that response length and reward scores do not correlate significantly with performance, challenging previous assumptions about their relationship [24][26]

Group 4: Implications for Future Research
- The findings suggest that SFT is currently incompatible with GRPO for multimodal reasoning, potentially harming the performance of both base and instruction-tuned LVLMs [21][22]
- The research emphasizes the need for high-quality instruction tuning in RL settings, indicating that better instruction tuning leads to stronger reasoning capabilities after RL training [31]
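The summary mentions a "mixed reward function" within GRPO that applies different reward types to different problem categories, but gives no detail. As a hedged illustration only, such a router might combine an accuracy reward with a format reward under per-category weights; every function name, regex, and weight below is hypothetical, not the paper's actual design:

```python
import re

def format_reward(response: str) -> float:
    """Reward responses that wrap their reasoning in <think>...</think> tags."""
    return 1.0 if re.search(r"<think>.*</think>", response, re.DOTALL) else 0.0

def accuracy_reward(response: str, answer: str) -> float:
    """Exact-match reward on a final \\boxed{...} answer, if one is present."""
    m = re.search(r"\\boxed\{([^}]*)\}", response)
    return 1.0 if m and m.group(1).strip() == answer.strip() else 0.0

def mixed_reward(problem_type: str, response: str, answer: str) -> float:
    # Hypothetical per-category weighting of accuracy vs. format rewards.
    weights = {"math": (0.9, 0.1), "open_ended": (0.5, 0.5)}
    w_acc, w_fmt = weights.get(problem_type, (0.7, 0.3))
    return w_acc * accuracy_reward(response, answer) + w_fmt * format_reward(response)
```

In a GRPO loop, scores like these would be computed per sampled response and normalized within each group to form advantages.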
ICML 2025 Spotlight | Are Multimodal Large Models Exposing a Weak Spot? The EMMA Benchmark Takes a Deep Look at Multimodal Reasoning Ability
机器之心· 2025-05-20 04:58
"Three point charges +Q, -2Q, and +3Q are placed equidistantly. Which vector best describes the direction of the net electric force acting on the +Q charge?"

We can solve this problem easily by sketching a free-body diagram. Yet even an advanced multimodal large language model such as GPT-4o can misapply the basic principle that like charges repel and get the direction of the repulsive force wrong (for example, judging the repulsion from +3Q on +Q to point down and to the right instead of the correct up and to the left).

This seemingly simple physics problem exposes a critical weakness of multimodal large models: current MLLMs still cannot perform complex multimodal reasoning that requires deep fusion of vision and text. The newly released EMMA benchmark acts as a revealing mirror, showing that even top-tier MLLMs fall significantly short on this key capability. The work has been accepted as an ICML 2025 spotlight, and all code and data are open source.

Several models and methods have already been evaluated on EMMA. The study finds that even the most advanced model, Gemini-2.5-pro-exp-03-25, as well as the o3/o4-mini models capable of visual tool calls, still trail human experts by more than 20% on EMMA.

Title: Can MLLMs Reason in Multi ...
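The point-charge question above can be checked numerically with Coulomb's law. The sketch below assumes one specific equilateral-triangle layout (the article gives no coordinates) and normalizes both the Coulomb constant and Q to 1:

```python
import math

K = 1.0  # Coulomb constant, normalized units

def coulomb_force(q_on, pos_on, q_src, pos_src):
    """Force on charge q_on at pos_on due to charge q_src at pos_src.
    A positive product of charges gives a force pointing away from the source."""
    dx, dy = pos_on[0] - pos_src[0], pos_on[1] - pos_src[1]
    r = math.hypot(dx, dy)
    mag = K * q_on * q_src / r**2
    return (mag * dx / r, mag * dy / r)

Q = 1.0
# Hypothetical equilateral-triangle placement, side length 1
pos_Q = (0.0, 0.0)
pos_m2Q = (1.0, 0.0)
pos_3Q = (0.5, math.sqrt(3) / 2)

f_attract = coulomb_force(Q, pos_Q, -2 * Q, pos_m2Q)  # pulls +Q toward -2Q
f_repel = coulomb_force(Q, pos_Q, 3 * Q, pos_3Q)      # pushes +Q away from +3Q
net = (f_attract[0] + f_repel[0], f_attract[1] + f_repel[1])
print(round(net[0], 3), round(net[1], 3))  # 0.5 -2.598
```

The sign convention does the bookkeeping that GPT-4o reportedly gets wrong: the repulsive term automatically points away from +3Q, so the direction error described in the article cannot occur.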
Matches o3 at Guessing Locations from Images! ByteDance Releases the Seed1.5-VL Multimodal Reasoning Model, Taking First Place on 38 of 60 Mainstream Benchmarks
量子位· 2025-05-14 06:07
Yishui, from Aofeisi | QbitAI

First place on 38 of 60 mainstream benchmarks!

ByteDance has released Seed1.5-VL, a lightweight multimodal reasoning model that uses only a 532M-parameter vision encoder plus 20B active parameters to compete with far larger top-tier models, and it can reason deeply over images. The accompanying technical report was published at the same time.

Overall, despite punching above its weight, the new model performs strongly in complex puzzle reasoning, OCR, chart understanding, and 3D spatial understanding.

For example, when guessing how many cats appear in a picture, the human eye can easily mistake the black cat on the ground for a shadow; the model does not. It can also solve complex reasoning puzzles (good news for civil-service exam takers), and in "spot the difference" games it beats humans in both speed and accuracy.

These abilities build on its strong OCR capability: even an extraordinarily long receipt mixing Chinese and English can be converted into a table in moments.

So how does it do this?

532M vision encoder + 20B mixture-of-experts language model

Digging into the technical report, the key lies in the model architecture and training details. Seed1.5-VL consists of three core components:

- SeedViT: encodes images and videos;
- MLP adapter: projects visual features into multimodal tokens;
- Large language model: processes multimodal inputs and performs reasoning.

The model supports image inputs at multiple resolutions, and through ...
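The three-component pipeline described in the report (vision encoder → MLP adapter → language model) can be sketched as a simple shape-level demo. All dimensions, weights, and activations below are placeholders, not Seed1.5-VL's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only; not the real Seed1.5-VL dimensions.
VIS_DIM, LLM_DIM = 64, 128

def vision_encoder(image_patches):
    """Stand-in for SeedViT: maps flattened patches to visual features."""
    w = rng.standard_normal((image_patches.shape[-1], VIS_DIM))
    return image_patches @ w                     # (n_patches, VIS_DIM)

def mlp_adapter(vis_feats):
    """Projects visual features into the language model's token space."""
    w1 = rng.standard_normal((VIS_DIM, LLM_DIM))
    w2 = rng.standard_normal((LLM_DIM, LLM_DIM))
    h = np.maximum(vis_feats @ w1, 0.0)          # ReLU stands in for the real activation
    return h @ w2                                # (n_patches, LLM_DIM) multimodal tokens

def build_llm_input(image_patches, text_tokens):
    """Prepend projected visual tokens to the embedded text sequence."""
    vis_tokens = mlp_adapter(vision_encoder(image_patches))
    return np.concatenate([vis_tokens, text_tokens], axis=0)

patches = rng.standard_normal((4, 768))   # 4 flattened image patches
text = rng.standard_normal((5, LLM_DIM))  # 5 already-embedded text tokens
seq = build_llm_input(patches, text)
print(seq.shape)  # (9, 128): visual tokens followed by text tokens
```

The language model itself then attends over this combined sequence, which is what lets the reasoning chain reference image content directly.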
Kunlun Wanwei: Q1 Revenue Surges 46%; Breakthrough Progress in AI Computing Chips
Zheng Quan Shi Bao Wang· 2025-04-29 02:00
Core Viewpoint
- Kunlun Wanwei (300418.SZ) reported significant revenue growth of 46% year-on-year in Q1 2025, driven by advances in AI computing chips and applications [1]

Group 1: Financial Performance
- The company achieved operating revenue of 1.76 billion yuan in Q1 2025, a 46% increase over the prior year [1]
- R&D expenses reached 430 million yuan, up 23% year-on-year [1]
- Annual recurring revenue (ARR) for AI music reached approximately 12 million USD, with monthly revenue of about 1 million USD [1]
- ARR for the short-drama platform Dramawave was approximately 120 million USD, with monthly revenue of around 10 million USD [1]
- Overseas revenue amounted to 1.67 billion yuan, up 56% year-on-year, accounting for 94% of total revenue [1]

Group 2: Technological Advancements
- The company launched several disruptive technologies in multimodal reasoning, video generation, and audio generation, achieving state-of-the-art (SOTA) status across multiple models [2]
- The Skywork R1V multimodal reasoning model reached open-source SOTA, while the SkyReels-V1 model and SkyReels-A1 algorithm lead the global video-generation field [2]
- In AI music, the Mureka V6 and Mureka O1 models demonstrated a competitive edge, with Mureka O1 surpassing competitors in performance [2]

Group 3: AI Chip Development
- The company made significant progress in AI computing chip R&D, moving toward the goal of "Chinese chips, Kunlun manufacturing" [3]
- Kunlun Wanwei acquired a controlling stake in Beijing Aijietek Technology Co., Ltd., completing a full industry-chain layout from computing infrastructure to AI applications [3]
- The AI chip R&D team has grown to nearly 200 employees, spanning fields from chip design to algorithm development [3]

Group 4: Future Prospects
- The company plans to launch the Skywork.ai platform in mid-May 2025, featuring a system of five expert-level AI agents for optimizing various professional tasks [3]
- The Opera business segment, including overseas information distribution and metaverse operations, saw revenue rise 41%, driven by Opera Ads [4]
- The company aims to continue advancing AI computing chip development and innovating its AI application matrix to deliver leading AI product experiences globally [4]
AI Developments Tracking Series (6): OpenAI o3 and Doubao Debut New Products; Watch Native Agents and Multimodal Reasoning
Ping An Securities· 2025-04-17 13:10
Investment Rating
- The industry investment rating is "Outperform the Market" [1][38]

Core Insights
- OpenAI's latest models, o3 and o4-mini, bring significant advances in image reasoning and agent capabilities, strengthening the AI programming ecosystem [3][4]
- Competition in the global large-model field remains intense, with strong emphasis on native agent capabilities and multimodal reasoning [34]
- The domestic AI computing power market is expected to see greater acceptance and market share for Chinese AI computing solutions amid ongoing global trade tensions [34]

Summary by Sections

OpenAI's New Models
- OpenAI released o3 and o4-mini, billed as its most intelligent models to date, featuring breakthroughs in image reasoning and agent capabilities [3][4]
- The o3 model set new state-of-the-art benchmarks in coding, mathematics, and visual perception, making 20% fewer errors than its predecessor o1 on complex tasks [5][7]
- The o4-mini model is optimized for fast, cost-effective reasoning and excels at non-STEM tasks and data science [5]

Doubao 1.5 Model
- Doubao 1.5 has reached or approaches the global top tier in reasoning across mathematics, coding, and science, with enhanced visual understanding [17][21]
- The Doubao APP, built on the Doubao 1.5 model, can "think while searching," providing detailed recommendations based on user needs [24][27]
- Doubao's daily token usage has surged past 12.7 trillion, indicating significant growth and market penetration [18]

Investment Recommendations
- The report suggests focusing on AI applications in enterprise services, programming, and office automation, as well as on domestic AI computing power companies [34]
- Recommended stocks in AI applications include Fanwei Network and Kingdee International; AI computing power recommendations include Haiguang Information and Inspur Information [34]