Multimodal Large Models
Multimodal models learn to "search on demand": 30% fewer searches and higher accuracy! New ByteDance & NTU research optimizes multimodal model search strategies
量子位· 2025-07-08 07:30
Contributed by the MMSearch-R1 team | QbitAI (official WeChat account: QbitAI). Multimodal models learn to "search on demand"! The latest research from ByteDance & NTU optimizes multimodal model search strategies: by building a web search tool, constructing a multimodal search dataset, and designing a simple yet effective reward mechanism, it makes the first attempt at training multimodal models for autonomous search via end-to-end reinforcement learning. The trained model can decide on its own when to search and what to search for, process the returned results, and carry out multiple rounds of on-demand search in a real internet environment. Experiments show that on knowledge-intensive Visual Question Answering (VQA) tasks, MMSearch-R1 delivers a clear advantage: it not only outperforms same-scale models under the traditional retrieval-augmented generation (RAG) workflow, but also matches the performance of larger models running traditional RAG while cutting the number of searches by roughly 30%. The rest of the article details the research methods and experimental findings. How exactly is this achieved? In recent years, as vision-language training datasets have grown in both scale and quality, Large Multimodal Models (LMMs) have shown excellent performance on cross-modal understanding tasks, with markedly stronger alignment between textual and visual knowledge. Real-world information, however, is highly dynamic and complex; ...
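The excerpt describes a "simple and effective reward mechanism" for end-to-end RL training of on-demand search but does not spell it out. The sketch below only illustrates the general idea of rewarding correct answers while charging a small cost per search call, so the policy learns to search only when its own knowledge is insufficient; the function and penalty value are hypothetical, not MMSearch-R1's actual reward.

```python
# Hypothetical sketch of an outcome reward that discourages unnecessary searches.
# The exact reward used by MMSearch-R1 is not given in the excerpt; this only
# illustrates the "accuracy minus search cost" idea described in the text.

def on_demand_search_reward(answer_correct: bool, num_searches: int,
                            search_penalty: float = 0.1) -> float:
    """Reward = correctness bonus minus a small cost per search call."""
    correctness = 1.0 if answer_correct else 0.0
    return correctness - search_penalty * num_searches

# Example: a correct answer reached with one search scores higher than a
# correct answer that needed three searches, pushing the policy to search
# only when needed.
print(on_demand_search_reward(True, 1))   # 0.9
print(on_demand_search_reward(True, 3))   # 0.7
print(on_demand_search_reward(False, 0))  # 0.0
```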
Z Tech | VAST, a global leader in multimodal large models, is hiring at top pay to define the technology paradigm of the next decade
Z Potentials· 2025-07-08 02:50
Group 1
- The company is currently recruiting a new batch of interns to enhance its workforce and bring in fresh talent [2]
- The company is seeking creative individuals from the post-00s generation to drive entrepreneurial initiatives [4]
- Z Potentials is a focus area for the company, indicating a strategic interest in developing new opportunities and innovations [5]
Complex spatial instructions understood in an instant? RoboRefer lets robots understand and reason about space, acting precisely even in the open world!
机器之心· 2025-07-06 06:06
Core Viewpoint
- The article discusses the development and capabilities of RoboRefer, a multimodal large model designed for spatial referring tasks in robotics, emphasizing its advanced spatial understanding and reasoning abilities.

Group 1: RoboRefer Model Overview
- RoboRefer is a multimodal large model with three-dimensional spatial understanding and reasoning capabilities, featuring independent image and depth encoders (see the sketch below) [12]
- The model can accurately answer various spatial perception questions and perform complex combinatorial reasoning over multiple spatial relationships [12][13]

Group 2: Training Techniques
- RoboRefer employs supervised fine-tuning (SFT) to enhance spatial perception and reinforcement fine-tuning (RFT) to improve generalizable reasoning [15][16]
- Training includes a process-based reward function that improves the quality of intermediate reasoning steps, leading to stronger multi-step reasoning abilities [17]

Group 3: Performance Metrics
- After SFT training, RoboRefer achieved an average success rate of 89.6% on spatial understanding tasks, setting a new state of the art [21]
- On the challenging spatial referring benchmark RefSpatial-Bench, RFT-trained RoboRefer outperformed all other models, surpassing Gemini-2.5-Pro by 17.4% in average accuracy [22]

Group 4: Dataset Development
- The research team created RefSpatial, a large-scale, high-quality dataset with 2.5 million samples and 20 million question-answer pairs, significantly larger than comparable datasets [20]
- RefSpatial features detailed multi-step reasoning processes and covers a wide range of everyday interaction scenarios, integrating 31 types of spatial relationships [20]

Group 5: Real-World Application
- RoboRefer can be flexibly integrated into various types of robots, such as UR5 robotic arms and G1 humanoid robots, enabling precise execution of complex, dynamic, multi-step tasks in real-world environments [9]
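The summary notes that RoboRefer uses independent image and depth encoders. The following is a minimal sketch, under assumed module names and dimensions, of how two such encoders might project into a shared language-model embedding space; it is not the actual RoboRefer architecture.

```python
# Minimal sketch of the "separate image and depth encoders" idea described above.
# The real RoboRefer architecture is not specified in the excerpt; module names
# and dimensions here are illustrative assumptions.
import torch
import torch.nn as nn

class DualEncoderFusion(nn.Module):
    def __init__(self, vis_dim=1024, depth_dim=1024, llm_dim=4096):
        super().__init__()
        self.image_proj = nn.Linear(vis_dim, llm_dim)    # projects RGB features
        self.depth_proj = nn.Linear(depth_dim, llm_dim)  # projects depth features

    def forward(self, image_tokens, depth_tokens):
        # Each modality is encoded independently, then projected into the
        # language model's embedding space and concatenated along the sequence.
        return torch.cat([self.image_proj(image_tokens),
                          self.depth_proj(depth_tokens)], dim=1)

fusion = DualEncoderFusion()
img = torch.randn(1, 256, 1024)    # 256 RGB patch tokens
dep = torch.randn(1, 256, 1024)    # 256 depth patch tokens
print(fusion(img, dep).shape)      # torch.Size([1, 512, 4096])
```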
Learning through play? Game-code-driven data synthesis boosts the general reasoning of multimodal large models
机器之心· 2025-07-04 08:59
Core Insights
- The article presents a novel approach called Code2Logic, which utilizes game code to synthesize multimodal reasoning data, enhancing the reasoning capabilities of visual language models (VLMs) [47][48]
- The research indicates that training AI on game scenarios can significantly improve its performance on geometric and graphical reasoning tasks [1][24]

Data and Model
- The scarcity of high-quality multimodal reasoning data limits the advancement of VLMs' complex reasoning abilities, prompting the need for a cost-effective method to generate such data [4]
- The research team from Fudan University and ByteDance proposes leveraging game code to automatically synthesize visual reasoning data, capitalizing on the structured nature of games [12][13]

Methodology
- The Code2Logic method involves three core steps: generating game code using large language models (LLMs), designing question-answer templates from the game code, and constructing an automated data engine to generate Q&A instances (a toy sketch of this pipeline follows below) [13][14][15]
- The GameQA dataset created through this method encompasses 30 games, 158 reasoning tasks, and 140,000 Q&A pairs, showcasing its scalability and diversity [18]

Training and Performance
- Training on GameQA data leads to significant performance improvements on both in-domain and out-of-domain tasks, demonstrating the generalization capabilities of models trained with this dataset [24][25]
- Models trained with GameQA outperform those trained on traditional geometric reasoning datasets, indicating the cognitive diversity and reasoning complexity inherent in game data [28][29]

Scaling Effects
- The research identifies two scaling effects: increased game variety enhances out-of-domain generalization, and sample diversity correlates positively with generalization performance [37][38]
- These findings suggest that the diversity and scalability of GameQA contribute to stronger generalization on reasoning tasks [39]

Limitations and Challenges
- The analysis highlights key limitations in VLMs' reasoning capabilities, particularly in 3D spatial perception, pattern recognition, and strategic planning [42][45]
- The study emphasizes the need for further improvements in models' abilities to handle complex reasoning tasks effectively [46]
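To make the "game code → Q&A template → automated instances" pipeline concrete, here is a toy sketch with an invented sliding-puzzle state and an invented template; the actual GameQA games and templates are not reproduced here. The property illustrated is that ground-truth answers are computed from the game state itself, so Q&A pairs can be synthesized at scale without manual annotation.

```python
# Hypothetical sketch of the "game code -> Q&A template -> automated instances"
# pipeline described above. The game (a tiny sliding-puzzle state) and the
# template are invented for illustration; they are not the actual GameQA tasks.
import random

def random_puzzle_state(size=3):
    """Generate a random board purely for demonstration (solvability is ignored)."""
    tiles = list(range(size * size))
    random.shuffle(tiles)
    return [tiles[i * size:(i + 1) * size] for i in range(size)]

QA_TEMPLATE = "In the {size}x{size} board {board}, which row contains the blank tile (0)?"

def instantiate_qa(size=3):
    board = random_puzzle_state(size)
    question = QA_TEMPLATE.format(size=size, board=board)
    answer = next(r for r, row in enumerate(board) if 0 in row)  # ground truth from code
    return {"question": question, "answer": f"Row {answer}"}

# Because answers are derived programmatically from the game state, correctness
# is guaranteed by construction, which is the core appeal of code-driven
# data synthesis.
print(instantiate_qa())
```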
Xiaomi Experienced & Campus Recruitment | Algorithm Researcher, Autonomous Driving and Robot Embodied Intelligence (VLA Direction)
具身智能之心· 2025-07-03 13:36
Job Description

We are looking for an outstanding researcher/scientist to join our frontier exploration team and jointly define and build the "brain" of next-generation autonomous driving and robotics. You will work on breakthrough research into an Embodied Foundation Model that deeply integrates vision-language-action (VLA) capabilities and offers exceptional spatial perception and spatial reasoning.

Core responsibilities include:
- Frontier algorithm research and construction: design and implement leading embodied multimodal large models. Your research will go beyond existing VLA frameworks to explore how to build a World Model that can understand the complex three-dimensional world and plan long-horizon, multi-step tasks.
- Core model capability breakthroughs: lead the model's breakthroughs in the following key capabilities.
  - Multimodal scene understanding: fuse vision, language, radar, and other multi-source information to achieve deep understanding and spatial perception of dynamic, open environments.
  - Complex semantic reasoning and decision-making: enable the model to understand vague, abstract human instructions and, combining spatial reasoning about the physical world, generate safe, reasonable, and interpretable action sequences.
  - Learning and adaptation mechanisms: research reinforcement learning (RL), imitation learning (IL), and self-supervised learning methods in depth, so the model can continuously learn and evolve from massive data and interaction with the environment.
- Technical vision and roadmap: lead the construction of a generalizable, high-efficiency embodied foundation model, providing core support for the next 1-3 years of technical evolution, and explore its unified application potential in autonomous driving and general robotics. ...
vivo cracks the on-device AI deployment problem, bypassing MoE architecture limitations to run smoothly on Snapdragon 8 Elite | ICCV 2025
量子位· 2025-07-03 09:00
Core Viewpoint
- The article emphasizes the importance of deploying large models on mobile devices, particularly focusing on maintaining pure language capabilities while integrating multimodal functionalities.

Group 1: Challenges in Current MLLM Deployment
- Existing mobile large language models (MLLMs) face significant challenges, including a drop of over 10% in pure language task accuracy when supporting multimodal functions [3][4][6]
- Current mobile NPU platforms do not support the Mixture of Experts (MoE) architecture, which is commonly used to maintain language capabilities during multimodal training [7][8]

Group 2: GenieBlue Contributions and Technical Highlights
- GenieBlue retains original language capabilities during multimodal training by freezing the original LLM parameters and introducing replicated Transformer layers along with lightweight LoRA modules (a generic LoRA sketch follows below) [3][19]
- Through extensive fine-tuning, GenieBlue achieves multimodal capabilities comparable to mainstream MLLMs while fully preserving original pure language performance [3][19]
- GenieBlue avoids the MoE architecture limitations by employing a non-shared base inference strategy, enabling smooth operation on devices with the Qualcomm Snapdragon 8 Elite (4th generation) chip [3][19]

Group 3: Training Data and Model Structure Analysis
- The article discusses the limitations of simply adding pure text data to maintain language capabilities, highlighting the challenges of collecting high-quality data and the increased training time [9][12]
- Adding pure text data has limited impact on multimodal capabilities, and while it helps on objective NLP tasks, it does not significantly aid subjective tasks [11][12]

Group 4: GenieBlue Design and Deployment
- GenieBlue's design is based on the CogVLM structure, focusing on separating text and multimodal information processing while avoiding the MoE architecture [19][21]
- The deployment strategy involves freezing the original LLM during training and using a non-shared base approach, which effectively maintains the original language model's performance [24][26]
- GenieBlue has been validated for its multimodal and pure language accuracy, demonstrating competitive performance while being efficient for mobile NPU deployment [30][31][35]

Group 5: Performance and Efficiency
- GenieBlue's multimodal accuracy is slightly lower than Qwen2.5-VL-3B but retains approximately 97% of BlueLM-V-3B's performance [31]
- In terms of pure language accuracy, GenieBlue shows no decline, in contrast to Qwen2.5-VL-3B, which experiences performance degradation on subjective tasks [33]
- The deployment efficiency of GenieBlue on Snapdragon 8 Elite shows that while there is a slight increase in loading time and memory requirements, it meets the daily usage needs of mobile devices with a speed of 30 tokens per second [35]
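The summary mentions freezing the original LLM parameters and adding lightweight LoRA modules to preserve pure-language capability. Below is a generic LoRA sketch, not vivo's implementation, showing how a frozen base linear layer is augmented with a trainable low-rank correction so the original weights, and hence the original language behavior, stay untouched.

```python
# Generic sketch of the "freeze the original LLM, train lightweight LoRA
# modules" recipe summarized above. This is not vivo's code; it only
# illustrates how LoRA adds trainable capacity while keeping the frozen
# weights intact.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():               # original weights stay frozen
            p.requires_grad_(False)
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)             # adapter starts as a no-op
        self.scale = alpha / rank

    def forward(self, x):
        # Frozen path plus a low-rank trainable correction.
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

layer = LoRALinear(nn.Linear(4096, 4096))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # only the low-rank adapter parameters are trainable
```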
ICML 2025 Oral work upgraded again! Shanghai AI Lab, together with Fudan University and CUHK, releases VideoRoPE++, the best tool yet for understanding longer videos
机器之心· 2025-07-03 03:26
Core Viewpoint
- The article discusses the development of VideoRoPE++, an advanced video position embedding strategy that effectively models spatiotemporal relationships, outperforming previous RoPE variants on various video-related tasks [4][7][34]

Background
- The challenge of extending one-dimensional RoPE to the complex spatiotemporal structure of videos remains unresolved, despite the widespread adoption of RoPE for its long-context processing capabilities [3]

Analysis
- VideoRoPE++ is designed to prioritize temporal modeling through low-frequency time allocation (LTA), reducing oscillations and ensuring robustness. It employs a diagonal layout to maintain spatial symmetry and introduces adjustable time intervals (ATS) to control time spacing [15][26]

VideoRoPE++ Design
- VideoRoPE++ incorporates several key features:
  - Low-frequency time allocation (LTA) to mitigate oscillations and ensure robustness [16]
  - Adjustable time intervals (ATS) to align visual and textual markers in time (a toy position-index sketch follows below) [24]
  - YaRN-V, a method for extrapolating beyond training ranges while maintaining spatial structure [26]

Experimental Results
- On long video retrieval tasks, VideoRoPE++ consistently outperformed other RoPE variants, demonstrating superior robustness [28]
- On long video understanding tasks, VideoRoPE++ showed significant improvements over baseline methods, highlighting its ability to capture long-distance dependencies [30]
- The extrapolation method YaRN-V achieved a score of 81.33 on the V-RULER benchmark, significantly outperforming traditional position encoding schemes [32][33]

Conclusion
- The article identifies four critical standards for effective position encoding: 2D/3D structure, frequency allocation, spatial symmetry, and time index scaling. VideoRoPE++ meets these standards and excels at long video retrieval, understanding, and hallucination tasks compared to other RoPE variants [34]
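The adjustable-time-interval (ATS) idea can be illustrated with a toy position-index function: each video token receives a (t, x, y) index and the temporal index is stretched by a tunable scale. The exact VideoRoPE++ formulation, including the diagonal layout and low-frequency allocation, is not reproduced here; the function and parameter names below are assumptions for illustration only.

```python
# Minimal sketch of the "adjustable time interval" idea: each video token gets
# a 3D position (t, x, y) and the temporal index is stretched by a tunable
# scale delta so that time advances faster than spatial coordinates. This is
# not the actual VideoRoPE++ formulation; it only conveys the spirit of
# prioritizing temporal modeling.
def video_position_ids(num_frames, height, width, delta=2.0, text_offset=0):
    """Return a (t, x, y) index triple for every visual token."""
    positions = []
    for f in range(num_frames):
        for y in range(height):
            for x in range(width):
                t = text_offset + delta * f   # scaled temporal index (ATS-style)
                positions.append((t, x, y))
    return positions

# With delta > 1, temporal distance between frames is emphasized relative to
# spatial distance within a frame.
print(video_position_ids(num_frames=2, height=2, width=2, delta=2.0)[:4])
```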
In the era of large models, where are vision generalist models headed?
机器之心· 2025-07-02 00:54
Core Viewpoint
- The article discusses the evolution of Vision Generalist Models (VGM) in the context of the rise of multimodal large models, emphasizing the need for a distinct focus on visual data despite the shift towards integrating visual modalities with language models [1][2]

Group 1: VGM Overview
- VGM aims to create a unified framework capable of handling various visual tasks and modalities, similar to the success of large language models in natural language processing [7]
- VGM's key capability is its ability to process multimodal inputs, including images, point clouds, and videos, through a shared representation method [7][8]
- The model supports multiple visual tasks simultaneously, allowing for parallel processing within a single framework [8]

Group 2: Data, Tasks, and Evaluation
- VGM utilizes large and diverse datasets for training and evaluation, covering various types of visual data to support multimodal learning [9]
- Visual tasks are categorized into four types: image tasks, geometric tasks, time series tasks, and other visual-related tasks [9]
- Modern evaluation methods focus on cross-task generalization and multimodal processing capabilities, differing from traditional single-task assessments [9]

Group 3: Model Design Paradigms
- Existing VGM design paradigms focus on unifying different visual modality inputs and diverse task outputs, primarily categorized into encoding-based frameworks and sequence-to-sequence frameworks (an illustrative encoding-based sketch follows below) [12][13]
- Encoding-based frameworks create a shared feature space for different input modalities, while sequence-to-sequence frameworks are suitable for tasks with variable-length inputs and outputs [12][13]

Group 4: Current Progress and Future Directions
- Current VGM research has made significant progress in unified processing of multiple tasks and modalities but faces challenges in optimizing framework design and improving training efficiency [16]
- Data acquisition and annotation remain bottlenecks for VGM development, with future research likely focusing on automated annotation techniques and large-scale unsupervised learning methods [16]
- Despite challenges, VGM shows extensive potential in practical applications, extending beyond traditional visual tasks to complex multimodal tasks across fields such as intelligent surveillance, autonomous driving, and robotics [16]
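As a concrete illustration of the encoding-based paradigm mentioned above, the sketch below maps several visual modalities into one shared feature space consumed by a single backbone; all module names and dimensions are assumptions rather than any specific VGM implementation.

```python
# Illustrative sketch of the "encoding-based framework" paradigm: each visual
# modality gets its own encoder, but all encoders project into one shared
# feature space processed by a single backbone. Module names and dimensions
# are assumptions, not any specific VGM implementation.
import torch
import torch.nn as nn

class EncodingBasedVGM(nn.Module):
    def __init__(self, shared_dim=512):
        super().__init__()
        self.encoders = nn.ModuleDict({
            "image": nn.Linear(768, shared_dim),       # stand-in image encoder
            "point_cloud": nn.Linear(3, shared_dim),   # stand-in point-cloud encoder
            "video": nn.Linear(768, shared_dim),       # stand-in video encoder
        })
        self.backbone = nn.TransformerEncoderLayer(d_model=shared_dim, nhead=8,
                                                   batch_first=True)

    def forward(self, modality: str, tokens: torch.Tensor):
        shared = self.encoders[modality](tokens)  # map input into the shared space
        return self.backbone(shared)              # one backbone serves all modalities

model = EncodingBasedVGM()
print(model("point_cloud", torch.randn(1, 1024, 3)).shape)  # 1024 points -> shared tokens
```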