Multimodal Large Models

Huatai Securities Morning Brief - 20250710
HTSC · 2025-07-10 01:44
Core Insights
- The report highlights a potential narrowing of the decline in PPI in the second half of 2025, with June CPI showing a slight improvement to 0.1% year-on-year, compared to a previous value of -0.1% [2]
- Global manufacturing PMI has rebounded above the growth line, indicating an overall recovery in manufacturing activity, particularly in developed economies [2]
- The report emphasizes the importance of monitoring the performance of various sectors, particularly those expected to benefit from the "anti-involution" policies and improving economic conditions [4]

Macroeconomic Overview
- June CPI in China improved to 0.1% year-on-year, while PPI decreased by 3.6% year-on-year, indicating a mixed inflationary environment [2]
- Global manufacturing PMI showed a notable increase, with developed markets improving while some emerging markets like Vietnam and Indonesia showed marginal declines [2]

Sector Analysis

Fixed Income
- The report discusses the impact of "anti-involution" policies on PPI and CPI, suggesting a potential stabilization in prices, with CPI expected to rise slightly to around 0.5% by Q4 2025 [5]
- The report notes that the demand side remains critical for price elasticity, with industry self-discipline and private enterprise willingness being key factors [5]

Machinery and Equipment
- The report indicates a recovery in excavator sales, with June sales reaching 18,800 units, a year-on-year increase of 13.3%, driven by strong export growth [8]
- The growth in second-hand excavator exports is expected to stimulate domestic replacement demand, benefiting leading companies in the sector [8]

Agriculture
- The report highlights ongoing "anti-involution" efforts in the pig farming industry, which may lead to inventory release and improved profitability for high-quality pig farming companies [9]
- The report suggests that the pig farming sector may gradually transition to a phase of high-quality competition, with recommendations for companies like Muyuan Foods and Wens Foodstuffs [9]

Renewable Energy and Equipment
- The report anticipates strong growth for offshore wind energy, with a significant increase in orders expected to drive performance for leading companies in the sector [19]
- The report emphasizes the importance of technological advancements and capacity expansion in the offshore wind sector [19]

Electronics and Chemicals
- The report forecasts a substantial increase in net profit for Shengquan Group in the first half of 2025, driven by strong demand for electronic materials [20]
- The report maintains a positive outlook on the company's growth trajectory, supported by favorable market conditions [20]

Company-Specific Insights
- Zhaojin Mining is rated as a "buy" with a target price of 23.44 HKD, driven by expected production growth and favorable gold price trends [15]
- Harbin Electric is also rated as a "buy," with anticipated recovery in equipment demand across various energy sectors [15]
- MGM China is highlighted for its strong performance in the non-gaming segment, benefiting from increased tourist traffic and successful entertainment events [17]
Special Forum on the Frontiers of Pattern Recognition and Artificial Intelligence Held
Huan Qiu Wang Zi Xun · 2025-07-09 08:43
Group 1
- The forum focused on national strategic needs and technological frontiers in the fields of pattern recognition and artificial intelligence, gathering nearly 20 experts and representatives from renowned universities, research institutes, and leading enterprises in China [1][3]
- The event aimed to foster the cultivation of new productive forces and interdisciplinary integration, injecting new momentum into scientific research innovation and the collaborative development of academic journals [1]

Group 2
- Various professors presented specialized reports, including topics such as "3D/4D content creation for arbitrary sparse data," "embodied intelligent robots with emotional intelligence," and "visual perception in unmanned systems" [5][7][11]
- A roundtable discussion was held, focusing on new trends and challenges in multimodal large models and generative artificial intelligence, addressing the transformation of research paradigms and talent cultivation in the era of large models [15]
Multimodal models learn to "search on demand", with about 30% fewer searches and higher accuracy! New ByteDance & NTU research optimizes multimodal models' search strategy
量子位 · 2025-07-08 07:30
Contributed by the MMSearch-R1 team
量子位 | WeChat account QbitAI

Multimodal models have learned to "search on demand"! The latest research from ByteDance & NTU optimizes multimodal model search strategies: by building a web search tool, constructing a multimodal search dataset, and designing a simple yet effective reward mechanism, it makes the first attempt at training multimodal models for autonomous search via end-to-end reinforcement learning.

The trained model can decide on its own when to search and what to search for, process the search results, and carry out multiple rounds of on-demand search in a real internet environment.

Experimental results show that on knowledge-intensive Visual Question Answering (VQA) tasks, the MMSearch-R1 system demonstrates clear advantages: it not only outperforms same-scale models under a traditional retrieval-augmented generation (RAG) workflow, but also matches the performance of larger models running traditional RAG while making roughly 30% fewer search calls.

The research method and experimental findings are analyzed in detail below.

How exactly is this achieved? In recent years, as vision-language training datasets have improved in both scale and quality, Large Multimodal Models (LMMs) have demonstrated excellent performance on cross-modal understanding tasks, with markedly stronger alignment between textual and visual knowledge. However, real-world information is highly dynamic and complex, and a single ...
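The summary above does not spell out the exact reward design. Purely as an illustrative Python sketch, where the function name, the penalty value, and the linear correctness-minus-cost form are all assumptions rather than the paper's actual design, an on-demand-search reward that favors correct answers reached with as few search calls as possible could look like this:

```python
# Hypothetical on-demand-search reward: reward answer correctness, charge a
# small cost per search call so the policy searches only when it helps.
# The penalty value and linear form are assumptions, not the paper's design.

def search_reward(answer_correct: bool, num_searches: int,
                  search_penalty: float = 0.1) -> float:
    """Score one rollout: 1.0 for a correct answer, minus a per-search cost."""
    base = 1.0 if answer_correct else 0.0
    return base - search_penalty * num_searches

# Correct-without-search > correct-with-two-searches > incorrect-without-search,
# which is the ordering a policy needs in order to learn "search on demand".
assert search_reward(True, 0) > search_reward(True, 2) > search_reward(False, 0)
```

Under a reward shaped this way, reinforcement learning pressures the model to answer from internal knowledge whenever it can, which is consistent with the reported ~30% reduction in search calls.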
Z Tech | VAST, a global leader in multimodal large models, is hiring at top salaries to define the technology paradigm of the next decade
Z Potentials · 2025-07-08 02:50
Group 1
- The company is currently recruiting a new batch of interns to enhance its workforce and bring in fresh talent [2]
- The company is seeking creative individuals from the post-00s generation to drive entrepreneurial initiatives [4]
- Z Potentials is a focus area for the company, indicating a strategic interest in developing new opportunities and innovations [5]
Complex spatial instructions understood in an instant? RoboRefer lets robots understand and reason about space, acting precisely even in the open world!
机器之心 · 2025-07-06 06:06
Core Viewpoint
- The article discusses the development and capabilities of RoboRefer, a multimodal large model designed for spatial referring tasks in robotics, emphasizing its advanced spatial understanding and reasoning abilities.

Group 1: RoboRefer Model Overview
- RoboRefer is a multimodal large model with three-dimensional spatial understanding and reasoning capabilities, featuring independent image and depth encoders [12]
- The model can accurately answer various spatial perception questions and perform complex combinatorial reasoning over multiple spatial relationships [12][13]

Group 2: Training Techniques
- RoboRefer employs full-parameter supervised fine-tuning (SFT) to enhance spatial perception and reinforcement learning fine-tuning (RFT) to improve generalizable reasoning [15][16]
- Training includes a process-based reward function that improves the quality of intermediate reasoning, leading to stronger multi-step reasoning abilities (a hedged reward sketch follows this summary) [17]

Group 3: Performance Metrics
- After SFT training, RoboRefer achieved an average success rate of 89.6% on spatial understanding tasks, setting a new state of the art [21]
- On the high-difficulty spatial referring benchmark RefSpatial-Bench, RFT-trained RoboRefer outperformed all other models, surpassing Gemini-2.5-Pro by 17.4% in average accuracy [22]

Group 4: Dataset Development
- The research team created a large-scale, high-quality dataset called RefSpatial, which includes 2.5 million samples and 20 million question-answer pairs, significantly larger than similar datasets [20]
- RefSpatial features detailed multi-step reasoning processes and covers a wide range of everyday interaction scenarios, integrating 31 types of spatial relationships [20]

Group 5: Real-World Application
- RoboRefer can be flexibly integrated into various types of robots, such as UR5 robotic arms and G1 humanoid robots, enabling precise execution of complex, dynamic, multi-step tasks in real-world environments [9]
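The summary mentions a process-based reward used during RFT but gives no formula. Purely as a hedged illustration, where the function name, the 0.5 weighting, and the mean over steps are assumptions and not RoboRefer's actual reward, one common way to blend step-level process scores with final-answer correctness is:

```python
# Hypothetical blend of a process reward (quality of intermediate spatial
# reasoning steps) with an outcome reward (final answer correctness).
# The weighting and the mean over steps are illustrative assumptions.

from typing import List

def rft_reward(step_scores: List[float], final_correct: bool,
               process_weight: float = 0.5) -> float:
    """step_scores: per-step quality scores in [0, 1], e.g. from a rule-based
    checker over the intermediate reasoning chain; final_correct: whether the
    final spatial referring answer is right."""
    process = sum(step_scores) / len(step_scores) if step_scores else 0.0
    outcome = 1.0 if final_correct else 0.0
    return process_weight * process + (1.0 - process_weight) * outcome

# A fully sound chain ending in a correct answer scores 1.0; a correct answer
# reached through sloppy intermediate steps scores lower.
print(rft_reward([1.0, 1.0, 1.0], True))   # 1.0
print(rft_reward([0.2, 0.4, 0.3], True))   # 0.65
```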
Learning by playing? Game-code-driven data synthesis improves the general reasoning of multimodal large models
机器之心 · 2025-07-04 08:59
Core Insights
- The article presents a novel approach called Code2Logic, which utilizes game code to synthesize multimodal reasoning data, enhancing the reasoning capabilities of visual language models (VLMs) [47][48]
- The research indicates that training AI using game scenarios can significantly improve its performance on geometric and graphical reasoning tasks [1][24]

Data and Model
- The scarcity of high-quality multimodal reasoning data limits the advancement of VLMs' complex reasoning abilities, prompting the need for a cost-effective method to generate such data [4]
- The research team from Fudan University and ByteDance proposes leveraging game code to automatically synthesize visual reasoning data, capitalizing on the structured nature of games [12][13]

Methodology
- The Code2Logic method involves three core steps: generating game code using large language models (LLMs), designing question-answer templates from the game code, and constructing an automated data engine to generate Q&A instances (a toy sketch follows this summary) [13][14][15]
- The GameQA dataset created through this method encompasses 30 games, 158 reasoning tasks, and 140,000 Q&A pairs, showcasing its scalability and diversity [18]

Training and Performance
- Training on GameQA data leads to significant performance improvements on both in-domain and out-of-domain tasks, demonstrating the generalization capabilities of models trained with this dataset [24][25]
- Models trained with GameQA outperform those trained on traditional geometric reasoning datasets, indicating the cognitive diversity and reasoning complexity inherent in game data [28][29]

Scaling Effects
- The research identifies two scaling effects: increased game variety enhances out-of-domain generalization, and sample diversity correlates positively with generalization performance [37][38]
- These findings suggest that the diversity and scalability of GameQA contribute to stronger generalization on reasoning tasks [39]

Limitations and Challenges
- The analysis highlights key limitations in VLMs' reasoning capabilities, particularly in 3D spatial perception, pattern recognition, and strategic planning [42][45]
- The study emphasizes the need for further improvements in models' abilities to handle complex reasoning tasks effectively [46]
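As a toy illustration of why game code suits data synthesis (the game, the template, and every name below are stand-ins, not part of GameQA): game rules are executable, so the correct answer to a templated question can be computed, rather than hand-annotated, for any sampled state.

```python
# Toy version of the Code2Logic pipeline: sample a verifiable game state,
# instantiate a question template, and compute the ground-truth answer from
# the game's own rules. Tic-tac-toe stands in for the 30 GameQA games.

import random

def winner(board):
    """Return 'X', 'O', or None for a 3x3 board of 'X'/'O'/None cells."""
    lines = [board[i] for i in range(3)]                           # rows
    lines += [[board[r][c] for r in range(3)] for c in range(3)]   # columns
    lines += [[board[i][i] for i in range(3)],
              [board[i][2 - i] for i in range(3)]]                 # diagonals
    for line in lines:
        if line[0] is not None and line.count(line[0]) == 3:
            return line[0]
    return None

def synthesize_qa(seed):
    """One (question, answer) pair from a randomly sampled board state."""
    rng = random.Random(seed)
    board = [[rng.choice(['X', 'O', None]) for _ in range(3)] for _ in range(3)]
    question = f"Given the tic-tac-toe board {board}, which player, if any, has won?"
    return question, winner(board) or "no one"

for seed in range(3):  # the data engine scales by iterating seeds and templates
    q, a = synthesize_qa(seed)
    print(q, "->", a)
```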
Xiaomi Experienced & Campus Recruitment | Algorithm Researcher, Autonomous Driving & Robot Embodied Intelligence (VLA Direction)
具身智能之心 · 2025-07-03 13:36
Job Description
We are looking for an outstanding researcher/scientist to join our frontier exploration team and jointly define and build the next-generation "brain" for autonomous driving and robotics. You will work on breakthrough research into an Embodied Foundation Model that deeply fuses vision-language-action (VLA) capabilities and possesses excellent spatial perception and spatial reasoning abilities.

Core responsibilities include:
- Frontier algorithm research and construction: design and implement leading embodied multimodal large models. Your research will not be limited to existing VLA frameworks; it will also explore how to build a World Model that can understand the complex three-dimensional world and plan long-horizon, multi-step tasks.
- Breakthroughs in core model capabilities: lead the model's breakthroughs in the following key areas:
  - Multimodal scene understanding: fuse vision, language, radar, and other multi-source information to achieve deep understanding and spatial perception of dynamic, open environments.
  - Learning and adaptation mechanisms: study reinforcement learning (RL), imitation learning (IL), and self-supervised learning methods in depth, so the model can continuously learn and evolve from massive data and from interaction with the environment.
- Technical vision and roadmap: lead the construction of a generalizable, high-efficiency embodied foundation model, provide core support for technology evolution over the next 1-3 years, and explore its potential for unified application across autonomous driving and general robotics.
- Complex semantic reasoning and decision-making: enable the model to understand vague, abstract human instructions and combine them with ...
vivo overcomes mobile AI deployment challenges, bypassing MoE architecture limitations to run smoothly on Snapdragon 8 Elite | ICCV 2025
量子位 · 2025-07-03 09:00
Core Viewpoint
- The article emphasizes the importance of deploying large models on mobile devices, focusing on preserving pure language capabilities while integrating multimodal functionality.

Group 1: Challenges in Current MLLM Deployment
- Existing mobile multimodal large language models (MLLMs) face significant challenges, including a drop of over 10% in pure-language task accuracy when supporting multimodal functions [3][4][6]
- Current mobile NPU platforms do not support the Mixture of Experts (MoE) architecture commonly used to maintain language capabilities during multimodal training [7][8]

Group 2: GenieBlue Contributions and Technical Highlights
- GenieBlue retains original language capabilities during multimodal training by freezing the original LLM parameters and introducing replicated Transformer layers along with lightweight LoRA modules (see the sketch after this summary) [3][19]
- Through extensive fine-tuning, GenieBlue achieves multimodal capabilities comparable to mainstream MLLMs while fully preserving original pure-language performance [3][19]
- GenieBlue avoids the MoE architecture's limitations by employing a non-shared-base inference strategy, enabling smooth operation on devices with Qualcomm Snapdragon 8 Elite (4th generation) chips [3][19]

Group 3: Training Data and Model Structure Analysis
- Simply adding pure-text data to maintain language capabilities has limits, given the difficulty of collecting high-quality data and the increased training time [9][12]
- Adding pure-text data has limited impact on multimodal capabilities; while it helps on objective NLP tasks, it does not significantly aid subjective tasks [11][12]

Group 4: GenieBlue Design and Deployment
- GenieBlue's design is based on the CogVLM structure, separating text and multimodal information processing while avoiding the MoE architecture [19][21]
- The deployment strategy freezes the original LLM during training and uses a non-shared-base approach, effectively preserving the original language model's performance [24][26]
- GenieBlue has been validated for both multimodal and pure-language accuracy, demonstrating competitive performance while remaining efficient for mobile NPU deployment [30][31][35]

Group 5: Performance and Efficiency
- GenieBlue's multimodal accuracy is slightly lower than Qwen2.5-VL-3B but retains approximately 97% of BlueLM-V-3B's performance [31]
- In pure-language accuracy, GenieBlue shows no decline, in contrast to Qwen2.5-VL-3B, which degrades on subjective tasks [33]
- On Snapdragon 8 Elite, GenieBlue incurs a slight increase in loading time and memory requirements but meets daily mobile usage needs at 30 tokens per second [35]
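As a minimal PyTorch sketch of the freeze-plus-LoRA recipe described above (module names, rank, and scaling are illustrative assumptions; this is not vivo's actual code): the base weights stay fixed while only small low-rank adapters receive gradients, which is what lets the original pure-language behavior survive multimodal training.

```python
# Minimal sketch: freeze a base linear layer and train only a low-rank (LoRA)
# update on top of it. Rank and alpha values are illustrative assumptions.

import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank residual."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():     # freeze the original weights
            p.requires_grad = False
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)   # adapter starts as a no-op
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

layer = LoRALinear(nn.Linear(512, 512))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable: {trainable} / {total}")   # only the two LoRA matrices train
```

Because the frozen base path is untouched, pure-text prompts can in principle be routed through the original weights alone at inference time, which is the intuition behind the non-shared-base strategy summarized above.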
ICML 2025 Oral work upgraded again! Shanghai AI Lab, together with Fudan University and CUHK, releases VideoRoPE++, the best tool for longer video understanding
机器之心 · 2025-07-03 03:26
Core Viewpoint
- The article discusses the development of VideoRoPE++, an advanced video position embedding strategy that effectively models spatiotemporal relationships, outperforming previous RoPE variants on various video-related tasks [4][7][34]

Background
- Extending one-dimensional RoPE to the complex spatiotemporal structure of video remains unresolved, despite RoPE's widespread adoption for its long-context processing capabilities [3]

Analysis
- VideoRoPE++ prioritizes temporal modeling through low-frequency time allocation (LTA), reducing oscillations and ensuring robustness. It employs a diagonal layout to maintain spatial symmetry and introduces adjustable time intervals (ATS) to control time spacing [15][26]

VideoRoPE++ Design
- VideoRoPE++ incorporates several key features (see the sketch after this summary):
  - Low-frequency time allocation (LTA) to mitigate oscillations and ensure robustness [16]
  - Adjustable time intervals (ATS) to align visual and textual tokens in time [24]
  - YaRN-V, a method for extrapolating beyond training ranges while maintaining spatial structure [26]

Experimental Results
- On long-video retrieval tasks, VideoRoPE++ consistently outperformed other RoPE variants, demonstrating superior robustness [28]
- On long-video understanding tasks, VideoRoPE++ showed significant improvements over baseline methods, highlighting its ability to capture long-distance dependencies [30]
- The extrapolation method YaRN-V scored 81.33 on the V-RULER benchmark, significantly outperforming traditional position encoding schemes [32][33]

Conclusion
- The article identifies four criteria for effective position encoding: 2D/3D structure, frequency allocation, spatial symmetry, and time index scaling. VideoRoPE++ meets these criteria and excels at long-video retrieval, understanding, and hallucination tasks compared to other RoPE variants [34]
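To make the 3D indexing concrete, here is an illustrative sketch (the function name, stride value, and index layout are assumptions; it models only the adjustable-time-interval idea, not LTA, the diagonal layout, or YaRN-V) that assigns each video token a (time, height, width) position index with a stretched temporal stride:

```python
# Illustrative 3D position indices for video tokens: one (t, y, x) triple per
# patch, with the temporal axis stretched by time_scale as a simplified
# stand-in for adjustable time intervals. Not the paper's implementation.

import torch

def video_position_ids(num_frames: int, h: int, w: int,
                       time_scale: float = 2.0) -> torch.Tensor:
    """Return a (num_frames * h * w, 3) tensor of (t, y, x) indices, where
    frame-to-frame distance exceeds patch-to-patch distance by time_scale."""
    t = torch.arange(num_frames, dtype=torch.float) * time_scale
    y = torch.arange(h, dtype=torch.float)
    x = torch.arange(w, dtype=torch.float)
    return torch.cartesian_prod(t, y, x)  # every (t, y, x) combination

pos = video_position_ids(num_frames=2, h=2, w=2)
print(pos.shape)   # torch.Size([8, 3])
print(pos[:4])     # the first frame's four patches, all at t = 0.0
```

Each triple would then drive three groups of rotary frequencies, one per axis, instead of the single 1D index used in text-only RoPE.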
Google launches the Gemini Robotics On-Device large model, Kuaishou open-sources the keye-VL multimodal model: AI news roundup
China Post Securities · 2025-07-02 13:08
- The report does not contain any quantitative models or factors related to financial engineering or quantitative analysis [1][2][3]