Multimodal Large Models
Learning by Playing? Game-Code-Driven Data Synthesis Boosts General Reasoning in Multimodal Large Models
机器之心· 2025-07-04 08:59
Core Insights
- The article presents Code2Logic, a novel approach that uses game code to synthesize multimodal reasoning data and strengthen the reasoning capabilities of vision-language models (VLMs) [47][48].
- The research indicates that training on game scenarios can significantly improve performance on geometric and graphical reasoning tasks [1][24].

Data and Model
- The scarcity of high-quality multimodal reasoning data limits the advancement of VLMs' complex reasoning abilities, motivating a cost-effective way to generate such data [4].
- The research team from Fudan University and ByteDance proposes leveraging game code to automatically synthesize visual reasoning data, capitalizing on the structured nature of games [12][13].

Methodology
- The Code2Logic method involves three core steps: generating game code with large language models (LLMs), designing question-answer templates from that game code, and building an automated data engine that instantiates the templates into Q&A examples (a minimal pipeline sketch follows this summary) [13][14][15].
- The resulting GameQA dataset covers 30 games, 158 reasoning tasks, and 140,000 Q&A pairs, showcasing its scalability and diversity [18].

Training and Performance
- Training on GameQA data yields significant performance improvements on both in-domain and out-of-domain tasks, demonstrating the generalization capabilities of models trained with this dataset [24][25].
- Models trained with GameQA outperform those trained on traditional geometric reasoning datasets, reflecting the cognitive diversity and reasoning complexity inherent in game data [28][29].

Scaling Effects
- The research identifies two scaling effects: increasing game variety enhances out-of-domain generalization, and sample diversity correlates positively with generalization performance [37][38].
- These findings suggest that the diversity and scalability of GameQA contribute to stronger generalization in reasoning tasks [39].

Limitations and Challenges
- The analysis highlights key limitations in VLMs' reasoning capabilities, particularly in 3D spatial perception, pattern recognition, and strategic planning [42][45].
- The study emphasizes the need for further improvements in models' ability to handle complex reasoning tasks [46].
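To make the three-step pipeline concrete, here is a minimal, hypothetical sketch of how game code, a question-answer template, and an automated data engine could be combined to emit Q&A instances. The game logic, template wording, and function names are illustrative assumptions, not the Code2Logic implementation.

```python
import random

# Step 1 (stand-in): game code, which the paper generates with an LLM.
def drop_piece(board, col, player):
    """Gravity-style drop into a column, as in Connect-Four-like grid games."""
    for row in range(len(board) - 1, -1, -1):
        if board[row][col] == ".":
            board[row][col] = player
            return row
    return -1  # column already full

# Step 2: a question-answer template derived from the game rules.
QA_TEMPLATE = (
    "In a {rows}x{cols} grid with gravity, shown row by row as {state}, "
    "player 'O' drops a piece into column {col}. "
    "Which row does it land in (0 = top row)?"
)

# Step 3: automated data engine - instantiate the template with random states
# and compute the ground-truth answer by executing the game code.
def generate_qa(num_samples=3, rows=4, cols=5, seed=0):
    random.seed(seed)
    samples = []
    for _ in range(num_samples):
        board = [["." for _ in range(cols)] for _ in range(rows)]
        col = random.randrange(cols)
        for _ in range(random.randrange(rows)):   # pre-fill so answers vary
            drop_piece(board, col, "X")
        state = ["".join(r) for r in board]
        answer = drop_piece(board, col, "O")       # ground truth from game logic
        samples.append({
            "question": QA_TEMPLATE.format(rows=rows, cols=cols, state=state, col=col),
            "answer": answer,
        })
    return samples

if __name__ == "__main__":
    for qa in generate_qa():
        print(qa)
```

Because the answers are computed directly from the game logic, every instance comes with a verifiable ground truth, which is the property that makes game code attractive as a data source.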
Xiaomi Experienced & Campus Hiring | Embodied Intelligence Algorithm Researcher for Autonomous Driving and Robotics (VLA Track)
具身智能之心· 2025-07-03 13:36
Job Description
We are looking for an outstanding researcher/scientist to join our frontier exploration team and help define and build the "brain" of next-generation autonomous driving and robotics. You will work on breakthrough research into an Embodied Foundation Model that deeply integrates vision-language-action (VLA) capabilities and offers exceptional spatial perception and spatial reasoning.

Core responsibilities include
- Frontier algorithm research and development: Design and implement leading embodied multimodal large models. Your research will go beyond existing VLA frameworks to explore how to build World Models that can understand the complex three-dimensional world and perform long-horizon, multi-step task planning.
- Core model capability breakthroughs: Lead breakthroughs in the following key capabilities:
  - Multimodal scene understanding: Fuse vision, language, radar, and other multi-source information to achieve deep understanding and spatial perception of dynamic, open environments.
  - Complex semantic reasoning and decision-making: Enable the model to understand vague, abstract human instructions and, combining them with spatial reasoning about the physical world, generate safe, reasonable, and interpretable action sequences.
  - Learning and adaptation mechanisms: Research reinforcement learning (RL), imitation learning (IL), and self-supervised learning so the model can continuously learn and evolve from massive data and interaction with its environment.
- Technical vision and roadmap: Lead the construction of a generalizable, highly efficient embodied foundation model that underpins the technology roadmap for the next 1-3 years, and explore its potential for unified application across autonomous driving and general-purpose robotics.
vivo Cracks the On-Device AI Deployment Problem, Sidestepping MoE Architecture Limits to Run Smoothly on Snapdragon 8 Elite | ICCV 2025
量子位· 2025-07-03 09:00
Core Viewpoint
- The article emphasizes the importance of deploying large models on mobile devices, focusing on preserving pure-language capabilities while adding multimodal functionality.

Group 1: Challenges in Current MLLM Deployment
- Existing multimodal large language models (MLLMs) deployed on phones face significant challenges, including a drop of over 10% in pure-language task accuracy once multimodal functions are supported [3][4][6].
- Current mobile NPU platforms do not support the Mixture of Experts (MoE) architecture that is commonly used to preserve language capabilities during multimodal training [7][8].

Group 2: GenieBlue Contributions and Technical Highlights
- GenieBlue retains the original language capabilities during multimodal training by freezing the original LLM parameters and introducing replicated Transformer layers along with lightweight LoRA modules (a rough sketch of this routing idea follows this summary) [3][19].
- Through extensive fine-tuning, GenieBlue achieves multimodal capabilities comparable to mainstream MLLMs while fully preserving the original pure-language performance [3][19].
- GenieBlue avoids the MoE limitation by employing a non-shared-base inference strategy, enabling smooth operation on devices with the Qualcomm Snapdragon 8 Elite (4th generation) chip [3][19].

Group 3: Training Data and Model Structure Analysis
- The article discusses the limitations of simply adding pure-text data to maintain language capabilities, highlighting the difficulty of collecting high-quality data and the increased training time [9][12].
- Adding pure-text data has limited impact on multimodal capabilities; while it helps on objective NLP tasks, it does not significantly help subjective tasks [11][12].

Group 4: GenieBlue Design and Deployment
- GenieBlue's design draws on the CogVLM structure, separating text and multimodal information processing while avoiding the MoE architecture [19][21].
- The deployment strategy freezes the original LLM during training and uses a non-shared-base approach, which effectively preserves the original language model's behavior [24][26].
- GenieBlue has been validated for both multimodal and pure-language accuracy, demonstrating competitive performance while remaining efficient for mobile NPU deployment [30][31][35].

Group 5: Performance and Efficiency
- GenieBlue's multimodal accuracy is slightly lower than Qwen2.5-VL-3B but retains approximately 97% of BlueLM-V-3B's performance [31].
- In pure-language accuracy, GenieBlue shows no decline, in contrast to Qwen2.5-VL-3B, which degrades on subjective tasks [33].
- On the Snapdragon 8 Elite, deployment incurs a slight increase in loading time and memory, but at 30 tokens per second it meets the daily usage needs of mobile devices [35].
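To visualize the "frozen base plus replicated layers plus LoRA" and "non-shared base" ideas described above, the sketch below routes text tokens through the untouched original block and multimodal tokens through a trainable copy; a LoRA wrapper is shown alongside. This is a simplified reading of the public description with made-up module names, not vivo's actual GenieBlue code.

```python
import copy
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank (LoRA) update."""
    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False            # original weights stay intact
        self.A = nn.Linear(base.in_features, rank, bias=False)
        self.B = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.B.weight)          # adapter starts as a no-op

    def forward(self, x):
        return self.base(x) + self.B(self.A(x))

class DualPathBlock(nn.Module):
    """Non-shared-base routing: text tokens use the frozen original block,
    multimodal tokens use a trainable replica, so pure-language outputs
    stay identical to the original model's."""
    def __init__(self, frozen_block: nn.Module):
        super().__init__()
        self.text_path = frozen_block
        for p in self.text_path.parameters():
            p.requires_grad = False
        self.mm_path = copy.deepcopy(frozen_block)   # replicated Transformer layer
        for p in self.mm_path.parameters():
            p.requires_grad = True

    def forward(self, hidden, is_multimodal):
        # hidden: (B, L, D); is_multimodal: (B, L) boolean mask per token
        mask = is_multimodal.unsqueeze(-1).to(hidden.dtype)
        return mask * self.mm_path(hidden) + (1.0 - mask) * self.text_path(hidden)
```

In a deployment along these lines, only the replicated layers and the LoRA parameters would be trained and shipped alongside the frozen base, which is what keeps the pure-language path untouched.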
ICML 2025 Oral Work Upgraded: Shanghai AI Lab, Fudan, and CUHK Release VideoRoPE++, a Leading Tool for Longer Video Understanding
机器之心· 2025-07-03 03:26
Core Viewpoint
- The article presents VideoRoPE++, an advanced video position-embedding strategy that effectively models spatiotemporal relationships and outperforms previous RoPE variants across a range of video tasks [4][7][34].

Background
- Despite the widespread adoption of RoPE for its long-context processing capabilities, extending one-dimensional RoPE to the complex spatiotemporal structure of video remains an open problem [3].

Analysis
- VideoRoPE++ prioritizes temporal modeling through low-frequency time allocation (LTA), reducing oscillations and ensuring robustness. It employs a diagonal layout to maintain spatial symmetry and introduces adjustable time intervals (ATS) to control temporal spacing [15][26].

VideoRoPE++ Design
- VideoRoPE++ incorporates several key features (a simplified sketch of the position-index construction follows this summary):
  - Low-frequency time allocation (LTA) to mitigate oscillations and ensure robustness [16].
  - Adjustable time intervals (ATS) to align visual and textual tokens in time [24].
  - YaRN-V, a method for extrapolating beyond the training range while preserving spatial structure [26].

Experimental Results
- In long-video retrieval tasks, VideoRoPE++ consistently outperformed other RoPE variants, demonstrating superior robustness [28].
- In long-video understanding tasks, VideoRoPE++ showed significant improvements over baseline methods, highlighting its ability to capture long-distance dependencies [30].
- The extrapolation method YaRN-V achieved a score of 81.33 on the V-RULER benchmark, significantly outperforming traditional position-encoding schemes [32][33].

Conclusion
- The article identifies four criteria for effective position encoding: 2D/3D structure, frequency allocation, spatial symmetry, and time-index scaling. VideoRoPE++ satisfies all four and excels in long-video retrieval, understanding, and hallucination tasks compared with other RoPE variants [34].
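To illustrate two of the design points above (the adjustable time interval and the diagonal layout), the sketch below builds per-token (t, x, y) position indices for a text-then-video stream. The scaling factor and index arithmetic are simplified assumptions for illustration; it omits the low-frequency time allocation and the actual rotary computation.

```python
import numpy as np

def video_position_indices(num_frames, h, w, text_len, delta=2.0):
    """Toy (t, x, y) index construction for [text tokens][video patches].

    delta: adjustable time interval (ATS) - consecutive frames are spaced
           delta apart on the temporal axis, decoupling temporal spacing
           from spatial patch spacing.
    Diagonal layout: each frame's spatial indices are centred on its
    temporal index, so the time axis runs along the frame diagonals.
    """
    positions = []
    for i in range(text_len):          # text tokens behave like 1D RoPE
        positions.append((float(i), float(i), float(i)))
    base = float(text_len)
    for f in range(num_frames):
        t = base + delta * f           # scaled temporal index
        for yy in range(h):
            for xx in range(w):
                positions.append((t,
                                  t + xx - (w - 1) / 2,   # x centred on t
                                  t + yy - (h - 1) / 2))  # y centred on t
    return np.array(positions)

if __name__ == "__main__":
    print(video_position_indices(num_frames=2, h=2, w=2, text_len=3))
```

Each of the three index channels would then be assigned its own band of rotary frequencies; that is where the low-frequency time allocation and the YaRN-V extrapolation would enter in the full method.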
In the Era of Large Models, Where Do Vision Generalist Models Go from Here?
机器之心· 2025-07-02 00:54
Core Viewpoint
- The article discusses the evolution of Vision Generalist Models (VGMs) amid the rise of multimodal large models, arguing that visual data still deserves a distinct focus despite the shift toward integrating visual modalities with language models [1][2].

Group 1: VGM Overview
- VGMs aim to provide a unified framework capable of handling diverse visual tasks and modalities, mirroring the success of large language models in natural language processing [7].
- A key capability of a VGM is processing multimodal inputs, including images, point clouds, and videos, through a shared representation [7][8].
- The model supports multiple visual tasks simultaneously, allowing parallel processing within a single framework [8].

Group 2: Data, Tasks, and Evaluation
- VGMs rely on large, diverse datasets for training and evaluation, covering varied types of visual data to support multimodal learning [9].
- Visual tasks are categorized into four types: image tasks, geometric tasks, time-series tasks, and other vision-related tasks [9].
- Modern evaluation focuses on cross-task generalization and multimodal processing, in contrast to traditional single-task assessment [9].

Group 3: Model Design Paradigms
- Existing VGM design paradigms aim to unify different visual input modalities and diverse task outputs, and fall mainly into encoding-based frameworks and sequence-to-sequence frameworks (a toy skeleton of the encoding-based paradigm follows this summary) [12][13].
- Encoding-based frameworks build a shared feature space for different input modalities, while sequence-to-sequence frameworks suit tasks with variable-length inputs and outputs [12][13].

Group 4: Current Progress and Future Directions
- Current VGM research has made significant progress on unified handling of multiple tasks and modalities, but still faces challenges in optimizing framework design and improving training efficiency [16].
- Data acquisition and annotation remain bottlenecks, and future research will likely focus on automated annotation techniques and large-scale unsupervised learning [16].
- Despite these challenges, VGMs show broad potential in practice, extending beyond traditional visual tasks to complex multimodal tasks in fields such as intelligent surveillance, autonomous driving, and robotics [16].
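The encoding-based paradigm mentioned in Group 3 can be summarized with a toy skeleton: modality-specific encoders map inputs into one shared token space, a shared backbone processes them, and lightweight task heads decode the outputs. Module names and dimensions here are invented for illustration and do not correspond to any particular VGM.

```python
import torch
import torch.nn as nn

class EncodingBasedVGM(nn.Module):
    """Toy vision generalist: modality-specific encoders map inputs into one
    shared feature space, and lightweight heads decode per-task outputs."""
    def __init__(self, dim=256, num_classes=10):
        super().__init__()
        # Modality-specific encoders (image patches, point clouds, ...)
        self.image_enc = nn.Linear(3 * 16 * 16, dim)      # flattened RGB patch
        self.points_enc = nn.Linear(3, dim)               # xyz per point
        # Shared backbone operating on the unified token space
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        # Task-specific heads sharing the same representation
        self.cls_head = nn.Linear(dim, num_classes)       # image task
        self.depth_head = nn.Linear(dim, 1)               # geometric task

    def forward(self, tokens, task):
        feats = self.backbone(tokens)
        if task == "classification":
            return self.cls_head(feats.mean(dim=1))
        if task == "depth":
            return self.depth_head(feats)
        raise ValueError(f"unknown task: {task}")

if __name__ == "__main__":
    model = EncodingBasedVGM()
    patches = torch.randn(2, 64, 3 * 16 * 16)             # B x N x patch_dim
    tokens = model.image_enc(patches)
    print(model(tokens, "classification").shape)           # torch.Size([2, 10])
```

A sequence-to-sequence VGM would instead serialize both inputs and outputs as token sequences and decode them autoregressively, trading this simple head design for more flexible, variable-length outputs.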
Fully Unlocking Modality Collaboration: MokA, a Fine-Tuning Paradigm Tailored for MLLMs
机器之心· 2025-06-29 02:21
Core Viewpoint
- The article argues that current fine-tuning methods for multimodal large language models (MLLMs) largely replicate strategies from unimodal language models without accounting for the distinctive characteristics of multimodal learning [2][9][23].

Summary by Sections

Introduction to MLLMs
- Recent advances in MLLMs have been significant on vision-language and audio-language tasks [2].
- Current fine-tuning methods mainly adapt unimodal strategies such as LoRA, which may not suit multimodal settings [2][8].

Limitations of Current Fine-Tuning Methods
- Many efficient multimodal fine-tuning methods overlook the essential differences between modalities, leading to inadequate use of multimodal information [9][11].
- The article emphasizes that effective multimodal fine-tuning requires both unimodal adaptation and cross-modal adaptation [9][12].

Introduction of the MokA Method
- The research team proposes MokA (Multimodal low-rank Adaptation), which balances independent modeling of unimodal information with modeling of cross-modal interaction [3][12][23].
- MokA retains LoRA's efficiency while redefining the roles of the projection matrices in a multimodal context [14][23].

Key Components of MokA
MokA comprises three critical modules (a sketch of how they might fit together follows this summary):
1. Modality-specific A matrices: ensure independent modeling of unimodal information [15].
2. Cross-modal attention mechanism: strengthens interaction between modalities during instruction tuning [16].
3. Shared B matrix: enables implicit cross-modal alignment by projecting all modalities into a shared space [17].

Experimental Results
- MokA was evaluated on three representative multimodal task settings: audio-visual-text, visual-text, and speech-text [19].
- The method delivered significant performance improvements on multiple benchmark datasets, demonstrating its adaptability and effectiveness [19][23].

Conclusion
- MokA addresses the neglect of modality differences in current fine-tuning paradigms and points to a new direction for multimodal large-model fine-tuning [23].
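Taking the three modules at face value, a MokA-style adapter might be wired as in the sketch below: per-modality A matrices keep unimodal information separate, a cross-modal attention step in the low-rank space lets text tokens absorb context from the other modalities, and a shared B matrix projects everything back into a common space. The dimensions, the single-head attention, and the exact masking are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class MokAStyleAdapter(nn.Module):
    """LoRA-like adapter: modality-specific A, cross-modal attention, shared B."""
    def __init__(self, dim=512, rank=8, modalities=("text", "vision", "audio")):
        super().__init__()
        self.modalities = modalities
        # Modality-specific down-projections (independent unimodal modeling).
        self.A = nn.ModuleDict({m: nn.Linear(dim, rank, bias=False) for m in modalities})
        # Cross-modal attention in the low-rank space.
        self.cross_attn = nn.MultiheadAttention(rank, num_heads=1, batch_first=True)
        # Shared up-projection (implicit cross-modal alignment).
        self.B = nn.Linear(rank, dim, bias=False)
        nn.init.zeros_(self.B.weight)      # adapter starts as a no-op

    def forward(self, hidden, modality_ids):
        # hidden: (B, L, dim); modality_ids: (B, L) ints indexing self.modalities
        low = torch.zeros(*hidden.shape[:2], self.B.in_features, device=hidden.device)
        for idx, name in enumerate(self.modalities):
            mask = (modality_ids == idx).unsqueeze(-1).to(hidden.dtype)
            low = low + mask * self.A[name](hidden)
        # Text tokens (index 0) attend over all tokens to pull in cross-modal context.
        attended, _ = self.cross_attn(low, low, low)
        is_text = (modality_ids == 0).unsqueeze(-1)
        low = torch.where(is_text, low + attended, low)
        return self.B(low)                 # residual added to the frozen layer output

if __name__ == "__main__":
    adapter = MokAStyleAdapter()
    h = torch.randn(2, 10, 512)
    ids = torch.randint(0, 3, (2, 10))
    print(adapter(h, ids).shape)           # torch.Size([2, 10, 512])
```

Because this keeps LoRA's low-rank structure, only the A, B, and attention parameters would need training, which is consistent with the efficiency claim above.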
Forbes China Releases Its "Top 50 AI Technology Companies" List as Innovation Clusters Rise in Tiers
Core Insights
- The 2025 Forbes China "Top 50 AI Technology Companies" list highlights the diverse technological profiles of the selected companies, with a strong showing of hard technology and internationalization, particularly in Shanghai [1][2].

Group 1: Regional Highlights
- Shanghai leads with 21 companies on the list, concentrated in sectors such as new energy vehicles, biomedicine, robotics, and semiconductor integrated circuits [2].
- Beijing has 14 companies, with notable contributions from Cambricon's AI chips and Zhipu AI's general-purpose models, reflecting the region's emphasis on technological originality [2].
- The central region, particularly Wuhan, shows growing innovation with 9 companies, including Landing's AI cervical cancer screening system, which serves over 2,000 medical institutions [2][3].

Group 2: Industry Growth and Structure
- Wuhan's AI industry has recorded a compound annual growth rate above 40% over the past five years, with a core industry scale surpassing 70 billion yuan [3].
- China's AI industry is structured like a pyramid, with major players at the top, "hidden champions" in the middle, and numerous emerging companies at the base, indicating a vibrant ecosystem [4][5].

Group 3: Patent and Innovation Landscape
- The top 50 companies collectively hold over 260,000 patents, with the top five accounting for 90% of the total, while the AIGC sector sees 45% annual growth in software copyrights, mostly from small and medium-sized enterprises [4].
- The coexistence of large established firms and agile startups reflects the distinctive vitality of an AI industry that requires both long-term foundational research and rapid scenario-driven innovation [4][5].

Group 4: Investment Trends
- Among the selected companies, 20 are publicly listed while the rest are privately held, suggesting that innovation is not monopolized by large listed corporations [5].
- Investment logic has shifted toward the commercialization roadmap rather than technological concepts alone, as seen in companies such as Yuanli Wuxian and Blue Technology [5].

Group 5: Future Development Trends
- The list points to three key trends: multimodal large models evolving toward lightweight and industry-specific applications, the integration of quantum computing with AI chips, and AI in healthcare, industrial robotics, and semiconductor equipment emerging as potential investment hotspots [6][7].
- China's AI industry is moving beyond technological catch-up toward a distinctive industrial ecosystem, supported by the implementation of the "New Generation Artificial Intelligence Development Plan" [7].
How Should You Structure Your First Paper in Embodied AI?
具身智能之心· 2025-06-27 09:41
Core Viewpoint
- The article promotes a comprehensive tutoring service for students facing challenges in research-paper writing, particularly in cutting-edge fields such as multimodal large models, embodied intelligence, and robotics [2][3][4].

Group 1: Tutoring Services Offered
- The service includes one-on-one customized guidance in advanced research areas including multimodal large models, vision-language navigation, and robot navigation [3][4].
- The tutoring team consists of PhD researchers from institutions such as CMU, Stanford, and MIT, with experience reviewing for top-tier conferences [4].
- The tutoring covers the entire research-paper lifecycle, from topic selection to experimental design, coding, writing, and submission strategy [4].

Group 2: Target Audience and Benefits
- The service targets students struggling with research topics, data modeling, and advisor feedback, offering a path to improve their academic output [2][5].
- The first 50 students to enquire can be matched with a dedicated tutor free of charge for an in-depth assessment and tailored advice on conference and journal submissions [5].
- The focus is not only on publishing papers but also on the practical application and value of the research in industrial and academic contexts [4].