Multimodal Large Models
Fully Unlocking Modality Collaboration: MokA, a Fine-Tuning Paradigm Tailored for MLLMs
机器之心· 2025-06-29 02:21
Core Viewpoint
- The article discusses the limitations of current multimodal large language model (MLLM) fine-tuning methods, which often replicate strategies from unimodal language models without accounting for the distinct characteristics of multimodal learning [2][9][23].

Summary by Sections

Introduction to MLLMs
- Recent advances in MLLMs have been significant on visual-language and audio-language tasks [2].
- Current fine-tuning methods primarily adapt strategies from unimodal language models, such as LoRA, which may not suit multimodal settings [2][8].

Limitations of Current Fine-Tuning Methods
- Many efficient multimodal fine-tuning methods overlook the essential differences between modalities, leading to inadequate use of multimodal information [9][11].
- Effective multimodal fine-tuning requires both unimodal adaptation and cross-modal adaptation [9][12].

Introduction of the MokA Method
- The research team proposes MokA (Multimodal low-rank Adaptation), a new method that balances independent modeling of unimodal information with modeling of cross-modal interactions [3][12][23].
- MokA retains LoRA's efficiency while redefining the roles of the projection matrices in a multimodal context [14][23].

Key Components of MokA
- MokA comprises three critical modules:
  1. **Modality-specific A matrices**: ensure independent modeling of unimodal information [15].
  2. **Cross-modal attention mechanism**: strengthens interaction between modalities during instruction tuning [16].
  3. **Shared B matrix**: enables implicit cross-modal alignment by projecting all modalities into a shared space [17].

Experimental Results
- MokA was evaluated in three representative multimodal task settings: audio-visual-text, visual-text, and speech-text [19].
- It delivered significant performance gains on multiple benchmark datasets, demonstrating its adaptability and effectiveness [19][23].

Conclusion
- MokA addresses the neglect of modality differences in current fine-tuning paradigms and offers a new direction for multimodal large model fine-tuning [23].
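The three modules described above can be illustrated with a minimal numpy sketch. All names, shapes, the single-head attention, and the zero-initialized shared B are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 8
modalities = ("audio", "visual", "text")

# Modality-specific A matrices: each modality is compressed independently.
A = {m: rng.normal(scale=0.02, size=(d, r)) for m in modalities}
# Shared B matrix: one up-projection for all modalities (implicit alignment).
# Zero-initialized so the adapter starts as a no-op, as in standard LoRA.
B = np.zeros((r, d))

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def moka_layer(tokens):
    # 1. Independent unimodal modeling via each modality's own A.
    low = {m: x @ A[m] for m, x in tokens.items()}
    # 2. Cross-modal attention: text tokens (queries) attend to audio/visual tokens.
    kv = np.concatenate([low["audio"], low["visual"]], axis=0)   # (Na+Nv, r)
    scores = softmax(low["text"] @ kv.T / np.sqrt(r))            # (Nt, Na+Nv)
    low["text"] = low["text"] + scores @ kv
    # 3. Shared B maps every modality back to the model dimension.
    return {m: v @ B for m, v in low.items()}

tokens = {m: rng.normal(size=(5, d)) for m in modalities}
out = moka_layer(tokens)
print(out["text"].shape)  # (5, 64)
```

Because B starts at zero, the adapter contributes nothing before training, so the base MLLM's behavior is preserved at initialization; only the low-rank pieces are trained.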
Forbes China's "Top 50 AI Technology Companies" List Released: Innovation Clusters Rising in Tiers
Zheng Quan Shi Bao Wang· 2025-06-27 14:39
Core Insights
- The 2025 Forbes China "Top 50 AI Technology Companies" list highlights the diverse technological profiles of the selected companies, with a strong showing of hard technology and internationalization, particularly in Shanghai [1][2].

Group 1: Regional Highlights
- Shanghai leads with 21 companies on the list, focused on sectors such as new energy vehicles, biomedicine, robotics, and semiconductor integrated circuits [2].
- Beijing has 14 companies, with notable contributions from Cambrian's AI chips and Zhipu AI's general-purpose models, reflecting the region's emphasis on technological originality [2].
- The central region, particularly Wuhan, shows growing innovation with 9 companies, including Landing's AI cervical cancer screening system, which serves over 2,000 medical institutions [2][3].

Group 2: Industry Growth and Structure
- Wuhan's AI industry has posted a compound annual growth rate above 40% over the past five years, with a core industry scale exceeding CNY 70 billion [3].
- China's AI industry is structured like a pyramid: major players at the top, "invisible champions" in the middle, and numerous emerging companies at the base, indicating a vibrant ecosystem [4][5].

Group 3: Patent and Innovation Landscape
- The top 50 companies collectively hold over 260,000 patents, with the top five accounting for 90% of the total, while software copyrights in the AIGC sector are growing 45% annually, driven mainly by small and medium enterprises [4].
- The coexistence of large established firms and agile startups reflects the particular vitality of the AI industry, which needs both long-term foundational research and rapid scenario innovation [4][5].

Group 4: Investment Trends
- Among the selected companies, 20 are publicly listed while the rest are not, suggesting that innovation is not monopolized by large corporations [5].
- Investment logic has shifted toward commercialization roadmaps rather than technological concepts alone, as seen in companies like Yuanli Wuxian and Blue Technology [5].

Group 5: Future Development Trends
- The list points to three key trends: multimodal large models evolving toward lightweight, industry-specific applications; the integration of quantum computing with AI chips; and AI in healthcare, industrial robotics, and semiconductor equipment emerging as potential investment hotspots [6][7].
- China's AI industry is moving beyond technological catch-up toward a distinctive industrial ecosystem, supported by the implementation of the "New Generation Artificial Intelligence Development Plan" [7].
How Should You Structure Your First Embodied-AI Paper?
具身智能之心· 2025-06-27 09:41
Core Viewpoint
- The article promotes a comprehensive tutoring service for students facing challenges in research-paper writing, particularly in cutting-edge fields such as multimodal large models, embodied intelligence, and robotics [2][3][4].

Group 1: Tutoring Services Offered
- The service provides one-on-one customized guidance across advanced research areas, including multimodal large models, visual-language navigation, and robot navigation [3][4].
- The tutoring team consists of PhD researchers from institutions such as CMU, Stanford, and MIT, with experience reviewing for top-tier conferences [4].
- Tutoring covers the entire research-paper lifecycle, from topic selection to experimental design, coding, writing, and submission strategy [4].

Group 2: Target Audience and Benefits
- The service targets students struggling with research topics, data modeling, and advisor feedback, offering a path to stronger academic performance [2][5].
- The first 50 students to inquire can be matched free of charge with a dedicated tutor for in-depth analysis and tailored advice on conference and journal submissions [5].
- The focus is not only on publishing papers but also on the practical application and value of research outcomes in industrial and academic contexts [4].
The "Heart" Team Is Hiring! 2025 Business Partner Recruitment, Plenty of Openings
自动驾驶之心· 2025-06-27 09:34
Group 1
- The article announces the recruitment of 10 outstanding partners for the "Autonomous Driving Heart" team, focused on developing autonomous-driving courses, thesis guidance, and hardware [2][3].
- Areas of expertise sought include large models/multimodal large models, diffusion models, VLA, end-to-end systems, embodied interaction, joint prediction, SLAM, 3D object detection, world models, closed-loop simulation with 3DGS, and large-model deployment with quantization-aware inference [3].
- Candidates should preferably hold a master's degree or higher from a university ranked within the QS 200, with priority given to those with significant contributions at top conferences [4].

Group 2
- Benefits include resource sharing for job seeking, doctoral study, and study-abroad recommendations, along with substantial cash incentives and opportunities for entrepreneurial project collaboration [5][6].
- Interested parties are encouraged to contact the company via WeChat to discuss institutional or corporate collaboration in autonomous driving [7].
VLM-Based Fast-Slow Dual-System Autonomous Driving: A DriveVLM Deep Dive
自动驾驶之心· 2025-06-27 09:15
Core Viewpoint
- The article discusses the rapid advances of large models and their applications in autonomous driving, focusing on DriveVLM, an algorithm developed by Tsinghua University and Li Auto to address long-tail problems in real-world driving scenarios [2].

Group 1: DriveVLM Overview
- DriveVLM targets the challenges of moving from Level 2 (L2) to Level 4 (L4) autonomous driving, particularly the effectively unbounded long-tail problems that arise in real-world scenarios [2].
- The industry has recognized that data-driven approaches alone may not suffice to reach true L4 autonomy, necessitating exploration of next-generation solutions [2].

Group 2: Innovations of DriveVLM
- DriveVLM introduces several innovations:
  - Chain-of-Thought (CoT) reasoning for scene description, scene analysis, and hierarchical planning [4].
  - DriveVLM-Dual, which couples DriveVLM with traditional modules for real-time planning and stronger spatial reasoning [4].
  - A comprehensive data-mining and annotation pipeline used to build the corner-case dataset SUP-AD [4].

Group 3: Course Structure and Content
- The article outlines a course on multimodal large models, covering:
  - An introduction to multimodal large models, including foundational concepts and applications [21].
  - Basic modules of multimodal large models, explaining components such as modality encoders and projectors [23].
  - General multimodal large models, focusing on algorithms for various tasks [25].
  - Fine-tuning and reinforcement learning techniques essential for model development [28].
  - Applications of multimodal large models in autonomous driving, highlighting DriveVLM as a key algorithm [30].
  - Job preparation related to multimodal large models, addressing industry needs and interview preparation [32].
Breaking the Bottleneck of General-Domain Reasoning: RLPR, New Reinforcement Learning Research from Tsinghua's NLP Lab
机器之心· 2025-06-27 00:49
Core Viewpoint
- The article introduces Reinforcement Learning with Reference Probability Reward (RLPR), a novel reinforcement learning technique that addresses the limitations of existing methods in generalizing to diverse domains beyond mathematics and coding [4][24].

Group 1: RLPR Technology Overview
- RLPR significantly improves the quality of probability-based rewards through its Prob-to-Reward method, outperforming likelihood-based baselines in both performance and training stability [7][24].
- It introduces a dynamic filtering mechanism based on reward standard deviation, further improving the stability and performance of reinforcement learning [8][17].

Group 2: Effectiveness of PR
- The research team found that the probability a large language model (LLM) assigns to generating the reference answer directly reflects its own assessment of its reasoning process: reasoning accuracy correlates strongly with the probability of generating the correct reference answer [11][24].
- The PR mechanism effectively captures the model's self-assessment of reasoning quality, demonstrating its reliability for evaluating outputs [11][13].

Group 3: Advantages Over Existing Methods
- Unlike existing RLVR methods, which require extensive human effort to write domain-specific validation rules, RLPR produces reward scores with a single forward pass, making it far more efficient at handling the complexity of natural language [13][24].
- RLPR's dynamic filtering mechanism retains samples with high reward standard deviation for training, enhancing training stability and effectiveness [17][24].

Group 4: Robustness and Validation
- Evaluating different reward sources with the ROC-AUC metric, the team showed that PR outperformed rule-based rewards and verifier-model rewards at the 0.5B-parameter scale, with further gains possible as model capability increases [19][21].
- RLPR delivered stable performance improvements across various training templates and base models, including Gemma and Llama, surpassing traditional rule-based RLVR baselines [22][24].
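The two core ideas above, scoring a rollout by the probability the model assigns to the reference answer, and keeping only prompts whose rollouts show high reward variance, can be illustrated with a minimal numpy sketch. Function names, the threshold value, and the toy numbers are hypothetical, not taken from the paper:

```python
import numpy as np

def prob_reward(token_probs):
    """Prob-to-Reward sketch: score a rollout by the mean probability the
    model assigns to the reference-answer tokens (one forward pass)."""
    return float(np.mean(token_probs))

def std_filter(prompt_rewards, threshold=0.05):
    """Dynamic-filtering sketch: keep only prompts whose sampled rollouts
    show high reward standard deviation (an informative training signal)."""
    return [rs for rs in prompt_rewards if np.std(rs) > threshold]

# Per-rollout rewards for three prompts, four sampled responses each.
rewards = [
    [0.91, 0.90, 0.92, 0.91],  # low variance: model already consistent, filtered out
    [0.20, 0.85, 0.40, 0.75],  # high variance: kept for training
    [0.05, 0.06, 0.05, 0.04],  # low variance: filtered out
]
kept = std_filter(rewards)
print(len(kept))  # 1
print(round(prob_reward([0.9, 0.8, 0.95]), 3))  # 0.883
```

The filtering step discards prompts where all rollouts score alike (whether uniformly high or low), since such samples carry little gradient signal under group-relative RL objectives.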
Making Multimodal Large Models "Think Before They Draw": HKU and Others Open-Source GoT-R1, Using Reinforcement Learning to Unlock a New Reasoning Paradigm for Visual Generation
机器之心· 2025-06-25 06:50
Core Viewpoint
- The article discusses significant advances in multimodal large models for generating high-fidelity images from complex text prompts, while highlighting persistent challenges in accurately interpreting spatial relationships and multi-object attributes [1][2].

Group 1: Introduction of GoT-R1
- A research team from the University of Hong Kong, the Chinese University of Hong Kong, and SenseTime has introduced GoT-R1, an important advance building on the Generation Chain-of-Thought (GoT) framework [2].
- GoT-R1 strengthens the semantic-spatial reasoning of multimodal large models through the innovative application of reinforcement learning, allowing the model to autonomously explore and learn better reasoning strategies [3][5].

Group 2: Limitations of the GoT Framework
- GoT improves image-generation accuracy and controllability by explicitly planning semantic content and spatial layout before generating the image, but its reasoning ability is bounded by supervised fine-tuning data built from predefined templates [4][13].
- GoT-R1 aims to overcome these limits by bringing reinforcement learning into the semantic-spatial reasoning process, enabling the model to learn and optimize reasoning paths on its own [5][13].

Group 3: Reward Mechanism in GoT-R1
- GoT-R1 constructs a comprehensive and effective reward mechanism for visual generation tasks, evaluating multiple dimensions of the generated results, including semantic consistency, spatial accuracy, and overall aesthetic quality [13][14]. The reward framework includes:
  1. Reasoning Process Evaluation Reward (RPR) [14].
  2. Reasoning-to-Image Alignment Reward (RRI), which quantifies adherence to the reasoning chain using Intersection over Union (IoU) [15].
  3. Semantic Alignment Reward (Rsem) and Spatial Alignment Reward (Rspa), which assess the completeness and accuracy of the reasoning chain against the original text prompt [16].
  4. Text-to-Image Alignment Reward (RPI), which evaluates the overall consistency of the generated image with the original text prompt [17].

Group 4: Performance Evaluation of GoT-R1
- GoT-R1 was evaluated on the challenging T2I-CompBench, where it set new state-of-the-art (SOTA) results, achieving the highest score in five of six evaluation categories [21][23].
- The model showed clear advantages on complex, multi-layered instructions, particularly in the "Complex" benchmark [23].
- Compared with the baseline model, GoT-R1-7B improved evaluation metrics by up to 15%, showing the effectiveness of reinforcement learning in strengthening the model's reasoning capabilities [24][25].

Group 5: Comparison of Reasoning Chains
- A comparative analysis using GPT-4o found that GoT-R1's reasoning chains were preferred over the baseline's across all evaluation categories, particularly for spatial-relationship understanding [25][26].
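The IoU computation behind the Reasoning-to-Image Alignment Reward can be sketched as a generic overlap score between a planned region and a detected region. This is standard IoU over axis-aligned boxes; the box values and how GoT-R1 extracts planned regions from the reasoning chain are illustrative assumptions:

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes, used here to
    score whether a generated object lands in the region the reasoning
    chain planned for it (a sketch of the alignment-reward idea)."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)  # zero if boxes do not overlap
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

planned = (10, 10, 50, 50)   # layout the reasoning chain proposed
detected = (20, 20, 60, 60)  # where the object actually appears in the image
print(round(iou(planned, detected), 3))  # 0.391
```

A higher IoU means the generated image followed the planned layout more faithfully, so it serves naturally as a dense reward signal for spatial adherence.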
A Complete, Accessible Guide to the Core Fundamentals of LoRA (Low-Rank Adaptation)
自动驾驶之心· 2025-06-22 14:09
Efficient fine-tuning of large models has become a focus of industry attention. Whether for general-purpose large models or intelligent-driving large models, how to turn them into specialized domain models through lightweight fine-tuning is a hot topic, so today let's walk through LoRA.

Background: Large companies and research institutions have the resources to develop large models, but for a typical small company or individual, building one is nearly impossible. A single training run of a model like ChatGPT costs tens of millions of dollars, and even DeepSeek-V3 cost over five million dollars per run. Making full use of open-source large models by fine-tuning them efficiently on domain tasks has therefore become an urgent problem for both academia and industry, and this is where LoRA comes in.

The idea of LoRA is simple: add a bypass branch alongside the original PLM (Pre-trained Language Model) that performs a down-projection followed by an up-projection, approximating the so-called intrinsic rank. The down-projection relies on low-rank decomposition. During training, the PLM's parameters are frozen and only the down-projection matrix A ...
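The bypass described above, down-project with A, up-project with B, added to the frozen weight, can be written out as a generic numpy sketch. Dimensions, the scaling factor alpha, and the initialization follow common LoRA practice rather than any specific codebase:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r = 512, 512, 8   # r << d: the low-rank bottleneck

W = rng.normal(size=(d_in, d_out))          # frozen pretrained weight
A = rng.normal(scale=0.01, size=(d_in, r))  # trainable down-projection
B = np.zeros((r, d_out))                    # trainable up-projection, zero-init

def forward(x, alpha=16.0):
    # Frozen path plus scaled low-rank bypass: x W + (alpha / r) * x A B
    return x @ W + (alpha / r) * (x @ A @ B)

x = rng.normal(size=(2, d_in))
# With B zero-initialized the bypass contributes nothing at step 0,
# so the adapted model starts out identical to the pretrained one.
print(np.allclose(forward(x), x @ W))  # True
# Only A and B are updated during fine-tuning.
print(A.size + B.size, "trainable vs", W.size, "frozen")
```

Here only 8,192 parameters are trainable against 262,144 frozen ones, which is the source of LoRA's memory and compute savings during fine-tuning.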
Glodon (002410): Investor Relations Management Information, 2025-06-21
2025-06-21 13:35
Group 1: AI Strategy and Advantages
- The company has developed AecGPT, a large model built specifically for the construction industry; released in 2024, it can pass the national construction examination with high scores [2]
- Key elements for successful industrial AI include high-quality data, valuable scenarios, and reliable models [2]
- The company possesses a comprehensive engineering-construction knowledge base that underpins the construction large model [2]

Group 2: AI Application Scenarios
- The company focuses on three main directions for deploying AI: integrated design, refined cost management, and precise construction management [3][4]
- In integrated design, AI enhances design workflows and assists construction-drawing design [3]
- Refined cost management uses AI and data to drive detailed cost management across the project lifecycle [4]

Group 3: Value Measurement and Commercialization
- High-value AI applications should deliver a complete task process, be measurable in value, and continuously learn and optimize [5]
- The AI intelligent bidding product has been deployed in 716 construction bidding projects in Hainan, achieving an average bid-reduction rate of 8% and saving approximately CNY 4.56 billion [5]
- The commercial value of AI products is closely tied to technological maturity and the ability to meet new demands [6]

Group 4: Future AI Opportunities
- Future high-value AI scenarios will emerge from both technological breakthroughs and evolving market demands [7]
- Market reforms coming in September 2025 will drive the need for effective data management and cost control in the construction industry [7]
- The company is developing an automatic database-construction product to improve data collection and analysis efficiency [7]
Arriving This Summer: OpenAI Teases GPT-5
Bei Jing Shang Bao· 2025-06-19 14:52
OpenAI co-founder and CEO Sam Altman revealed in a recent podcast that the much-anticipated GPT-5 is expected to launch this summer, though no specific release date has been set. As the release approaches, the industry widely expects the multimodal large model field to enter a new round of technical competition, with GPT-5 representing a major upgrade in generative AI capability. Feedback from early testers suggests significant performance gains over GPT-4. Some observers worry, however, that GPT-5 has repeatedly slipped past rumored release windows since last year: is this another case of crying wolf?

A Major Leap in AI Capability

OpenAI has launched an official podcast, with its CEO leading off. On June 18 local time, OpenAI released an interview video with Sam Altman. In the 40-minute conversation, Altman addressed widely followed topics including GPT-5, privacy protection, the advertising business, and the $500 billion "Stargate" investment project. Altman said GPT-5 will "probably arrive sometime this summer", while also noting that the team is still debating internally whether to simply bump the version number or to keep optimizing and improving the model as was done with GPT-4.

Altman also hinted that GPT-5 represents more than a performance upgrade: it may mark OpenAI's first real step toward a unified, agent-like model, bringing the company closer to its goal of artificial general intelligence. "I think we're nearing the end of this mountain," he said. G ...