Multimodal Large Language Models
Latest Survey on Visual Reinforcement Learning: A Full-Spectrum Overview (NUS, Zhejiang University & CUHK)
自动驾驶之心· 2025-08-16 00:03
Figure 1: Timeline of representative visual reinforcement learning models. The figure chronologically surveys key visual RL models from 2023 to 2025 and groups them into four domains: Multimodal LLMs, Visual Generation, Unified Models, and Vision-Language-Action (VLA) Models.

In the world of large language models (LLMs), reinforcement learning (RL), and in particular reinforcement learning from human feedback (RLHF), is no longer news. Like a master with deep inner strength, it has injected a "soul" into models such as GPT, Qwen, and DeepSeek, aligning their answers closely with human reasoning and values. This RL-driven revolution has fundamentally changed how we interact with AI.

Yet just as everyone assumed that reinforcement learning's stage was confined to text, the same wave is now sweeping, at remarkable speed, into a far broader field: computer vision (CV).

Preface: When RLHF "rolls into" ...
Large-Model Solutions for Autonomous Driving: An Overview of Vision-Language Model (VLM) Work for Production and Research
自动驾驶之心· 2025-08-06 23:34
Core Insights
- The article emphasizes the transformative potential of Vision-Language Models (VLMs) in enhancing the perception and cognitive capabilities of autonomous driving systems, enabling them to not only "see" but also "understand" complex driving environments [2][3].

Group 1: VLM Applications in Autonomous Driving
- VLMs can surpass traditional visual models by integrating camera images or video streams to comprehend semantic information in traffic scenes, such as recognizing complex scenarios like "a pedestrian waving to cross the street" [6].
- VLMs convert intricate visual scenes into clear natural language descriptions, improving the interpretability of decisions made by autonomous systems, which aids debugging and increases trust among passengers and regulators (see the captioning sketch after this summary) [6].
- VLMs are crucial for natural language interaction in future smart cabins, allowing passengers to communicate intentions to vehicles through spoken commands [6].

Group 2: Scenario Generation and Testing
- The article introduces CrashAgent, a multi-agent framework that uses multimodal large language models to convert accident reports into structured scenarios for simulation environments, addressing the long-tail distribution issue in existing datasets [7].
- CurricuVLM is a personalized curriculum learning framework that leverages VLMs to analyze agent behavior and dynamically generate tailored training scenarios, improving safety in autonomous driving [13].
- TRACE is a framework that generates key test cases from real accident reports, significantly improving the efficiency of defect detection in autonomous driving systems [17].

Group 3: Out-of-Distribution (OOD) Scenario Generation
- A framework built on large language models is proposed to generate diverse OOD driving scenarios, addressing the challenges posed by the sparsity of such scenarios in urban driving datasets [21][22].
- The article also discusses a method that automatically converts real-world driving videos into detailed simulation scenarios, strengthening the testing of autonomous driving systems [26].

Group 4: Enhancing Safety and Robustness
- WEDGE is a synthetic dataset created with generative vision-language models, aimed at improving the robustness of perception systems in extreme weather conditions [39][40].
- LKAlert is a predictive alert system that uses VLMs to forecast potential lane-keeping assist (LKA) risks, enhancing driver situational awareness and trust [54][55].

Group 5: Advancements in Decision-Making Frameworks
- The CBR-LLM framework combines semantic scene understanding with case retrieval to enhance decision-making in complex driving scenarios, improving accuracy and reasoning consistency [44][45].
- ORION is a holistic end-to-end autonomous driving framework that integrates visual-language instructed action generation, achieving superior performance in closed-loop evaluations [69][70].
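To make the "visual scene to natural language" idea above concrete, here is a minimal sketch that captions a single driving-camera frame with an off-the-shelf vision-language model from the Hugging Face transformers library. The checkpoint, the file name front_camera_frame.png, and the single-frame setup are illustrative assumptions; this does not reproduce any of the frameworks cited above.

```python
# Minimal sketch: turn a driving-camera frame into a natural-language
# scene description with an off-the-shelf image-captioning VLM.
# Checkpoint and input file are assumptions, not any cited framework.
from transformers import pipeline
from PIL import Image

captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")

def describe_scene(frame_path: str) -> str:
    """Return a short natural-language description of a driving scene."""
    image = Image.open(frame_path).convert("RGB")
    outputs = captioner(image)  # e.g. [{"generated_text": "a pedestrian crossing the street"}]
    return outputs[0]["generated_text"]

if __name__ == "__main__":
    print(describe_scene("front_camera_frame.png"))  # hypothetical input frame
```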
AI Catching Fake AI, Achieving SOTA | Xiamen University & Tencent Youtu
量子位· 2025-07-20 02:49
Core Viewpoint
- The article discusses the AIGI-Holmes method developed by Xiamen University and Tencent Youtu Lab for detecting AI-generated images, which addresses the interpretability and generalization challenges of existing detection models [2][12][36].

Group 1: Methodology
- AIGI-Holmes employs a "large model + visual expert" collaborative architecture to enhance image detection capabilities [2][5].
- The method uses a dual-visual-encoder architecture that integrates NPR visual experts to process both high-level semantics and low-level visual features [6].
- The Holmes Pipeline consists of three training phases: visual expert pre-training, supervised fine-tuning (SFT), and direct preference optimization (DPO) [7][22].

Group 2: Key Innovations
- The method addresses two critical bottlenecks of existing detection technologies: lack of interpretability and limited generalization [12][36].
- A new dataset, Holmes-Set, containing 45,000 images and 20,000 annotations covering various types of generation defects, was constructed to alleviate data scarcity [15][18].
- The architecture includes a collaborative decoding strategy that merges predictions from the visual experts and the large language model to improve detection accuracy (a minimal fusion sketch follows this summary) [8][25].

Group 3: Performance Evaluation
- Experimental results indicate that AIGI-Holmes outperforms existing methods across all benchmarks in both detection accuracy and interpretability [10][29].
- The model achieved the best results on objective metrics (BLEU/ROUGE/METEOR/CIDEr) and in subjective evaluations compared with current advanced models [31].
- In robustness tests against common distortions such as JPEG compression and Gaussian blur, AIGI-Holmes maintained superior detection accuracy compared with other baselines [33][35].

Group 4: Future Directions
- The team acknowledges limitations such as the hallucination problem, where the model may misinterpret normal features as defects, and the need for a more fine-grained understanding of visual defects [36][39].
- Future work will focus on addressing the hallucination issue, enhancing fine-grained understanding, and developing objective evaluation metrics for visual-defect explanations [39].
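The "collaborative decoding" bullet above describes merging a visual expert's prediction with the large language model's output. Below is a hedged sketch of one way such a fusion could look: a weighted blend of the expert's fake probability with the LLM's preference between a "real" and a "fake" token. The fusion weight alpha and both model interfaces are assumptions for illustration, not the released AIGI-Holmes implementation.

```python
# Illustrative fusion of a visual expert's fake/real probability with an LLM's
# logits over the two tokens ["real", "fake"]. Weight and interfaces are assumed.
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    z = np.exp(logits - np.max(logits))
    return z / z.sum()

def collaborative_decision(expert_fake_prob: float,
                           llm_logits_real_fake: np.ndarray,
                           alpha: float = 0.5) -> str:
    """Blend expert probability with LLM token probabilities and threshold."""
    llm_probs = softmax(llm_logits_real_fake)            # [p_real, p_fake]
    fused_fake = alpha * expert_fake_prob + (1 - alpha) * llm_probs[1]
    return "AI-generated" if fused_fake > 0.5 else "real"

# Example: the expert is fairly confident the image is synthetic, the LLM slightly agrees.
print(collaborative_decision(0.82, np.array([0.1, 0.6])))
```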
Don't Brute-Force Research in Autonomous Driving: Use the Right Playbook to Overtake on the Curve
自动驾驶之心· 2025-07-11 01:14
Core Viewpoint
- The article emphasizes the importance of learning from experienced mentors in research, particularly in LLM/MLLM, to accelerate the research process and achieve results more efficiently [1].

Group 1: Course Offerings
- The program offers a 1v6 elite small-class format, allowing for personalized guidance from a mentor throughout the research process [5].
- The course covers everything from model theory to practical coding, helping participants build their own knowledge systems and understand algorithm design and innovation in LLM/MLLM [1][10].
- Participants will receive tailored ideas from the mentor to kickstart their research, even if they lack a clear direction initially [7].

Group 2: Instructor Background
- The instructor has a strong academic background, having graduated from a prestigious computer science university and worked as an algorithm researcher at various companies [2].
- The instructor's research includes computer vision, efficient model compression algorithms, and multimodal large language models, with a focus on lightweight models and efficient fine-tuning techniques [2][3].

Group 3: Target Audience
- The program is suitable for graduate students and professionals in autonomous driving and AI, as well as anyone looking to strengthen their algorithmic knowledge and research skills [11].
- It caters to individuals who need to publish papers for academic recognition or who want to systematically master model compression and multimodal reasoning [11].

Group 4: Course Structure and Requirements
- The course is designed to accommodate students with varying levels of foundational knowledge, with the depth of instruction adjusted to participants' backgrounds [14].
- Participants are expected to have a basic understanding of deep learning and machine learning, familiarity with Python and PyTorch, and a willingness to engage actively in the learning process [16][19].
ICML 2025 | Giving AI a "Smart Upgrade Plug-in": Alibaba Security and Tsinghua University's D-MoLE Lets Models Evolve Dynamically During Continual Learning
机器之心· 2025-07-10 04:26
Core Viewpoint
- The article discusses the development of a new framework called D-MoLE (Dynamic Mixture of Curriculum LoRA Experts) aimed at enhancing the continual adaptation capabilities of Multimodal Large Language Models (MLLMs) in response to evolving task requirements while preserving existing knowledge [4][12][13].

Research Background
- Multimodal Large Language Models (MLLMs) combine various modalities such as visual and textual data, showcasing strong capabilities in handling multimodal information [3].
- A significant challenge in practical applications is the phenomenon of catastrophic forgetting, where models lose previously acquired knowledge when fine-tuned for new tasks [4].

Key Challenges
- The need for continual multimodal instruction tuning (CMIT) arises to allow MLLMs to adapt to new tasks while retaining past knowledge [4][12].
- Two main challenges identified are task architecture conflicts and modality imbalance, where different tasks have varying dependencies on model layers and modalities [4][7].

Proposed Solution
- The D-MoLE framework allows dynamic adjustment of model architecture based on task requirements, introducing additional parameter modules (LoRA experts) as needed [10][13].
- It incorporates a gradient-based continual curriculum strategy to balance updates across different modalities, ensuring more equitable optimization [10][12].

Methodology
- D-MoLE consists of two core components: a dynamic layer-wise expert allocator and a gradient-based inter-modal continual curriculum mechanism [16][22].
- The dynamic allocator identifies critical layers for adaptation and allocates LoRA experts accordingly (a minimal layer-selection sketch follows this summary), while the curriculum mechanism adjusts the update ratio between language models and modality encoders based on task difficulty [22][24].

Experimental Results
- D-MoLE was evaluated on a benchmark comprising nine datasets across visual question answering, image captioning, and visual grounding [27].
- The framework demonstrated significant improvements over baseline methods, achieving an average performance increase of approximately 15.08% and reducing backward transfer (BWT) from -21.31% to -1.49% [29].

General Capability Assessment
- D-MoLE maintained strong general multimodal capabilities, outperforming traditional methods in various evaluation benchmarks [30][31].

Training Efficiency
- Despite the introduction of new mechanisms, D-MoLE's total training time was comparable to traditional methods, demonstrating efficiency in training through selective parameter updates [36].

Business Application
- D-MoLE can enhance Alibaba's security multimodal auditing models, allowing for rapid adaptation to different platform rules without extensive retraining, thus reducing operational costs and improving flexibility [38][39].
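As a rough illustration of the dynamic layer-wise expert allocator described above, the sketch below probes a new task with a few batches, ranks parameters by accumulated gradient norm, and returns the top candidates to receive new LoRA experts. The probing loop, the top-k budget, and the selection criterion are assumptions; D-MoLE's actual allocator may differ in detail.

```python
# Gradient-based layer selection sketch: larger accumulated gradient norm is
# taken to mean the layer needs adaptation (and hence a fresh LoRA expert).
import torch

def rank_parameters_by_gradient(model: torch.nn.Module, loss_fn, probe_batches,
                                top_k: int = 4) -> list[str]:
    """Return names of the top-k parameters by gradient norm on probe batches."""
    grad_norms: dict[str, float] = {}
    model.train()
    for batch in probe_batches:
        model.zero_grad()
        loss = loss_fn(model, batch)          # task loss on the new task (assumed signature)
        loss.backward()
        for name, param in model.named_parameters():
            if param.grad is not None:
                grad_norms[name] = grad_norms.get(name, 0.0) + param.grad.norm().item()
    # The returned names would then be targeted when attaching new LoRA modules.
    return sorted(grad_norms, key=grad_norms.get, reverse=True)[:top_k]
```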
Pushing the Boundary of Omni-Modal AI Understanding: Contextual Reinforcement Learning Raises Omni-Modal Models' "Intent" Reasoning to a New Level
量子位· 2025-07-08 07:30
Core Viewpoint
- The article emphasizes the increasing need for deep understanding and analysis of human intent in the context of multimodal large language models (MLLMs) and highlights the challenges of applying reinforcement learning (RL) effectively to complex multimodal data and formats [1][4].

Group 1: Challenges in Multimodal Reasoning
- Insufficient global context understanding leads to incorrect answers when models fail to identify, or misinterpret, multimodal evidence and contextual information [3].
- The shortcut problem arises when models overlook key clues and answer without fully considering the multimodal input, producing suboptimal or partial results [4].

Group 2: Innovations and Advantages
- HumanOmniV2 requires a mandatory context summary before reasoning, ensuring the model does not skip critical multimodal input and providing comprehensive global background support [12].
- A multidimensional reward mechanism is implemented, including a context reward, a format reward, and an accuracy reward, to guide the model toward accurately understanding multimodal context (a minimal reward sketch follows this summary) [13][14].
- The model encourages complex logical reasoning by evaluating whether the reasoning process successfully integrates multimodal information and employs advanced logical analysis techniques [15].

Group 3: Model Design and Training Strategies
- The model is based on Qwen2.5-Omni-Thinker, with improvements to the Group Relative Policy Optimization (GRPO) method to enhance training efficiency, fairness, and robustness [19][20].
- A token-level loss is introduced to address the imbalance in long-sequence training, ensuring balanced optimization for each token [19].
- Removing the question-level normalization term promotes consistency of the optimization process across problems of different difficulty [19].
- Dynamic KL divergence is used to enhance exploration and training stability throughout the training cycle [20].

Group 4: High-Quality Datasets and Benchmarks
- A comprehensive multimodal reasoning training dataset has been created, incorporating image, video, and audio understanding tasks with rich contextual information [23].
- IntentBench, a new multimodal benchmark, evaluates models' ability to understand human behavior and intent in complex scenarios, featuring 633 videos and 2,689 related questions [23].

Group 5: Experimental Results
- HumanOmniV2 achieved breakthrough results across multiple benchmark datasets, attaining 58.47% on Daily-Omni, 47.1% on WorldSense, and 69.33% on the newly introduced IntentBench, outperforming existing open-source multimodal models [24].
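The multidimensional reward described above combines context, format, and accuracy signals. The sketch below shows one hedged way to score a response against such criteria; the <context>/<answer> template tags and the reward weights are illustrative assumptions rather than HumanOmniV2's exact reward functions.

```python
# Multidimensional reward sketch: context presence + format compliance + answer accuracy.
# Tags and weights are assumptions for illustration only.
import re

def multidimensional_reward(response: str, reference_answer: str,
                            w_context: float = 0.3, w_format: float = 0.2,
                            w_accuracy: float = 0.5) -> float:
    has_context = 1.0 if re.search(r"<context>.+?</context>", response, re.S) else 0.0
    has_format = 1.0 if re.search(r"<answer>.+?</answer>", response, re.S) else 0.0
    answer = re.search(r"<answer>(.+?)</answer>", response, re.S)
    correct = 1.0 if answer and answer.group(1).strip() == reference_answer else 0.0
    return w_context * has_context + w_format * has_format + w_accuracy * correct

print(multidimensional_reward(
    "<context>The speaker smiles but sighs.</context><answer>B</answer>", "B"))  # -> 1.0
```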
Kuaishou Team Releases the 8B Kwai Keye-VL! A Quick Look at the Technical Report
自动驾驶之心· 2025-07-07 12:17
Core Insights
- The article discusses the launch of Kwai Keye-VL, an 8 billion parameter multimodal large language model (MLLM) designed to enhance understanding of short video content, addressing the limitations of existing models in processing dynamic and information-dense media [2][3].

Group 1: Model Development
- Kwai Keye-VL is built on a large-scale dataset containing over 600 billion tokens, primarily focused on high-quality video data, and employs an innovative training strategy [2][4].
- The training process consists of a four-stage pre-training phase followed by a two-stage post-training phase, aimed at aligning visual and language features effectively [4][18].

Group 2: Training Methodology
- The first stage of training focuses on optimizing basic capabilities such as instruction following through supervised fine-tuning and mixed preference optimization [5].
- The second stage enhances reasoning abilities using a five-mode "cold start" data mixing strategy, which includes various reasoning tasks and high-quality video data [6][12].

Group 3: Performance Evaluation
- Keye-VL has demonstrated advanced performance in public benchmark tests, outperforming other leading models of similar size in user experience evaluations [3][27].
- The model's capabilities were validated through extensive evaluation experiments, including the development of a new benchmark, KC-MMBench, tailored for real-world short video scenarios [3][28].

Group 4: Technical Innovations
- The model incorporates a hybrid parallelism strategy for efficient training, combining data and sequence parallelism to optimize memory usage and computational efficiency [22][23].
- A dynamic load balancing mechanism is implemented to address computational load imbalances during multimodal training, significantly improving training speed [24].
- A sample-level auto-resume mechanism enhances training stability by allowing automatic recovery from interruptions (a minimal sketch follows this summary) [25].
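To illustrate the sample-level auto-resume idea, here is a minimal sketch that persists the index of the next unconsumed sample alongside periodic checkpoints, so an interrupted run restarts mid-epoch instead of from scratch. The file name, save interval, and training loop are assumptions, not Keye-VL's actual training infrastructure.

```python
# Sample-level auto-resume sketch: record which sample comes next so a crashed
# run can continue exactly where it stopped. File name and loop are assumed.
import json
import os

CHECKPOINT_PATH = "resume_state.json"  # assumed file name

def load_resume_index() -> int:
    """Return the index of the next sample to train on, or 0 for a fresh run."""
    if os.path.exists(CHECKPOINT_PATH):
        with open(CHECKPOINT_PATH) as f:
            return json.load(f)["next_sample"]
    return 0

def train(dataset, train_step, save_every: int = 100):
    start = load_resume_index()
    for i in range(start, len(dataset)):
        train_step(dataset[i])  # one optimization step on one sample (assumed callback)
        if (i + 1) % save_every == 0:
            with open(CHECKPOINT_PATH, "w") as f:
                json.dump({"next_sample": i + 1}, f)  # resume exactly after sample i
```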
Sweeping 6 Major Benchmarks! TW-GRPO Raises the Ceiling for Video Reasoning, with CLEVRER Accuracy Breaking 50.4%!
机器人大讲堂· 2025-07-06 05:23
Core Viewpoint
- The rapid development of multi-modal large language models (MLLMs) is significantly enhancing video reasoning capabilities, with reinforcement learning (RL) acting as a key engine of this technological shift [1].

Group 1: TW-GRPO Framework Introduction
- The TW-GRPO framework is proposed to address challenges in reasoning quality and reward granularity in video reasoning tasks, inspired by the traditional GRPO framework [2].
- TW-GRPO integrates focused thinking with multi-level soft reward mechanisms for multi-choice QA tasks [3].

Group 2: Key Improvements in TW-GRPO
- The framework enhances information weighting and reward-mechanism design, carrying a soft reward mechanism over from video localization to video reasoning tasks [4].
- A dynamic weighting mechanism prioritizes high-information-density tokens, improving reasoning accuracy and efficiency by focusing on key content [4].
- The multi-level reward mechanism redefines rewards, allowing partial correctness in answers and thereby improving training stability and efficiency (a minimal partial-credit sketch follows this summary) [5].

Group 3: Data Augmentation and Training Efficiency
- TW-GRPO introduces a question-answer inversion (QAI) data augmentation technique to convert single-choice tasks into multi-choice formats, effectively expanding the training data pool [6].
- This approach departs from the traditional equal treatment of tokens, enhancing training efficiency and reasoning performance through differentiated information processing [6].

Group 4: Experimental Validation
- Extensive experiments demonstrate TW-GRPO's effectiveness in video reasoning and general understanding tasks, outperforming Video-R1 by 18.8%, 1.8%, and 1.6% on various benchmarks [12][15].
- The framework shows faster convergence and a more stable learning process than traditional GRPO, with shorter output sequences indicating more efficient reasoning [11][17].

Group 5: Qualitative Analysis of Reasoning Paths
- A qualitative comparison of reasoning paths between T-GRPO and TW-GRPO illustrates significant improvements in accuracy and efficiency on dynamic visual-cue reasoning tasks [22].
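The multi-level soft reward above grants partial credit instead of an all-or-nothing score. A hedged sketch of such a scoring rule for multi-choice QA is shown below, using an IoU-style overlap between the predicted and ground-truth option sets; the exact rule used in TW-GRPO may differ.

```python
# Partial-credit reward sketch for multi-choice QA: intersection-over-union
# between predicted and ground-truth option sets. Scoring rule is an assumption.
def soft_choice_reward(predicted: set[str], ground_truth: set[str]) -> float:
    if not predicted and not ground_truth:
        return 1.0
    union = predicted | ground_truth
    return len(predicted & ground_truth) / len(union) if union else 0.0

print(soft_choice_reward({"A", "C"}, {"A", "B", "C"}))  # ~0.67, partially correct
print(soft_choice_reward({"D"}, {"A", "B", "C"}))       # 0.0, fully wrong
```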
Just In: CVPR 2025 Awards Announced. Oxford & Meta PhD Student Jianyuan Wang Wins Best Paper; Saining Xie Takes the Young Researcher Award
机器之心· 2025-06-13 15:45
Core Insights
- The CVPR 2025 conference in Nashville, Tennessee, awarded five papers, including one best paper and four honorable mentions, along with one best student paper and one honorable mention for student papers [1][2].

Submission and Acceptance Statistics
- This year, over 40,000 authors submitted 13,008 papers, marking a 13% increase from last year's 11,532 submissions. A total of 2,872 papers were accepted, resulting in an overall acceptance rate of approximately 22.1%. Among the accepted papers, 96 were oral presentations (3.3%) and 387 were highlighted (13.7%) [3][5].

Conference Attendance
- The conference attracted over 9,000 attendees from more than 70 countries and regions [7].

Paper Acceptance by Field
- The image and video generation field had the highest number of accepted papers, while the highest acceptance rates were seen in 3D based on multi-view and sensor data, as well as single-image 3D [8].

Best Paper Award
- The best paper, titled "VGGT: Visual Geometry Grounded Transformer," was presented by researchers from the University of Oxford and Meta AI. It introduced a universal 3D vision model based on a pure feedforward Transformer architecture, capable of inferring core geometric information from one or more images [13][14].

Notable Research Contributions
- The best paper demonstrated significant performance improvements over traditional optimization methods and existing state-of-the-art models in various 3D tasks, achieving inference speeds in seconds without requiring post-processing optimization [17].

Best Student Paper
- The best student paper, "Neural Inverse Rendering from Propagating Light," proposed a physics-based multi-view dynamic light propagation neural inverse rendering system, achieving state-of-the-art 3D reconstruction under strong indirect lighting conditions [53][55].

Awards and Recognitions
- Two Young Researcher Awards were given to Hao Su and Saining Xie for their outstanding contributions to computer vision research [68][72]. The Longuet-Higgins Award was presented to two papers that have significantly influenced the field, including the Inception architecture and fully convolutional networks for semantic segmentation [75][78][80].
Scientists Confirm That Large Models Can "Understand" Things the Way Humans Do
Ke Ji Ri Bao· 2025-06-10 22:45
Core Insights
- Researchers from the Chinese Academy of Sciences have confirmed that multimodal large language models can learn to "understand" objects in a manner similar to humans, paving the way for future AI systems that can comprehend the world like humans do [1][2].

Group 1: Research Findings
- The study utilized a clever experiment based on human cognitive principles, where both a large model and humans played a "find the difference" game, analyzing data from 4.7 million judgments to create a "concept map" of the model's thinking (a minimal odd-one-out sketch follows this summary) [2].
- The researchers identified 66 key perspectives on how AI "understands" objects, which align closely with the neural activity patterns in the human brain responsible for object processing [2].
- The multimodal model's approach to "thinking" and making choices is found to be more similar to human cognition compared to other models [2].

Group 2: Comparison with Human Understanding
- While humans consider both the appearance and meaning of objects, the large model relies more on "text labels" and learned abstract concepts, indicating a development of a somewhat human-like understanding of the world [2].
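For readers curious what a "find the odd one out" judgment looks like computationally, here is a toy sketch: given embeddings for three concepts, the item least similar to the other two is picked as the odd one. The embedding values are made up; the study's 66-dimensional concept space and stimuli are not reproduced here.

```python
# Toy odd-one-out judgment from concept embeddings; values are illustrative only.
import numpy as np

def odd_one_out(embeddings: np.ndarray) -> int:
    """embeddings: (3, d) array; return the index of the odd item."""
    sim = embeddings @ embeddings.T            # pairwise dot-product similarity
    np.fill_diagonal(sim, 0.0)                 # ignore self-similarity
    return int(np.argmin(sim.sum(axis=1)))     # lowest total similarity = odd one out

toy = np.array([[1.0, 0.1], [0.9, 0.2], [0.0, 1.0]])  # two similar concepts, one outlier
print(odd_one_out(toy))                                # -> 2
```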