Multimodal Large Language Models
NeurIPS 2025 | The SURDS Dataset and GRPO Comprehensively Strengthen Spatial Reasoning for Autonomous Driving
自动驾驶之心· 2025-09-27 23:33
Core Insights
- The article discusses the challenges of achieving accurate spatial reasoning in autonomous driving scenarios with Vision Language Models (VLMs), highlighting the lack of large-scale benchmarks in this area [2][20].
- A new benchmark called SURDS has been introduced to systematically evaluate the spatial reasoning capabilities of VLMs, revealing significant shortcomings in current models [4][20].

Benchmark Overview
- SURDS is a large-scale benchmark built on the nuScenes dataset, consisting of 41,080 visual-question training instances and 9,250 evaluation samples, covering six spatial categories: direction recognition, pixel-level localization, depth estimation, distance comparison, left-right ordering, and front-back relationships [4][20].
- The dataset includes diverse multimodal information collected from urban environments in Boston and Singapore, ensuring realistic testing scenarios [6][20].

Model Training and Evaluation
- The research emphasizes the importance of data generation and introduces a novel automated process for producing high-quality reasoning chains, which enhances the model's spatial reasoning capabilities [8][10].
- A reinforcement learning framework combining spatial localization rewards and logical consistency objectives was designed, leading to significant performance improvements across tasks (see the sketch after this summary) [11][20].

Experimental Results
- The evaluation shows notable differences among models on spatial reasoning tasks, with the proposed model achieving a nearly 60% improvement in depth estimation accuracy over the second-best model [14][20].
- Most existing models struggle with single-object tasks, often performing close to random, indicating a need for better learning of absolute pose and metric information [16][20].

Training Strategy Insights
- Ablation studies indicate that combining localization and logical rewards significantly enhances model performance, underscoring the foundational role of localization ability in spatial reasoning [16][18].
- The scale of model parameters does not directly correlate with spatial understanding capability, suggesting that simply increasing model size is insufficient [16][20].
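The article does not publish the reward implementation; the following is a minimal sketch of the general idea, assuming a hypothetical composite reward that adds a spatial-localization term (an IoU over predicted boxes) to a logical-consistency/accuracy term, in the spirit of the GRPO-style training described above. Function names, tags, and weights are illustrative assumptions.

```python
# Hypothetical composite reward for GRPO-style training, combining a spatial
# localization term with a logical-consistency (answer correctness) term.
# Names, weights, and parsing logic are illustrative assumptions, not the
# authors' implementation.
import re

def localization_reward(pred_box, gt_box):
    """IoU between a predicted and a ground-truth box in pixel coordinates."""
    x1 = max(pred_box[0], gt_box[0]); y1 = max(pred_box[1], gt_box[1])
    x2 = min(pred_box[2], gt_box[2]); y2 = min(pred_box[3], gt_box[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda b: max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area(pred_box) + area(gt_box) - inter
    return inter / union if union > 0 else 0.0

def consistency_reward(response, gt_answer):
    """1.0 if the final answer extracted from the response matches ground truth."""
    match = re.search(r"<answer>(.*?)</answer>", response, re.S)
    return 1.0 if match and match.group(1).strip() == gt_answer else 0.0

def composite_reward(response, pred_box, gt_box, gt_answer,
                     w_loc=0.5, w_logic=0.5):
    return w_loc * localization_reward(pred_box, gt_box) + \
           w_logic * consistency_reward(response, gt_answer)
```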
The Spatial Understanding Capability of VLAs Remains Far From Fully Tapped! A New Attempt with OccVLA (Shanghai Qi Zhi Institute, Tsinghua, SJTU, et al.)
自动驾驶之心· 2025-09-15 23:33
Core Insights
- The article discusses the limitations of existing multimodal large language models (MLLMs) in robust 3D spatial understanding, which is crucial for autonomous driving [3][4].
- It introduces OccVLA, a novel framework that integrates 3D occupancy representation into a unified multimodal reasoning process, enabling the model to learn fine-grained spatial structure from 2D visual inputs [3][9].

Group 1: Introduction and Challenges
- Recent advances in end-to-end autonomous driving have highlighted the gap between 2D and 3D perception, which limits the broader application of vision-language models (VLMs) in complex driving scenarios [4][5].
- Two main challenges are identified: constructing usable and effective 3D representations without expensive manual annotation, and the lack of large-scale 3D vision-language pre-training, which leads to the loss of fine-grained spatial detail [5][8].

Group 2: OccVLA Framework
- OccVLA is designed to perform occupancy prediction, vision-language reasoning, and action generation simultaneously, addressing the sparsity of the occupancy representation and enhancing 3D understanding [9][18].
- The framework employs a cross-attention mechanism in which occupancy tokens receive visual features from the VLM's intermediate layers, integrating them into the reasoning process without additional computational overhead (see the sketch after this summary) [9][20].

Group 3: Performance and Contributions
- OccVLA demonstrates superior performance across perception and planning tasks, achieving state-of-the-art results on the nuScenes dataset for trajectory planning and 3D visual question answering [10][11].
- The main contributions include the OccVLA framework itself, a cross-modal attention design that allows the occupancy prediction branch to be skipped during inference, and competitive results on trajectory planning [11][36].

Group 4: Experimental Results
- Experiments used the nuScenes dataset, comprising 700 training scenes and 150 validation scenes, to evaluate 3D localization, target querying, and relational comparison [35][36].
- OccVLA's motion planning was compared against several baselines, achieving the best performance with only camera input and occupancy supervision, outperforming models that rely on more complex inputs [37][38].

Group 5: Visual Question Answering
- The model was tested on the challenging NuScenes-QA benchmark, demonstrating that it can learn 3D understanding from purely visual input and surpass larger models that depend on LiDAR data or explicit ground-truth occupancy [41][42].
- The results indicate that OccVLA effectively leverages occupancy supervision to strengthen its 3D reasoning in autonomous driving scenarios [41][45].
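The paper's exact architecture is not reproduced here; the PyTorch sketch below only illustrates the general pattern of learnable occupancy tokens attending to intermediate VLM hidden states via cross-attention. The dimensions, module names, and number of tokens are assumptions, not OccVLA's actual configuration.

```python
# Minimal sketch of occupancy tokens attending to intermediate VLM features via
# cross-attention. Shapes and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn

class OccupancyCrossAttention(nn.Module):
    def __init__(self, num_occ_tokens=256, dim=1024, num_heads=8):
        super().__init__()
        # Learnable queries standing in for sparse occupancy tokens.
        self.occ_tokens = nn.Parameter(torch.randn(num_occ_tokens, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, vlm_hidden_states):
        # vlm_hidden_states: (batch, seq_len, dim) from an intermediate VLM layer.
        b = vlm_hidden_states.size(0)
        queries = self.occ_tokens.unsqueeze(0).expand(b, -1, -1)
        fused, _ = self.cross_attn(queries, vlm_hidden_states, vlm_hidden_states)
        # The fused tokens would feed an occupancy head during training; per the
        # article, this branch can be skipped at inference time.
        return self.norm(fused + queries)

# Usage: OccupancyCrossAttention()(torch.randn(2, 300, 1024))  # -> (2, 256, 1024)
```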
From "Lip-Syncing" to "Performing": The Newly Evolved Kling AI Digital Human, With Its Technology Now Public
机器之心· 2025-09-15 12:19
Core Viewpoint
- The article discusses the advances made by Kuaishou's Keling (Kling) team toward a new digital-human generation paradigm with the Kling-Avatar project, which enables expressive, natural performances in long videos, moving beyond simple lip-syncing to full-body expression and emotional engagement [2][31].

Group 1: Technology and Framework
- Kling-Avatar uses a two-stage generative framework driven by a multimodal large language model, transforming audio, visual, and textual inputs into coherent storylines for video generation [6][10].
- A multimodal director module organizes the inputs into a structured narrative: it extracts spoken content and emotional trajectories from audio, identifies human features and scene elements from images, and folds user text prompts into actions and emotional expression [8][10].
- The system first generates a blueprint video that sets the overall rhythm, style, and key expression nodes, which is then used to produce high-quality sub-segment videos (a rough pipeline sketch follows this summary) [12][28].

Group 2: Data and Training
- The Keling team collected thousands of hours of high-quality video from sources such as speeches and dialogues to train multiple expert models that score video quality along several dimensions [14].
- A benchmark of 375 reference image-audio-text prompt triplets was built to evaluate digital-human video generation methods, providing a challenging test of multimodal instruction following [14][23].

Group 3: Performance and Results
- In comparative evaluations against advanced products such as OmniHuman-1 and HeyGen, Kling-Avatar scored higher on overall quality, lip-sync accuracy, visual quality, control responsiveness, and identity consistency [16][24].
- Generated lip movements were tightly synchronized with the audio, and facial expressions adapted naturally to vocal variation, even on complex phonetic sounds [25][26].
- Kling-Avatar can generate long videos efficiently by producing multiple segments in parallel from a single blueprint video while maintaining quality and coherence [28].

Group 4: Future Directions
- The Keling team aims to keep pushing on high-resolution video generation, fine-grained motion control, and complex multi-turn instruction understanding, striving to give digital humans a genuine, captivating presence [31].
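As a schematic of the two-stage flow described above (a director stage producing a blueprint, then parallel sub-segment rendering), the sketch below uses hypothetical placeholder functions and data fields; it is not Kuaishou's implementation and omits all generative modeling.

```python
# Schematic sketch of the two-stage flow: a multimodal "director" produces a
# blueprint, then sub-segments are rendered in parallel from it. All function
# names and data fields here are hypothetical placeholders.
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass

@dataclass
class Blueprint:
    storyline: str   # structured narrative built from audio/image/text inputs
    keyframes: list  # key expression nodes that anchor each sub-segment

def run_director(audio, image, text_prompt) -> Blueprint:
    # Placeholder for the MLLM director: extract speech content and emotion from
    # audio, subject/scene from the image, and fold in the user text prompt.
    storyline = f"narrative({text_prompt})"
    keyframes = [f"node_{i}" for i in range(4)]
    return Blueprint(storyline, keyframes)

def render_segment(blueprint: Blueprint, start: str, end: str) -> str:
    # Placeholder for the video generator conditioned on adjacent keyframes.
    return f"segment[{start}->{end}]"

def generate_long_video(audio, image, text_prompt) -> list:
    bp = run_director(audio, image, text_prompt)
    pairs = list(zip(bp.keyframes[:-1], bp.keyframes[1:]))
    # Segments share the same blueprint, so they can be rendered in parallel.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(lambda p: render_segment(bp, *p), pairs))
```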
The Latest Survey on Visual Reinforcement Learning: A Field-Wide Overview (NUS, Zhejiang University & CUHK)
自动驾驶之心· 2025-08-16 00:03
Core Insights
- The article discusses the integration of reinforcement learning with computer vision, marking a paradigm shift in how AI interacts with visual data [3][4].
- It highlights the potential for AI not only to understand but also to create and optimize visual content according to human preferences, turning AI from a passive observer into an active decision-maker [4].

Research Background and Overview
- The emergence of Visual Reinforcement Learning (VRL) is driven by the successful application of reinforcement learning to Large Language Models (LLMs) [7].
- The article identifies three core challenges in the field: stable policy optimization under complex reward signals, efficient processing of high-dimensional visual inputs, and scalable reward-function design for long-horizon decision-making [7][8].

Theoretical Foundations of Visual Reinforcement Learning
- The theoretical framework formalizes the problem as a Markov Decision Process (MDP), unifying RL for text and visual generation (the standard definitions are restated after this summary) [15].
- Three main alignment paradigms are covered: RL from human feedback (RLHF), Direct Preference Optimization (DPO), and Reinforcement Learning with Verifiable Rewards (RLVR) [16][18].

Core Applications of Visual Reinforcement Learning
- The survey organizes VRL research into four main areas: Multimodal Large Language Models (MLLMs), visual generation, unified models, and Vision-Language-Action (VLA) models [31].
- Each area is further divided into specific tasks, with representative works analyzed for their contributions [31][32].

Evaluation Metrics and Benchmarking
- A layered evaluation framework is proposed, detailing specific benchmarks for each area to ensure reproducibility and comparability in VRL research [44][48].
- The article emphasizes the need for metrics that align with human perception and can validate the performance of VRL systems [61].

Future Directions and Challenges
- Four key challenges are outlined for the future of VRL: balancing reasoning depth against efficiency, handling long-horizon RL in VLA tasks, designing reward models for visual generation, and improving data efficiency and generalization [50][52][54].
- Future research should focus on integrating model-based planning, self-supervised visual pre-training, and adaptive curriculum learning to broaden VRL's practical applications [57].
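For readers unfamiliar with the formalism named above, the block below restates the standard MDP tuple and the standard DPO objective as they are commonly written; these are textbook-style definitions, not formulas reproduced from the survey itself.

```latex
% Standard MDP formalization and DPO objective (common-usage definitions,
% not reproduced from the survey).
\[
\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, R, \gamma), \qquad
J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[\sum_{t=0}^{\infty} \gamma^{t} R(s_t, a_t)\right]
\]
\[
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}})
= -\,\mathbb{E}_{(x,\, y_w,\, y_l)\sim\mathcal{D}}
\left[\log \sigma\!\left(
\beta \log \frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)}
- \beta \log \frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}
\right)\right]
\]
```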
Large-Model Solutions for Autonomous Driving: An Overview of Vision-Language Model (VLM) Work, for Mass Production and Research
自动驾驶之心· 2025-08-06 23:34
Core Insights
- The article emphasizes the transformative potential of Vision-Language Models (VLMs) for the perception and cognition of autonomous driving systems, enabling them not only to "see" but also to "understand" complex driving environments [2][3].

Group 1: VLM Applications in Autonomous Driving
- VLMs can go beyond traditional vision models by combining camera images or video streams to grasp the semantics of traffic scenes, for example recognizing complex situations such as "a pedestrian waving to cross the street" [6].
- VLMs can turn intricate visual scenes into clear natural-language descriptions, improving the interpretability of the system's decisions, which aids debugging and builds trust with passengers and regulators [6].
- VLMs are also central to natural-language interaction in future smart cabins, letting passengers convey intentions to the vehicle through spoken commands [6].

Group 2: Scenario Generation and Testing
- CrashAgent is a multi-agent framework that uses multimodal large language models to convert accident reports into structured scenarios for simulation environments, addressing the long-tail distribution problem in existing datasets (a hedged sketch of report-to-scenario parsing follows this summary) [7].
- CurricuVLM is a personalized curriculum-learning framework that uses VLMs to analyze agent behavior and dynamically generate tailored training scenarios, improving safety in autonomous driving [13].
- TRACE generates key test cases from real accident reports, significantly improving the efficiency of defect detection in autonomous driving systems [17].

Group 3: Out-of-Distribution (OOD) Scenario Generation
- A framework using large language models is proposed to generate diverse OOD driving scenarios, addressing the sparsity of such scenarios in urban driving datasets [21][22].
- The article also covers a method that automatically converts real-world driving videos into detailed simulation scenarios, strengthening the testing of autonomous driving systems [26].

Group 4: Enhancing Safety and Robustness
- WEDGE is a synthetic dataset created with generative vision-language models, aimed at improving the robustness of perception systems in extreme weather [39][40].
- LKAlert is a predictive alert system that uses VLMs to forecast potential lane-keeping-assist (LKA) risks, improving driver situational awareness and trust [54][55].

Group 5: Advancements in Decision-Making Frameworks
- The CBR-LLM framework combines semantic scene understanding with case retrieval to improve decision-making in complex driving scenarios, boosting accuracy and reasoning consistency [44][45].
- ORION is a holistic end-to-end autonomous driving framework that integrates vision-language-instructed action generation, achieving superior performance in closed-loop evaluation [69][70].
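The sketch below illustrates only the general report-to-scenario idea behind frameworks like CrashAgent: prompt a (multimodal) LLM to emit a structured scenario a simulator could consume. The schema, prompt text, and the `call_llm` hook are hypothetical placeholders, not the framework's actual interface.

```python
# Hedged sketch of report-to-scenario extraction: ask an LLM for a structured
# scenario and validate the keys. Schema and prompt are illustrative only.
import json

SCENARIO_SCHEMA = {
    "road_type": "e.g. intersection, highway ramp",
    "weather": "e.g. rain, fog, clear",
    "ego_action": "maneuver of the ego vehicle before the crash",
    "other_agents": "list of {type, initial_position, behavior}",
}

def report_to_scenario(report_text: str, call_llm) -> dict:
    """call_llm: any callable that takes a prompt string and returns a string."""
    prompt = (
        "Extract a structured driving scenario from this accident report.\n"
        f"Return JSON with keys {list(SCENARIO_SCHEMA)}.\n\nReport:\n{report_text}"
    )
    raw = call_llm(prompt)
    scenario = json.loads(raw)
    # Keep only expected keys so downstream simulation configs stay well-formed.
    return {k: scenario.get(k) for k in SCENARIO_SCHEMA}
```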
AI Busting Fake AI, Taking the SOTA | Xiamen University & Tencent Youtu
量子位· 2025-07-20 02:49
Core Viewpoint
- The article presents AIGI-Holmes, a method developed by Xiamen University and Tencent Youtu Lab for detecting AI-generated images, addressing the interpretability and generalization problems of existing detection models [2][12][36].

Group 1: Methodology
- AIGI-Holmes adopts a "large model + visual expert" collaborative architecture to strengthen image-detection capability [2][5].
- The method uses a dual visual-encoder architecture that incorporates an NPR visual expert, processing both high-level semantics and low-level visual features [6].
- The Holmes Pipeline consists of three training phases: visual-expert pre-training, supervised fine-tuning (SFT), and direct preference optimization (DPO) [7][22].

Group 2: Key Innovations
- AIGI-Holmes targets two critical bottlenecks of existing detection technology: the lack of interpretability and limited generalization [12][36].
- A new dataset, Holmes-Set, containing 45,000 images and 20,000 annotations was built to ease data scarcity, covering a wide range of generation defects [15][18].
- A collaborative decoding strategy merges predictions from the visual experts and the large language model to improve detection accuracy (a minimal sketch follows this summary) [8][25].

Group 3: Performance Evaluation
- Experimental results show that AIGI-Holmes outperforms existing methods on all benchmarks in both detection accuracy and interpretability [10][29].
- The model achieved the best results on objective metrics (BLEU, ROUGE, METEOR, CIDEr) and in subjective evaluation compared with current advanced models [31].
- In robustness tests against common distortions such as JPEG compression and Gaussian blur, AIGI-Holmes maintained higher detection accuracy than other baseline methods [33][35].

Group 4: Future Directions
- The team notes remaining limitations, such as hallucination (the model sometimes misreads normal features as defects) and the need for a more fine-grained understanding of visual defects [36][39].
- Future work will focus on the hallucination issue, stronger fine-grained understanding, and objective evaluation metrics for visual-defect explanations [39].
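The following is a minimal sketch of one way a collaborative-decoding step could fuse the two signals named above: a weighted combination of the visual expert's real/fake probability with the language model's probability of emitting a "fake"-indicating token. The weights, token handling, and model hooks are illustrative assumptions, not the paper's code.

```python
# Minimal sketch of collaborative decoding: fuse an expert's fake probability
# with the LLM's token-level preference between "fake" and "real" tokens.
import torch

def collaborative_decode(llm_token_logits, expert_fake_prob, fake_token_id,
                         real_token_id, alpha=0.5):
    """llm_token_logits: (vocab,) logits at the decision step.
    expert_fake_prob: scalar in [0, 1] from the visual expert."""
    llm_probs = torch.softmax(llm_token_logits, dim=-1)
    llm_fake = llm_probs[fake_token_id] / (
        llm_probs[fake_token_id] + llm_probs[real_token_id])
    fused_fake = alpha * llm_fake + (1 - alpha) * expert_fake_prob
    return "fake" if fused_fake > 0.5 else "real", float(fused_fake)

# Usage: verdict, p = collaborative_decode(torch.randn(32000), 0.9, 101, 102)
```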
Don't Brute-Force Your Autonomous-Driving Research! Use the Right Playbook to Overtake on the Curve
自动驾驶之心· 2025-07-11 01:14
Core Viewpoint
- The article emphasizes the importance of learning from experienced mentors in research, particularly in LLM/MLLM, to accelerate the research process and reach results more efficiently [1].

Group 1: Course Offerings
- The program offers a 1v6 elite small-class format, providing personalized guidance from a mentor throughout the research process [5].
- The course covers everything from model theory to hands-on coding, helping participants build their own knowledge systems and understand algorithm design and innovation in LLM/MLLM [1][10].
- Participants receive tailored ideas from the mentor to kick-start their research, even if they lack a clear direction initially [7].

Group 2: Instructor Background
- The instructor has a strong academic background, having graduated from a top computer-science university and worked as an algorithm researcher at several companies [2].
- The instructor's research spans computer vision, efficient model-compression algorithms, and multimodal large language models, with a focus on lightweight models and efficient fine-tuning techniques [2][3].

Group 3: Target Audience
- The program suits graduate students and professionals in autonomous driving and AI, as well as anyone looking to strengthen their algorithmic knowledge and research skills [11].
- It caters to those who need to publish papers for academic recognition or want to systematically master model compression and multimodal reasoning [11].

Group 4: Course Structure and Requirements
- The course accommodates students with varying levels of foundational knowledge, adjusting the depth of instruction to participants' backgrounds [14].
- Participants are expected to have a basic understanding of deep learning and machine learning, familiarity with Python and PyTorch, and a willingness to engage actively in the learning process [16][19].
ICML 2025 | Fitting AI with an "Intelligent Upgrade Plug-in"! Alibaba Security and Tsinghua University's D-MoLE Lets Models Evolve Dynamically During Continual Learning
机器之心· 2025-07-10 04:26
Core Viewpoint
- The article presents D-MoLE (Dynamic Mixture of Curriculum LoRA Experts), a framework aimed at improving the continual-adaptation ability of Multimodal Large Language Models (MLLMs) as task requirements evolve, while preserving previously acquired knowledge [4][12][13].

Research Background
- MLLMs combine modalities such as vision and text and show strong capability in handling multimodal information [3].
- A key obstacle in practice is catastrophic forgetting: models lose previously acquired knowledge when fine-tuned on new tasks [4].

Key Challenges
- Continual multimodal instruction tuning (CMIT) is needed so that MLLMs can adapt to new tasks while retaining past knowledge [4][12].
- Two main challenges are identified: task-architecture conflict and modality imbalance, where different tasks depend on different model layers and modalities to different degrees [4][7].

Proposed Solution
- D-MoLE dynamically adjusts the model architecture to the task, adding parameter modules (LoRA experts) where they are needed (a minimal sketch of per-layer expert allocation follows this summary) [10][13].
- It also uses a gradient-based continual curriculum strategy to balance updates across modalities, making optimization more equitable [10][12].

Methodology
- D-MoLE has two core components: a dynamic layer-wise expert allocator and a gradient-based inter-modal continual curriculum mechanism [16][22].
- The allocator identifies the layers most critical for adapting to a new task and assigns LoRA experts accordingly, while the curriculum mechanism adjusts the update ratio between the language model and the modality encoders based on task difficulty [22][24].

Experimental Results
- D-MoLE was evaluated on a benchmark of nine datasets spanning visual question answering, image captioning, and visual grounding [27].
- It clearly outperformed baseline methods, improving average performance by roughly 15.08% and reducing backward transfer (BWT) from -21.31% to -1.49% [29].

General Capability Assessment
- D-MoLE maintained strong general multimodal capability, outperforming traditional methods on a range of evaluation benchmarks [30][31].

Training Efficiency
- Despite the added mechanisms, D-MoLE's total training time was comparable to traditional methods, demonstrating efficient training through selective parameter updates [36].

Business Application
- D-MoLE can strengthen Alibaba's security-oriented multimodal auditing models, allowing rapid adaptation to different platform rules without extensive retraining, reducing operational cost and improving flexibility [38][39].
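The sketch below shows only the general layer-wise allocation idea: score each layer's sensitivity to a new task (here, by gradient norm on a few probe batches, an assumed proxy) and attach a fresh LoRA expert to the top-k layers. The scoring rule, ranks, and budget are illustrative assumptions rather than D-MoLE's exact mechanism.

```python
# Minimal sketch of layer-wise LoRA-expert allocation driven by a per-layer
# sensitivity score. Scoring rule and hyperparameters are assumptions.
import torch
import torch.nn as nn

class LoRAExpert(nn.Module):
    def __init__(self, dim, rank=8):
        super().__init__()
        self.down = nn.Linear(dim, rank, bias=False)
        self.up = nn.Linear(rank, dim, bias=False)
        nn.init.zeros_(self.up.weight)  # start as an identity-preserving delta

    def forward(self, x):
        return self.up(self.down(x))

def allocate_experts(num_layers, layer_grad_norms, budget, dim):
    """layer_grad_norms: per-layer scores from probe batches on the new task;
    budget: how many layers receive a new expert for this task."""
    ranked = sorted(range(num_layers), key=lambda i: layer_grad_norms[i],
                    reverse=True)
    # Only the most task-sensitive layers grow new capacity.
    return {idx: LoRAExpert(dim) for idx in ranked[:budget]}

# Usage: experts = allocate_experts(24, torch.rand(24).tolist(), budget=6, dim=1024)
```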
Pushing the Boundaries of Omni-Modal AI Understanding: Introducing Context-Based Reinforcement Learning to Raise Omni-Modal Models' "Intent" Reasoning to a New Level
量子位· 2025-07-08 07:30
Core Viewpoint
- The article emphasizes the growing need for deep understanding and analysis of human intent with multimodal large language models (MLLMs), and the difficulty of applying reinforcement learning (RL) effectively to complex multimodal data and formats [1][4].

Group 1: Challenges in Multimodal Reasoning
- Insufficient global context understanding leads to incorrect answers when models fail to identify, or misinterpret, multimodal evidence and contextual information [3].
- The shortcut problem arises when models ignore key clues and answer without fully considering the multimodal information, producing suboptimal or partial results [4].

Group 2: Innovations and Advantages
- HumanOmniV2 requires a mandatory context summary before reasoning, so the model cannot skip critical multimodal input and always has comprehensive global background to draw on [12].
- A multidimensional reward mechanism is used, including a context reward, a format reward, and an accuracy reward, to guide the model toward an accurate understanding of the multimodal context [13][14].
- The model is encouraged toward complex logical reasoning by checking whether the reasoning process actually integrates the multimodal information and applies advanced logical-analysis techniques [15].

Group 3: Model Design and Training Strategies
- The model is built on Qwen2.5-Omni-Thinker, with improvements to Group Relative Policy Optimization (GRPO) to raise training efficiency, fairness, and robustness [19][20].
- Token-level loss is introduced to counter the imbalance in long-sequence training, so every token is optimized in a balanced way (see the sketch after this summary) [19].
- Removing the question-level normalization term keeps the optimization consistent across problems of different difficulty [19].
- Dynamic KL divergence is used to improve exploration and training stability over the whole training cycle [20].

Group 4: High-Quality Datasets and Benchmarks
- A comprehensive multimodal reasoning training dataset was built, covering image, video, and audio understanding tasks with rich contextual information [23].
- IntentBench, a new multimodal benchmark, evaluates models' ability to understand human behavior and intent in complex scenarios, with 633 videos and 2,689 associated questions [23].

Group 5: Experimental Results
- HumanOmniV2 achieved breakthrough results across several benchmarks: 58.47% on Daily-Omni, 47.1% on WorldSense, and 69.33% on the newly introduced IntentBench, surpassing existing open-source multimodal models [24].
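The sketch below contrasts a sequence-level average with the token-level loss idea described above: summing over all response tokens and dividing by the total token count, so long responses are not under-weighted per token. Tensor shapes and the masking convention are assumptions, not the paper's exact code.

```python
# Minimal sketch comparing sequence-level and token-level loss normalization
# for a per-token policy-gradient loss. Shapes and masks are assumptions.
import torch

def token_level_policy_loss(per_token_loss, response_mask):
    """per_token_loss: (batch, seq_len) policy loss per token.
    response_mask: (batch, seq_len), 1 for response tokens, 0 for prompt/padding."""
    masked = per_token_loss * response_mask
    # Sequence-level mean (for contrast): each sample weighs equally regardless
    # of length, so tokens in long responses contribute less individually.
    seq_level = (masked.sum(dim=1) / response_mask.sum(dim=1).clamp(min=1)).mean()
    # Token-level mean: every response token in the batch contributes equally.
    token_level = masked.sum() / response_mask.sum().clamp(min=1)
    return token_level, seq_level
```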
Kuaishou Team Releases the 8B Kwai Keye-VL! A Quick Look at the Technical Report
自动驾驶之心· 2025-07-07 12:17
Core Insights
- The article covers the launch of Kwai Keye-VL, an 8-billion-parameter multimodal large language model (MLLM) built to better understand short-video content, addressing existing models' weakness on dynamic, information-dense media [2][3].

Group 1: Model Development
- Kwai Keye-VL is trained on a large-scale dataset of over 600 billion tokens, centered on high-quality video data, with an innovative training strategy [2][4].
- Training consists of a four-stage pre-training phase followed by a two-stage post-training phase, aimed at aligning visual and language features effectively [4][18].

Group 2: Training Methodology
- The first post-training stage optimizes basic capabilities such as instruction following, through supervised fine-tuning and mixed preference optimization [5].
- The second stage strengthens reasoning with a five-mode "cold start" data-mixing strategy that spans a range of reasoning tasks and high-quality video data [6][12].

Group 3: Performance Evaluation
- Keye-VL shows advanced performance on public benchmarks, outperforming other leading models of comparable size in user-experience evaluations [3][27].
- Its capability was validated with extensive evaluation experiments, including a new benchmark, KC-MMBench, tailored to real-world short-video scenarios [3][28].

Group 4: Technical Innovations
- The model uses a hybrid parallelism strategy for efficient training, combining data and sequence parallelism to optimize memory usage and compute efficiency [22][23].
- A dynamic load-balancing mechanism addresses computational-load imbalance during multimodal training, significantly improving training speed (a rough sketch of such balancing follows this summary) [24].
- A sample-level auto-resume mechanism improves training stability by recovering automatically from interruptions [25].
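As a rough illustration of the load-balancing idea mentioned above, the sketch below estimates a per-sample cost (visual plus text tokens, an assumed proxy) and greedily assigns samples to the currently lightest-loaded worker. The cost model and greedy rule are illustrative assumptions, not Keye-VL's actual scheduler.

```python
# Rough sketch of dynamic load balancing for multimodal batches: a greedy
# longest-first assignment of samples to the least-loaded worker.
import heapq

def balance_batches(samples, num_workers):
    """samples: list of dicts with 'id', 'num_vision_tokens', 'num_text_tokens'."""
    cost = lambda s: s["num_vision_tokens"] + s["num_text_tokens"]
    # Min-heap of (current_load, worker_index); heaviest samples placed first.
    heap = [(0, w) for w in range(num_workers)]
    assignment = {w: [] for w in range(num_workers)}
    for s in sorted(samples, key=cost, reverse=True):
        load, w = heapq.heappop(heap)
        assignment[w].append(s["id"])
        heapq.heappush(heap, (load + cost(s), w))
    return assignment

# Usage:
# balance_batches([{"id": i, "num_vision_tokens": 100 * i, "num_text_tokens": 50}
#                  for i in range(8)], num_workers=2)
```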