Multimodal Large Language Models
A Roundup of VLA Foundation Models and Large-Scale Training Tasks
具身智能之心· 2025-10-08 02:49
This post rounds up several papers on VLA foundation models and large-scale training tasks, ordered by year of publication, with more to be added over time.

Training strategies for efficient embodied reasoning
Year: 2025
Link: https://arxiv.org/abs/2505.08243

RoboBrain: A unified brain model for robotic manipulation from abstract to concrete
Year: 2025
Link: https://arxiv.org/abs/2502.21257
In recent years, multimodal large language models have shown outstanding capabilities in multimodal context processing. However, their application in robotic scenarios, especially long-horizon manipulation tasks, exhibits notable limitations. These limitations stem from current MLLMs ...
NeurIPS 2025 | SURDS: A Dataset and GRPO Training for Comprehensive Spatial Reasoning in Autonomous Driving
自动驾驶之心· 2025-09-27 23:33
Source: 深蓝AI | Author: 深蓝学院

Abstract: With large models advancing rapidly, getting vision-language models (VLMs) to perform accurate spatial reasoning on autonomous-driving images remains a major challenge in AI. The research community has long lacked a large-scale benchmark for spatial reasoning in driving scenes, and existing methods often rely on external expert models, making it hard to measure model capability comprehensively.

In sharp contrast, humans can easily judge an object's orientation in an image or reason about the relative positions of multiple objects from prior knowledge. VLMs possess comparably rich knowledge, yet they still fall short on such tasks.

To address this, Wuhan University, together with the Institute of Automation of the Chinese Academy of Sciences, the Beijing Academy of Artificial Intelligence (BAAI), and other institutions, released SURDS, the first large-scale spatial-reasoning benchmark for VLMs in driving scenes. It systematically evaluates general-purpose models, including the GPT series, as well as spatial-reasoning models such as SpatialRGPT, comprehensively exposing current VLMs' weaknesses in spatial understanding. The team designed "perception accuracy" and ...
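The title pairs the SURDS benchmark with GRPO-based reinforcement. As a rough, generic illustration of how GRPO computes group-relative advantages and how a perception-accuracy-style reward might be mixed with a format term, here is a minimal Python sketch; the function names, reward terms, and weights are assumptions for illustration, not details from the paper.

```python
import numpy as np

def grpo_advantages(rewards):
    """Group-relative advantages: normalize each sampled answer's reward
    against the mean/std of its group (all samples for the same prompt)."""
    r = np.asarray(rewards, dtype=np.float32)
    return (r - r.mean()) / (r.std() + 1e-8)

def composite_reward(perception_correct: bool, format_ok: bool) -> float:
    """Hypothetical composite reward mixing a perception-accuracy term with a
    format term; the weighting is assumed, not taken from SURDS."""
    return 1.0 * float(perception_correct) + 0.2 * float(format_ok)

# Example: 4 sampled answers for one driving-scene spatial question.
rewards = [composite_reward(True, True), composite_reward(False, True),
           composite_reward(True, False), composite_reward(False, False)]
print(grpo_advantages(rewards))
```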
VLA's spatial-understanding capability is far from fully tapped! A new attempt with OccVLA (Shanghai Qi Zhi Institute, Tsinghua, SJTU, et al.)
自动驾驶之心· 2025-09-15 23:33
The spatial-understanding capability of autonomous-driving VLA urgently needs a breakthrough.
(1) Building usable and effective 3D representations without costly manual annotation is difficult;
(2) Because large-scale 3D vision-language pretraining is lacking, fine-grained spatial details are lost in vision-language models (VLMs).

Paper title: OccVLA: Vision-Language-Action Model with Implicit 3D Occupancy Supervision
Paper link: https://arxiv.org/abs/2509.05578

Existing work has explored this challenge extensively (Figure 1(a)). In VLM-based perception pipelines, supervision relies on 3D annotations expressed as text (e.g., coordinates or bounding boxes), which are inherently sparse and carry limited information. Producing such annotations requires heavy manual labeling, which limits scalability. As shown in Figure 1(b), some recent methods attempt to incorporate 3D inputs, but they are constrained by two problems: the lack of large-scale 3D vision-language pretraining data, and the lack of detailed descriptive text for complex spatial scenes. These 3D VLMs typically focus supervision on the textual output while overlooking the rich 3D visual modality, so there is still room to improve spatial understanding for autonomous driving. Against this backdrop, the core challenge is twofold: (1) ...
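The title suggests that supervision comes from an implicit 3D occupancy signal rather than text-only 3D annotations. The sketch below shows one generic way an auxiliary occupancy loss could be added alongside the usual language-modeling loss; the occupancy head, target format, and weighting are assumptions, not OccVLA's actual formulation.

```python
import torch
import torch.nn.functional as F

def occupancy_supervised_loss(lm_logits, lm_targets, occ_logits, occ_targets, occ_weight=0.5):
    """Minimal sketch of occupancy-style supervision: a standard next-token
    loss plus an auxiliary 3D occupancy prediction loss over a voxel grid."""
    # Language-modeling loss over answer tokens (ignore padded positions).
    lm_loss = F.cross_entropy(lm_logits.flatten(0, 1), lm_targets.flatten(),
                              ignore_index=-100)
    # Auxiliary occupancy loss (binary occupied/free per voxel here).
    occ_loss = F.binary_cross_entropy_with_logits(occ_logits, occ_targets.float())
    return lm_loss + occ_weight * occ_loss

# Example shapes: batch=2, seq=8, vocab=32000; voxel grid 16x16x4 (all assumed).
lm_logits = torch.randn(2, 8, 32000)
lm_targets = torch.randint(0, 32000, (2, 8))
occ_logits = torch.randn(2, 16, 16, 4)
occ_targets = torch.randint(0, 2, (2, 16, 16, 4))
print(occupancy_supervised_loss(lm_logits, lm_targets, occ_logits, occ_targets))
```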
From lip-syncing to performing: Kling AI's newly evolved digital human, with its technology now made public
机器之心· 2025-09-15 12:19
Core Viewpoint
- The article discusses the advancements made by Kuaishou's Keling team in creating a new digital human generation paradigm, specifically through the Kling-Avatar project, which allows for expressive and natural performances in long videos, moving beyond simple lip-syncing to full-body expressions and emotional engagement [2][31].

Group 1: Technology and Framework
- The Kling-Avatar utilizes a two-stage generative framework powered by a multimodal large language model, enabling the transformation of audio, visual, and textual inputs into coherent storylines for video generation [6][10].
- A multimodal director module organizes inputs into a structured narrative, extracting voice content and emotional trajectories from audio, identifying human features and scene elements from images, and integrating user text prompts into actions and emotional expressions [8][10].
- The system generates a blueprint video that outlines the overall rhythm, style, and key expression nodes, which is then used to create high-quality sub-segment videos [12][28].

Group 2: Data and Training
- The Keling team collected thousands of hours of high-quality video data from various sources, including speeches and dialogues, to train multiple expert models for assessing video quality across several dimensions [14].
- A benchmark consisting of 375 reference image-audio-text prompt pairs was created to evaluate the effectiveness of the digital human video generation methods, providing a challenging testing scenario for multimodal instruction following [14][23].

Group 3: Performance and Results
- The Kling-Avatar demonstrated superior performance in a comparative evaluation against advanced products like OmniHuman-1 and HeyGen, achieving higher scores in overall effectiveness, lip sync accuracy, visual quality, control response, and identity consistency [16][24].
- The generated lip movements were highly synchronized with audio, and facial expressions adapted naturally to vocal variations, even during complex phonetic sounds [25][26].
- Kling-Avatar's ability to generate long videos efficiently was highlighted, as it can produce multiple segments in parallel from a single blueprint video, maintaining quality and coherence throughout [28].

Group 4: Future Directions
- The Keling team aims to continue exploring advancements in high-resolution video generation, fine-tuned motion control, and complex multi-turn instruction understanding, striving to imbue digital humans with a genuine and captivating presence [31].
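As a rough illustration of the parallel long-video generation described above, the sketch below fans sub-segment generation out over a shared blueprint and concatenates the clips in timeline order; the interfaces, segmentation, and data structures are hypothetical, not Kling-Avatar's actual API.

```python
from concurrent.futures import ThreadPoolExecutor

def generate_segment(blueprint, segment_spec):
    """Placeholder for a sub-segment generator conditioned on the blueprint's
    keyframes and the segment's audio slice (hypothetical interface)."""
    return f"clip[{segment_spec['start']}-{segment_spec['end']}s] guided by {blueprint['id']}"

def generate_long_video(blueprint, segment_specs, workers=4):
    # All sub-segments share the same blueprint, so they can be produced in
    # parallel and stitched back together in timeline order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        clips = list(pool.map(lambda s: generate_segment(blueprint, s), segment_specs))
    return clips

blueprint = {"id": "bp-001", "keyframes": [...]}  # rhythm/style/expression anchors
segments = [{"start": 0, "end": 5}, {"start": 5, "end": 10}, {"start": 10, "end": 15}]
print(generate_long_video(blueprint, segments))
```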
The latest survey on visual reinforcement learning: a field-wide overview (NUS, Zhejiang University & CUHK)
自动驾驶之心· 2025-08-16 00:03
Core Insights
- The article discusses the integration of Reinforcement Learning with Computer Vision, marking a paradigm shift in how AI interacts with visual data [3][4]
- It highlights the potential for AI to not only understand but also create and optimize visual content based on human preferences, transforming AI from passive observers to active decision-makers [4]

Research Background and Overview
- The emergence of Visual Reinforcement Learning (VRL) is driven by the successful application of Reinforcement Learning in Large Language Models (LLMs) [7]
- The article identifies three core challenges in the field: stability in policy optimization under complex reward signals, efficient processing of high-dimensional visual inputs, and scalable reward function design for long-term decision-making [7][8]

Theoretical Foundations of Visual Reinforcement Learning
- The theoretical framework for VRL includes formalizing the problem using Markov Decision Processes (MDP), which unifies text and visual generation RL frameworks [15]
- Three main alignment paradigms are proposed: RL with human feedback (RLHF), Direct Preference Optimization (DPO), and Reinforcement Learning with Verifiable Rewards (RLVR) [16][18]

Core Applications of Visual Reinforcement Learning
- The article categorizes VRL research into four main areas: Multimodal Large Language Models (MLLM), Visual Generation, Unified Models, and Visual-Language-Action (VLA) Models [31]
- Each area is further divided into specific tasks, with representative works analyzed for their contributions [31][32]

Evaluation Metrics and Benchmarking
- A layered evaluation framework is proposed, detailing specific benchmarks for each area to ensure reproducibility and comparability in VRL research [44][48]
- The article emphasizes the need for effective metrics that align with human perception and can validate the performance of VRL systems [61]

Future Directions and Challenges
- The article outlines four key challenges for the future of VRL: balancing depth and efficiency in reasoning, addressing long-term RL in VLA tasks, designing reward models for visual generation, and improving data efficiency and generalization capabilities [50][52][54]
- It suggests that future research should focus on integrating model-based planning, self-supervised visual pre-training, and adaptive curriculum learning to enhance the practical applications of VRL [57]
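Of the three alignment paradigms listed above, DPO has the simplest closed form. Below is a minimal sketch of the standard DPO objective in its generic formulation (not code from the survey), comparing the policy's preference margin against a frozen reference model.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO objective: push the policy to prefer the chosen sample over
    the rejected one, measured relative to a frozen reference model."""
    policy_margin = logp_chosen - logp_rejected
    ref_margin = ref_logp_chosen - ref_logp_rejected
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()

# Example with sequence log-probabilities for a batch of 3 preference pairs.
logp_c, logp_r = torch.tensor([-12.0, -9.5, -20.1]), torch.tensor([-13.4, -11.0, -19.8])
ref_c, ref_r = torch.tensor([-12.5, -10.0, -20.0]), torch.tensor([-13.0, -10.8, -20.2])
print(dpo_loss(logp_c, logp_r, ref_c, ref_r))
```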
Large-model solutions for autonomous driving: an overview of vision-language model (VLM) work for mass production and research
自动驾驶之心· 2025-08-06 23:34
Core Insights
- The article emphasizes the transformative potential of Vision-Language Models (VLMs) in enhancing the perception and cognitive capabilities of autonomous driving systems, enabling them to not only "see" but also "understand" complex driving environments [2][3].

Group 1: VLM Applications in Autonomous Driving
- VLMs can surpass traditional visual models by integrating camera images or video streams to comprehend semantic information in traffic scenes, such as recognizing complex scenarios like "a pedestrian waving to cross the street" [6].
- VLMs facilitate the conversion of intricate visual scenes into clear natural language descriptions, enhancing the interpretability of decisions made by autonomous systems, which aids in debugging and increases trust among passengers and regulators [6].
- VLMs are crucial for natural language interactions in future smart cabins, allowing passengers to communicate intentions to vehicles through spoken commands [6].

Group 2: Scenario Generation and Testing
- The article introduces CrashAgent, a multi-agent framework that utilizes multi-modal large language models to convert accident reports into structured scenarios for simulation environments, addressing the long-tail distribution issue in existing datasets [7].
- CurricuVLM is proposed as a personalized curriculum learning framework that leverages VLMs to analyze agent behavior and dynamically generate tailored training scenarios, improving safety in autonomous driving [13].
- TRACE is a framework that generates key test cases from real accident reports, significantly enhancing the efficiency of defect detection in autonomous driving systems [17].

Group 3: Out-of-Distribution (OOD) Scenario Generation
- A framework utilizing large language models is proposed to generate diverse OOD driving scenarios, addressing the challenges posed by the sparsity of such scenarios in urban driving datasets [21][22].
- The article discusses the development of a method to automatically convert real-world driving videos into detailed simulation scenarios, enhancing the testing of autonomous driving systems [26].

Group 4: Enhancing Safety and Robustness
- WEDGE is introduced as a synthetic dataset created from generative vision-language models, aimed at improving the robustness of perception systems in extreme weather conditions [39][40].
- LKAlert is a predictive alert system that utilizes VLMs to forecast potential lane-keeping assist (LKA) risks, enhancing driver situational awareness and trust [54][55].

Group 5: Advancements in Decision-Making Frameworks
- The CBR-LLM framework combines semantic scene understanding with case retrieval to enhance decision-making in complex driving scenarios, improving accuracy and reasoning consistency [44][45].
- ORION is presented as a holistic end-to-end autonomous driving framework that integrates visual-language instructed action generation, achieving superior performance in closed-loop evaluations [69][70].
AI debunking AI, achieving SOTA | Xiamen University & Tencent Youtu
量子位· 2025-07-20 02:49
Core Viewpoint
- The article discusses the innovative AIGI-Holmes method developed by Xiamen University and Tencent Youtu Lab for detecting AI-generated images, addressing the challenges of interpretability and generalization in existing detection models [2][12][36].

Group 1: Methodology
- AIGI-Holmes employs a "large model + visual expert" collaborative architecture to enhance image detection capabilities [2][5].
- The method includes a dual-visual encoder architecture that integrates NPR visual experts to process both high-level semantics and low-level visual features [6].
- The Holmes Pipeline consists of three training phases: visual expert pre-training, supervised fine-tuning (SFT), and direct preference optimization (DPO) [7][22].

Group 2: Key Innovations
- The AIGI-Holmes method addresses two critical bottlenecks in existing detection technologies: lack of interpretability and limited generalization capabilities [12][36].
- A new dataset, Holmes-Set, was constructed containing 45,000 images and 20,000 annotations to improve data scarcity issues, covering various types of generation defects [15][18].
- The model architecture includes a collaborative decoding strategy that merges predictions from visual experts and the large language model to enhance detection accuracy [8][25].

Group 3: Performance Evaluation
- Experimental results indicate that AIGI-Holmes outperforms existing methods across all benchmarks in detection accuracy and interpretability [10][29].
- The model achieved optimal results in objective metrics (BLEU/ROUGE/METEOR/CIDEr) and subjective evaluations compared to current advanced models [31].
- In robustness tests against common distortions like JPEG compression and Gaussian blur, AIGI-Holmes maintained superior detection accuracy compared to other baseline methods [33][35].

Group 4: Future Directions
- The team acknowledges limitations such as the hallucination problem, where the model may misinterpret normal features as defects, and the need for more granular understanding of visual defects [36][39].
- Future work will focus on addressing the hallucination issue, enhancing fine-grained understanding capabilities, and developing objective evaluation metrics for visual defect explanations [39].
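Group 2 mentions a collaborative decoding strategy that merges predictions from the visual experts and the large language model. The sketch below shows one simple way such a fusion could work, as a weighted average of "fake" probabilities; the fusion rule and weights are assumptions, not AIGI-Holmes's actual decoding scheme.

```python
import torch

def collaborative_verdict(llm_fake_prob, expert_fake_probs, llm_weight=0.5):
    """Hypothetical fusion of the language model's 'fake' probability with the
    average probability from one or more visual-expert classifiers."""
    expert_prob = torch.stack(expert_fake_probs).mean()
    fused = llm_weight * llm_fake_prob + (1.0 - llm_weight) * expert_prob
    return fused, fused > 0.5  # fused probability and final real/fake decision

# Example: the LLM leans 'real' while an NPR-style expert strongly flags 'fake'.
llm_p = torch.tensor(0.35)
experts = [torch.tensor(0.92)]
prob, is_fake = collaborative_verdict(llm_p, experts)
print(float(prob), bool(is_fake))
```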
Don't grind blindly at autonomous-driving research: with the right approach you can overtake on the curve
自动驾驶之心· 2025-07-11 01:14
Core Viewpoint
- The article emphasizes the importance of learning from experienced mentors in the field of research, particularly in LLM/MLLM, to accelerate the research process and achieve results more efficiently [1].

Group 1: Course Offerings
- The program offers a 1v6 elite small class format, allowing for personalized guidance from a mentor throughout the research process [5].
- The course covers everything from model theory to practical coding, helping participants build their own knowledge systems and understand algorithm design and innovation in LLM/MLLM [1][10].
- Participants will receive tailored ideas from the mentor to kickstart their research, even if they lack a clear direction initially [7].

Group 2: Instructor Background
- The instructor has a strong academic background, having graduated from a prestigious computer science university and worked as an algorithm researcher in various companies [2].
- The instructor's research includes computer vision, efficient model compression algorithms, and multimodal large language models, with a focus on lightweight models and efficient fine-tuning techniques [2][3].

Group 3: Target Audience
- The program is suitable for graduate students and professionals in the fields of autonomous driving, AI, and those looking to enhance their algorithmic knowledge and research skills [11].
- It caters to individuals who need to publish papers for academic recognition or those who want to systematically master model compression and multimodal reasoning [11].

Group 4: Course Structure and Requirements
- The course is designed to accommodate students with varying levels of foundational knowledge, with adjustments made to the depth of instruction based on participants' backgrounds [14].
- Participants are expected to have a basic understanding of deep learning and machine learning, familiarity with Python and PyTorch, and a willingness to engage actively in the learning process [16][19].
ICML 2025 | Giving AI a "smart upgrade plugin": Alibaba Security and Tsinghua University's D-MoLE lets models evolve dynamically through continual learning
机器之心· 2025-07-10 04:26
Core Viewpoint
- The article discusses the development of a new framework called D-MoLE (Dynamic Mixture of Curriculum LoRA Experts) aimed at enhancing the continual adaptation capabilities of Multimodal Large Language Models (MLLMs) in response to evolving task requirements while preserving existing knowledge [4][12][13].

Research Background
- Multimodal Large Language Models (MLLMs) combine various modalities such as visual and textual data, showcasing strong capabilities in handling multimodal information [3].
- A significant challenge in practical applications is the phenomenon of catastrophic forgetting, where models lose previously acquired knowledge when fine-tuned for new tasks [4].

Key Challenges
- The need for continual multimodal instruction tuning (CMIT) arises to allow MLLMs to adapt to new tasks while retaining past knowledge [4][12].
- Two main challenges identified are task architecture conflicts and modality imbalance, where different tasks have varying dependencies on model layers and modalities [4][7].

Proposed Solution
- D-MoLE framework allows dynamic adjustment of model architecture based on task requirements, introducing additional parameter modules (LoRA experts) as needed [10][13].
- It incorporates a gradient-based continual curriculum strategy to balance updates across different modalities, ensuring more equitable optimization [10][12].

Methodology
- D-MoLE consists of two core components: a dynamic layer-wise expert allocator and a gradient-based inter-modal continual curriculum mechanism [16][22].
- The dynamic allocator identifies critical layers for adaptation and allocates LoRA experts accordingly, while the curriculum mechanism adjusts the update ratio between language models and modality encoders based on task difficulty [22][24].

Experimental Results
- D-MoLE was evaluated on a benchmark comprising nine datasets across visual question answering, image captioning, and visual grounding [27].
- The framework demonstrated significant improvements over baseline methods, achieving an average performance increase of approximately 15.08% and reducing backward transfer (BWT) from -21.31% to -1.49% [29].

General Capability Assessment
- D-MoLE maintained strong general multimodal capabilities, outperforming traditional methods in various evaluation benchmarks [30][31].

Training Efficiency
- Despite the introduction of new mechanisms, D-MoLE's total training time was comparable to traditional methods, demonstrating efficiency in training through selective parameter updates [36].

Business Application
- D-MoLE can enhance Alibaba's security multimodal auditing models, allowing for rapid adaptation to different platform rules without extensive retraining, thus reducing operational costs and improving flexibility [38][39].
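The dynamic layer-wise expert allocator described above decides where new LoRA experts are attached. As a rough sketch under assumed details, the snippet below ranks layers by the gradient norm measured on a probe batch of the new task and keeps only the top few within a budget; the ranking rule, layer names, and budget are illustrative, not D-MoLE's exact procedure.

```python
def allocate_lora_layers(layer_grad_norms, budget):
    """Rank layers by how strongly a new task perturbs them (gradient norm on a
    probe batch) and pick the top-`budget` layers to receive new LoRA experts."""
    ranked = sorted(layer_grad_norms.items(), key=lambda kv: kv[1], reverse=True)
    return [name for name, _ in ranked[:budget]]

# Hypothetical per-layer gradient norms measured on a probe batch of the new task.
grad_norms = {
    "vision.blocks.3": 0.81, "vision.blocks.7": 0.22,
    "llm.layers.5": 1.34, "llm.layers.18": 0.97, "llm.layers.30": 0.10,
}
print(allocate_lora_layers(grad_norms, budget=3))
# -> the layers most in need of a new expert: ['llm.layers.5', 'llm.layers.18', 'vision.blocks.3']
```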
Breaking the boundaries of omni-modal AI understanding: context-aware reinforcement learning raises omni-modal models' intent reasoning to new heights
量子位· 2025-07-08 07:30
Core Viewpoint
- The article emphasizes the increasing need for deep understanding and analysis of human intent in the context of multimodal large language models (MLLMs) and highlights the challenges faced in applying reinforcement learning (RL) effectively to complex multimodal data and formats [1][4].

Group 1: Challenges in Multimodal Reasoning
- Insufficient global context understanding leads to incorrect answers when models fail to accurately identify or misinterpret multimodal evidence and contextual information [3].
- The shortcut problem arises when models overlook key clues and provide answers without fully considering multimodal information, resulting in suboptimal or partial outcomes [4].

Group 2: Innovations and Advantages
- HumanOmniV2 introduces a mandatory context summarization before reasoning, ensuring models do not skip critical multimodal input and providing comprehensive global background support [12].
- A multidimensional reward mechanism is implemented, including context reward, format reward, and accuracy reward, to guide models in accurately understanding multimodal context [13][14].
- The model encourages complex logical reasoning by evaluating whether the reasoning process successfully integrates multimodal information and employs advanced logical analysis techniques [15].

Group 3: Model Design and Training Strategies
- The model is based on Qwen2.5-Omni-Thinker, with improvements to the Group Relative Policy Optimization (GRPO) method to enhance training efficiency, fairness, and robustness [19][20].
- Token-level loss is introduced to address the imbalance in long sequence training, ensuring balanced optimization for each token [19].
- The removal of question-level normalization terms promotes consistency in the optimization process across different problem difficulties [19].
- Dynamic KL divergence is utilized to enhance exploration capabilities and training stability throughout the training cycle [20].

Group 4: High-Quality Datasets and Benchmarks
- A comprehensive multimodal reasoning training dataset has been created, incorporating image, video, and audio understanding tasks with rich contextual information [23].
- IntentBench, a new multimodal benchmark, evaluates models' abilities to understand human behavior and intent in complex scenarios, featuring 633 videos and 2,689 related questions [23].

Group 5: Experimental Results
- HumanOmniV2 achieved breakthrough results across multiple benchmark datasets, attaining 58.47% on Daily-Omni, 47.1% on WorldSense, and 69.33% on the newly introduced IntentBench, outperforming existing open-source multimodal models [24].
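Group 2 and Group 3 describe a multidimensional reward (context, format, accuracy) and a dynamic KL term. The sketch below shows one plausible way to combine the reward terms and schedule the KL coefficient; the weights and the schedule's shape are assumptions, not the paper's reported settings.

```python
import math

def total_reward(context_score, format_ok, answer_correct,
                 w_context=0.3, w_format=0.2, w_accuracy=0.5):
    """Hypothetical weighted sum of the three reward terms named above
    (context, format, accuracy); the weights are assumptions."""
    return (w_context * context_score
            + w_format * float(format_ok)
            + w_accuracy * float(answer_correct))

def dynamic_kl_coef(step, total_steps, kl_loose=0.005, kl_tight=0.05):
    """Illustrative cosine schedule for the KL coefficient: a loose penalty
    early (more exploration), tightening later (more stability). The actual
    schedule used in the paper may differ."""
    t = step / max(total_steps, 1)
    return kl_loose + 0.5 * (kl_tight - kl_loose) * (1 - math.cos(math.pi * t))

print(total_reward(context_score=0.8, format_ok=True, answer_correct=True))  # 0.94
print(round(dynamic_kl_coef(step=250, total_steps=1000), 4))
```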