Multimodal Large Language Models
The First Earth-Science Agent, Earth-Agent, Has Arrived, Unlocking a New Paradigm for Earth Observation Data Analysis
机器之心· 2025-10-27 08:44
Core Insights
- The article discusses the development of Earth-Agent, a multi-modal large language model (LLM) designed to enhance Earth science research by automating complex analytical tasks and mimicking expert capabilities [3][10].

Group 1: Earth-Agent Overview
- Earth-Agent aims to function as an "AI scientist" capable of understanding research intentions and autonomously planning analysis workflows [3].
- The model can process raw spectral data, remote sensing images, and Earth product data, performing tasks from data preprocessing to spatiotemporal analysis [3][10].

Group 2: Framework and Methodology
- The Earth-Agent framework consists of two key components: encapsulation of domain knowledge into standardized, executable functions and the use of an LLM for intelligent planning and scheduling [10].
- A total of 104 specialized tools have been integrated into the tool library, allowing the agent to dynamically select the most appropriate tools for various tasks [10].

Group 3: Benchmarking and Evaluation
- Earth-Bench, a dataset used for evaluating Earth-Agent, includes 248 expert-annotated tasks across 13,729 images, emphasizing the agent's ability to execute complete Earth science analysis workflows [12][13].
- The evaluation process includes both step-by-step reasoning and end-to-end assessments, focusing on the reasoning process as well as the final results [17].

Group 4: Performance Comparison
- Earth-Agent outperforms traditional agent architectures and MLLM methods in various tasks, demonstrating superior capabilities in Earth observation tasks [22].
- In comparative experiments, Earth-Agent achieved an average accuracy of 55.83% across different modalities, significantly higher than other models [22].

Group 5: Future Directions
- The article suggests that Earth-Agent represents a new learning paradigm, externalizing capabilities into a structured tool library rather than encoding all knowledge within the model [26].
- Future developments may include expanding the tool library, addressing issues like "tool hallucination," and integrating visual capabilities to enhance tool perception [26].
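The two-component design described above — domain expertise wrapped as standardized, executable functions plus an LLM planner that schedules them — can be illustrated with a toy tool registry. Everything here (the `register_tool` decorator, the NDVI tool, the plan format) is an illustrative sketch, not Earth-Agent's actual API.

```python
# Minimal sketch of a tool-library agent loop. All names are hypothetical;
# the real Earth-Agent integrates 104 domain tools behind an LLM planner.
from typing import Callable, Dict

TOOLS: Dict[str, Callable] = {}

def register_tool(name: str):
    """Decorator that adds a function to the shared tool library."""
    def wrap(fn: Callable) -> Callable:
        TOOLS[name] = fn
        return fn
    return wrap

@register_tool("ndvi")
def ndvi(nir: float, red: float) -> float:
    # Normalized Difference Vegetation Index from two spectral bands.
    return (nir - red) / (nir + red)

def run_plan(plan):
    """Execute a list of (tool_name, kwargs) steps chosen by the planner LLM."""
    return [TOOLS[name](**kwargs) for name, kwargs in plan]

print(run_plan([("ndvi", {"nir": 0.8, "red": 0.2})]))
```

In a full system, the plan list would be produced by the LLM from the user's research intent rather than written by hand; the registry pattern is what lets new tools be added without retraining the model.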
Are RAG and Search Agents Losing Their Luster? Apple's DeepMMSearch-R1 Enters the Multimodal Search Arena
36Kr · 2025-10-17 02:44
Core Insights
- Apple has introduced a new model called DeepMMSearch-R1, which enhances multimodal large language models (MLLMs) for web search by enabling dynamic querying and self-correction during multi-round interactions [1][6].

Model Development
- The DeepMMSearch-R1 model addresses limitations in existing methods like retrieval-augmented generation (RAG) and search agents, which often suffer from inefficiencies and poor results due to rigid processes and excessive search calls [1][3].
- The model employs a two-stage training process: supervised fine-tuning (SFT) followed by online reinforcement learning (RL) using the Group-Relative Policy Optimization (GRPO) algorithm [3][5][10].

Dataset Creation
- Apple has created a new dataset named DeepMMSearchVQA, which includes diverse visual question-answering samples presented in multi-turn dialogue format, ensuring a balanced distribution across different knowledge categories [3][7].
- The dataset consists of approximately 47,000 refined dialogue samples, derived from a random selection of 200,000 samples from the InfoSeek training set, ensuring quality by retaining only those dialogues that align with the predictions of the Gemini-2.5-Pro model [7].

Search Process Integration
- The model integrates three tools: a text search tool for targeted queries, a Grounding DINO-based image localization tool for identifying relevant areas in images, and an image search tool for retrieving web content based on input images [4][5].
- This targeted search approach significantly improves retrieval quality and overall performance [3][4].

Performance Metrics
- The DeepMMSearch-R1 model has shown significant performance improvements over RAG workflows and prompt-based search agents, achieving a +21.13% and +8.89% increase in performance, respectively [13].
- The model's performance is comparable to OpenAI's o3, indicating its competitive edge in the market [13].

Training Efficiency
- The SFT phase focuses on enhancing the language model's reasoning capabilities for web retrieval, while the RL phase optimizes tool selection behavior by reducing unnecessary calls [16][17].
- The model maintains its general visual question-answering capabilities while learning to interact with web search tools effectively [19][20].
Are RAG and Search Agents Losing Their Luster? Apple's DeepMMSearch-R1 Enters the Multimodal Search Arena
机器之心· 2025-10-17 02:11
Core Insights
- Apple has introduced a new solution for empowering multimodal large language models (MLLMs) in multimodal web search, addressing inefficiencies in existing methods like retrieval-augmented generation (RAG) and search agents [1][5].

Group 1: Model Development
- The DeepMMSearch-R1 model allows for on-demand multi-round web searches and dynamically generates queries for text and image search tools, improving efficiency and results [1][3].
- A two-stage training process is employed, starting with supervised fine-tuning (SFT) followed by online reinforcement learning (RL) using the GRPO algorithm, aimed at optimizing search initiation and tool usage [3][4].

Group 2: Dataset Creation
- Apple has created a new dataset called DeepMMSearchVQA, which includes diverse multi-hop visual question-answering samples presented in multi-round dialogue format, balancing different knowledge categories [4][7].
- The dataset construction involved selecting 200,000 samples from the InfoSeek training set, resulting in approximately 47,000 refined dialogue samples for training [7].

Group 3: Training Process
- In the SFT phase, the Qwen2.5-VL-7B model is fine-tuned to enhance its reasoning capabilities for web search information while keeping the visual encoder frozen [9].
- The RL phase utilizes GRPO to improve training stability by comparing candidate responses generated under the same prompt, optimizing the model's tool selection behavior [10][12].

Group 4: Performance Results
- The DeepMMSearch-R1 model significantly outperforms RAG workflows and prompt-based search agents, achieving a performance increase of +21.13% and +8.89% respectively [16].
- The model's ability to perform targeted image searches and self-reflection enhances overall performance, as demonstrated in various experiments [16][18].

Group 5: Tool Utilization
- The model's tool usage behavior aligns with dataset characteristics, with 87.7% tool invocation in the DynVQA dataset and 43.5% in the OKVQA dataset [20].
- The RL model effectively corrects unnecessary tool usage observed in the SFT model, highlighting the importance of RL in optimizing tool efficiency [21].

Group 6: Generalization Capability
- The use of LoRA modules during SFT and KL penalty in online GRPO training helps maintain the model's general visual question-answering capabilities across multiple datasets [23][24].
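Both write-ups attribute the RL stage to GRPO, whose defining trick is the critic-free, group-relative advantage: several candidate answers are sampled for the same prompt and each one's reward is normalized against the group's statistics. A minimal sketch of just that advantage computation (reward values are made up; the full GRPO objective also includes a clipped policy ratio and KL penalty not shown here):

```python
# Group-relative advantage at the heart of GRPO: no learned critic, just
# per-group reward normalization. Rewards here are illustrative.
from statistics import mean, pstdev

def grpo_advantages(rewards, eps=1e-8):
    """Advantage of each sampled response relative to its sampling group.

    If all rewards are equal, pstdev is 0 and every advantage is 0 (the
    eps term only guards the division)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled answers to one prompt: two scored correct, two incorrect.
print(grpo_advantages([1.0, 0.0, 1.0, 0.0]))
```

Correct answers get a positive advantage and incorrect ones a symmetric negative advantage, which is what pushes the policy toward initiating searches only when they actually improve the reward.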
No More "Guessing Coordinates"! Yan Shuicheng's Team and Collaborators Release PaDT, a Multimodal Large Model Achieving True Multimodal Representation Output
机器之心· 2025-10-16 00:51
Core Insights
- The article discusses the advancements in Multimodal Large Language Models (MLLMs) and introduces a new paradigm called Patch-as-Decodable Token (PaDT) to address the limitations of existing models in tasks requiring fine spatial understanding [2][6].

Group 1: PaDT Overview
- PaDT proposes a revolutionary approach by dividing images into multiple visual patches and allowing the model to generate corresponding Visual Reference Tokens (VRTs) directly [3].
- It enables seamless alternation between text tokens and visual tokens at both input and output stages, making the model's description of image content as natural as describing text [4].
- The model can directly indicate image targets in generated sentences rather than guessing coordinates [5].

Group 2: Limitations of Traditional MLLMs
- Traditional MLLMs output detection box coordinates in string format, leading to inconsistencies, semantic disconnection, and weak image-text associations [8].
- The output format can vary, making it difficult to parse targets, and numbers can be split into separate tokens, disrupting spatial continuity [8].
- The reliance on coordinate tokens, which lack inherent semantic meaning, results in challenges such as hallucination and repetition in generated outputs [8].

Group 3: PaDT Mechanism
- PaDT introduces VRTs derived from the visual patch embeddings of the input image, creating a dynamic embedding table that integrates both text and visual information [11].
- This design avoids the pitfalls of traditional methods that depend on global visual codebooks, which can confuse similar objects and generate non-existent patches [13].
- The lightweight PaDT Decoder, consisting of three bidirectional attention blocks, transforms VRTs into structured visual outputs like bounding boxes and segmentation masks [15].

Group 4: Performance Metrics
- PaDT Pro (3B) achieved a remarkable average accuracy of 93.6 in the RefCOCO/+/g referring expression comprehension task, surpassing the 78B InternVL3 model, which scored 91.4 [21][22].
- In the COCO open vocabulary detection task, traditional MLLMs typically have a mean Average Precision (mAP) below 20, while PaDT Pro (3B) raised it to 38.2, nearly doubling the performance [21][24].
- The model also demonstrated strong performance in the Referring Image Captioning (RIC) task, significantly improving the CIDEr-D score from 0.386 to 1.450 [24].

Group 5: Implications and Future Directions
- PaDT's success stems from its deep understanding of the visual capability bottlenecks in MLLMs, allowing for native alignment between visual patches and generated tokens [31].
- The dynamic embedding mechanism ensures strong binding of VRTs to the current image, preventing cross-image confusion [31].
- The model exhibits robust multitasking capabilities, outperforming single-task models by seamlessly switching tasks through prompt changes [33].
- The introduction of PaDT marks a significant step towards achieving true multimodal intelligence, allowing for more natural interactions between different modalities [35].
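The core idea — extending the decoder's vocabulary per image so patch references can be emitted inline with words — can be shown with a toy dynamic vocabulary. Token names like `<patch_i>` and the list-based "embedding table" are illustrative stand-ins for PaDT's actual VRT embeddings, which are derived from the input image's patch features:

```python
# Toy version of PaDT's dynamic vocabulary: for each image, one Visual
# Reference Token per patch is appended to the text vocabulary, so the
# decoder can emit a patch reference inline with ordinary words instead
# of spelling out coordinate digits.
def build_dynamic_vocab(text_vocab, num_patches):
    vocab = list(text_vocab)
    vrt_ids = []
    for i in range(num_patches):
        vrt_ids.append(len(vocab))        # new token id for this patch
        vocab.append(f"<patch_{i}>")      # stand-in for a patch embedding
    return vocab, vrt_ids

def decode(token_ids, vocab):
    return " ".join(vocab[t] for t in token_ids)

vocab, vrts = build_dynamic_vocab(["the", "dog", "is", "at"], num_patches=4)
# A generated sequence can now interleave words and patch references:
print(decode([0, 1, 2, 3, vrts[2]], vocab))  # the dog is at <patch_2>
```

Because the VRT ids are rebuilt per image, a reference can only ever point into the current image — the same property the article credits with preventing cross-image confusion.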
A Roundup of VLA Foundation Models and Large-Scale Training Tasks
具身智能之心· 2025-10-08 02:49
Core Insights
- The article summarizes several research papers related to Vision-Language-Action (VLA) models and their training strategies, highlighting advancements in embodied intelligence and robotics [2][3][5][7][9][11][13][15][17][19].

Group 1: Training Strategies and Model Improvements
- The paper "Training strategies for efficient embodied reasoning" discusses the use of Chain of Thought (CoT) reasoning to enhance the performance and generalization of VLA models, achieving a threefold increase in reasoning speed compared to standard methods [3].
- "CAST: Counterfactual labels improve instruction following in vision-language-action models" introduces a method to generate counterfactual labels, which significantly improves the instruction-following capabilities of VLA models, with a 27% increase in navigation task success rates [5].
- "RoboBrain: A unified brain model for robotic manipulation" presents a new dataset, ShareRobot, which enhances the planning and trajectory prediction capabilities of robots, leading to state-of-the-art performance in various tasks [7].

Group 2: Dataset Development and Evaluation
- The "DROID" dataset is introduced as a large-scale, diverse dataset for robot manipulation, containing 76,000 demonstration trajectories collected over 350 hours, which improves performance and generalization of trained strategies [9].
- "ViSA-Flow" proposes a framework for learning from large-scale video data, achieving state-of-the-art performance in robot skill learning, particularly in low-data scenarios [11].
- The "CORTEXBENCH" benchmark evaluates pre-trained visual representations for embodied AI, revealing that no single representation excels across all tasks, but task-specific adaptations can lead to significant performance improvements [13].

Group 3: Generalist Robot Policies and Learning Frameworks
- "Effective tuning strategies for generalist robot manipulation policies" identifies key factors influencing the performance of Generalist Manipulation Policies (GMPs) during fine-tuning, establishing a new benchmark for future research [15].
- The "CACTI" framework focuses on scalable multi-task learning in robotic systems, demonstrating effective training across various kitchen tasks in both real and simulated environments [17].
- "R3m: A universal visual representation for robot manipulation" shows that pre-trained visual representations can enhance data-efficient learning in real-world environments, improving task success rates by over 20% compared to training from scratch [19].
NeurIPS 2025 | The SURDS Dataset and GRPO Comprehensively Strengthen Spatial Reasoning for Autonomous Driving
自动驾驶之心· 2025-09-27 23:33
Core Insights
- The article discusses the challenges of achieving accurate spatial reasoning in autonomous driving scenarios using Vision Language Models (VLMs), highlighting the lack of large-scale benchmarks in this area [2][20].
- A new benchmark called SURDS has been introduced to systematically evaluate the spatial reasoning capabilities of VLMs, revealing significant shortcomings in current models [4][20].

Benchmark Overview
- SURDS is a large-scale benchmark based on the nuScenes dataset, consisting of 41,080 visual-question training instances and 9,250 evaluation samples, covering six spatial categories: direction recognition, pixel-level localization, depth estimation, distance comparison, left-right ordering, and front-back relationships [4][20].
- The dataset includes diverse multimodal information collected from urban environments in Boston and Singapore, ensuring a realistic testing scenario [6][20].

Model Training and Evaluation
- The research emphasizes the importance of data generation and introduces a novel automated process for generating high-quality reasoning chains, which enhances the model's spatial reasoning capabilities [8][10].
- A reinforcement learning framework combining spatial localization rewards and logical consistency objectives was designed, leading to significant performance improvements in various tasks [11][20].

Experimental Results
- The evaluation results show that different models exhibit notable differences in spatial reasoning tasks, with the proposed model achieving a nearly 60% improvement in depth estimation accuracy compared to the second-best model [14][20].
- The study reveals that most existing models struggle with single-object tasks, often performing close to random levels, indicating a need for better learning of absolute pose and metric information [16][20].

Training Strategy Insights
- Ablation studies indicate that combining localization and logical rewards significantly enhances model performance, underscoring the foundational role of localization ability in spatial reasoning [16][18].
- The research also highlights that the scale of model parameters does not directly correlate with spatial understanding capabilities, suggesting that simply increasing model size is insufficient [16][20].
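The combined reward described above (a spatial localization term plus a logical consistency term) might be sketched as follows; the IoU-based localization score, the binary consistency bonus, and the 0.7/0.3 weighting are assumptions for illustration, not the paper's exact formulation:

```python
# Illustrative two-term RL reward: box overlap with ground truth plus a
# bonus when the stated spatial relation is geometrically consistent.
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def reward(pred_box, gt_box, relation_ok, w_loc=0.7, w_logic=0.3):
    # relation_ok: whether the model's left/right or front/back claim
    # matches the geometry implied by its own localization.
    return w_loc * iou(pred_box, gt_box) + w_logic * (1.0 if relation_ok else 0.0)

print(reward((0, 0, 2, 2), (1, 1, 3, 3), relation_ok=True))
```

Coupling the two terms reflects the ablation finding: a model can only earn the consistency bonus reliably if its localization is already sound.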
VLA's Capacity for Spatial Understanding Is Far From Fully Tapped! A New Attempt with OccVLA (Shanghai Qi Zhi Institute, Tsinghua, SJTU, et al.)
自动驾驶之心· 2025-09-15 23:33
Core Insights
- The article discusses the limitations of existing multimodal large language models (MLLMs) in robust 3D spatial understanding, which is crucial for autonomous driving [3][4].
- It introduces OccVLA, a novel framework that integrates 3D occupancy representation into a unified multimodal reasoning process, enhancing the model's ability to learn fine-grained spatial structures from 2D visual inputs [3][9].

Group 1: Introduction and Challenges
- Recent advancements in end-to-end autonomous driving technology have highlighted the gap between 2D and 3D perception, which limits the widespread application of visual-language models (VLMs) in complex driving scenarios [4][5].
- Two main challenges are identified: the difficulty in constructing usable and effective 3D representations without expensive manual annotations, and the lack of large-scale 3D visual-language pre-training that results in loss of fine-grained spatial details [5][8].

Group 2: OccVLA Framework
- OccVLA is designed to perform occupancy prediction, visual-language reasoning, and action generation tasks simultaneously, addressing the sparsity of occupancy representation and enhancing 3D understanding capabilities [9][18].
- The framework employs a cross-attention mechanism to receive visual features from the VLM's intermediate layers, allowing for effective integration of occupancy tokens into the reasoning process without additional computational overhead [9][20].

Group 3: Performance and Contributions
- OccVLA has demonstrated superior performance in various perception and planning tasks, achieving state-of-the-art results on the nuScenes dataset for trajectory planning and 3D visual question answering [10][11].
- The main contributions of the article include the introduction of the OccVLA framework, the design of a cross-modal attention mechanism that allows skipping the occupancy prediction process during inference, and the achievement of competitive results in trajectory planning tasks [11][36].

Group 4: Experimental Results
- The experiments utilized the nuScenes dataset, which includes 700 training scenes and 150 validation scenes, to evaluate the model's capabilities in 3D localization, target querying, and relational comparison tasks [35][36].
- OccVLA's motion planning capabilities were compared with several baseline models, showing that it achieves optimal performance with only camera input and occupancy information as supervision, outperforming models that rely on more complex input data [37][38].

Group 5: Visual Question Answering
- The model was tested on the challenging NuScenes-QA benchmark dataset, demonstrating its ability to learn 3D understanding from pure visual input, surpassing larger models that depend on LiDAR data or explicit ground truth occupancy information [41][42].
- The results indicate that OccVLA effectively integrates occupancy supervision to enhance its 3D reasoning capabilities in autonomous driving scenarios [41][45].
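The cross-attention step — occupancy queries attending over intermediate VLM visual features — reduces to standard scaled dot-product attention. Below is a single-head, pure-Python sketch with tiny two-dimensional features; OccVLA's real tensor shapes, head counts, and layer placement are not specified in this summary, so everything beyond the attention formula is an assumption:

```python
# Single-head scaled dot-product cross-attention: one occupancy query
# vector attends over a list of visual feature vectors from the VLM's
# intermediate layers. Pure Python for clarity, not efficiency.
import math

def softmax(xs):
    m = max(xs)                      # subtract max for numerical stability
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def cross_attend(query, keys, values):
    """query: [d]; keys, values: lists of [d] feature vectors."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    weights = softmax(scores)
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(d)]

feats = [[1.0, 0.0], [0.0, 1.0]]     # two toy visual features
print(cross_attend([1.0, 0.0], feats, feats))
```

Because the occupancy tokens only read from features the VLM already computes, the occupancy branch can be dropped at inference time without changing the language path — consistent with the "skippable prediction" contribution above.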
From "Lip-Syncing" to "Performing": The Newly Evolved Kling AI Digital Human, with Its Technology Now Public
机器之心· 2025-09-15 12:19
Core Viewpoint
- The article discusses the advancements made by Kuaishou's Keling team in creating a new digital human generation paradigm, specifically through the Kling-Avatar project, which allows for expressive and natural performances in long videos, moving beyond simple lip-syncing to full-body expressions and emotional engagement [2][31].

Group 1: Technology and Framework
- The Kling-Avatar utilizes a two-stage generative framework powered by a multimodal large language model, enabling the transformation of audio, visual, and textual inputs into coherent storylines for video generation [6][10].
- A multimodal director module organizes inputs into a structured narrative, extracting voice content and emotional trajectories from audio, identifying human features and scene elements from images, and integrating user text prompts into actions and emotional expressions [8][10].
- The system generates a blueprint video that outlines the overall rhythm, style, and key expression nodes, which is then used to create high-quality sub-segment videos [12][28].

Group 2: Data and Training
- The Keling team collected thousands of hours of high-quality video data from various sources, including speeches and dialogues, to train multiple expert models for assessing video quality across several dimensions [14].
- A benchmark consisting of 375 reference image-audio-text prompt pairs was created to evaluate the effectiveness of the digital human video generation methods, providing a challenging testing scenario for multimodal instruction following [14][23].

Group 3: Performance and Results
- The Kling-Avatar demonstrated superior performance in a comparative evaluation against advanced products like OmniHuman-1 and HeyGen, achieving higher scores in overall effectiveness, lip sync accuracy, visual quality, control response, and identity consistency [16][24].
- The generated lip movements were highly synchronized with audio, and facial expressions adapted naturally to vocal variations, even during complex phonetic sounds [25][26].
- Kling-Avatar's ability to generate long videos efficiently was highlighted, as it can produce multiple segments in parallel from a single blueprint video, maintaining quality and coherence throughout [28].

Group 4: Future Directions
- The Keling team aims to continue exploring advancements in high-resolution video generation, fine-tuned motion control, and complex multi-turn instruction understanding, striving to imbue digital humans with a genuine and captivating presence [31].
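The "one blueprint, parallel sub-segments" idea can be caricatured as fan-out over keyframe spans: each pair of adjacent blueprint keyframes seeds an independent rendering job, so segments render concurrently and are reassembled in order. The names and the thread pool below are purely illustrative; the actual system conditions a video generator on blueprint frames rather than formatting strings:

```python
# Toy fan-out: adjacent blueprint keyframes define independent segment
# jobs that can be rendered in parallel. render_segment is a placeholder
# for a conditioned video-generation call.
from concurrent.futures import ThreadPoolExecutor

def render_segment(span):
    start_frame, end_frame = span
    return f"segment[{start_frame}:{end_frame}]"

def generate_long_video(blueprint_keyframes):
    spans = list(zip(blueprint_keyframes, blueprint_keyframes[1:]))
    with ThreadPoolExecutor() as pool:
        # map preserves input order, so segments come back ready to concat.
        return list(pool.map(render_segment, spans))

print(generate_long_video([0, 120, 240, 360]))
# ['segment[0:120]', 'segment[120:240]', 'segment[240:360]']
```

Anchoring every job to the same blueprint is what keeps identity and style coherent across segments even though they are generated independently.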
The Latest Survey on Visual Reinforcement Learning: A Field-Wide Review (NUS, Zhejiang University & CUHK)
自动驾驶之心· 2025-08-16 00:03
Core Insights
- The article discusses the integration of Reinforcement Learning with Computer Vision, marking a paradigm shift in how AI interacts with visual data [3][4].
- It highlights the potential for AI to not only understand but also create and optimize visual content based on human preferences, transforming AI from passive observers to active decision-makers [4].

Research Background and Overview
- The emergence of Visual Reinforcement Learning (VRL) is driven by the successful application of Reinforcement Learning in Large Language Models (LLMs) [7].
- The article identifies three core challenges in the field: stability in policy optimization under complex reward signals, efficient processing of high-dimensional visual inputs, and scalable reward function design for long-term decision-making [7][8].

Theoretical Foundations of Visual Reinforcement Learning
- The theoretical framework for VRL includes formalizing the problem using Markov Decision Processes (MDP), which unifies text and visual generation RL frameworks [15].
- Three main alignment paradigms are proposed: RL with human feedback (RLHF), Direct Preference Optimization (DPO), and Reinforcement Learning with Verifiable Rewards (RLVR) [16][18].

Core Applications of Visual Reinforcement Learning
- The article categorizes VRL research into four main areas: Multimodal Large Language Models (MLLM), Visual Generation, Unified Models, and Visual-Language-Action (VLA) Models [31].
- Each area is further divided into specific tasks, with representative works analyzed for their contributions [31][32].

Evaluation Metrics and Benchmarking
- A layered evaluation framework is proposed, detailing specific benchmarks for each area to ensure reproducibility and comparability in VRL research [44][48].
- The article emphasizes the need for effective metrics that align with human perception and can validate the performance of VRL systems [61].

Future Directions and Challenges
- The article outlines four key challenges for the future of VRL: balancing depth and efficiency in reasoning, addressing long-term RL in VLA tasks, designing reward models for visual generation, and improving data efficiency and generalization capabilities [50][52][54].
- It suggests that future research should focus on integrating model-based planning, self-supervised visual pre-training, and adaptive curriculum learning to enhance the practical applications of VRL [57].
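Of the three alignment paradigms listed, DPO is the most self-contained to write down: it needs only the log-probabilities of a preferred and a rejected response under the policy and under a frozen reference model. A sketch of the per-pair loss (the β value is a common default, not taken from the survey; log-probs in the example are made up):

```python
# Per-pair DPO loss: -log sigmoid(beta * (policy-vs-reference margin of
# the preferred response minus that of the rejected one)). Minimizing it
# pushes the policy to prefer the chosen response more than the frozen
# reference model does, without an explicit reward model.
import math

def dpo_loss(logp_w, logp_l, ref_w, ref_l, beta=0.1):
    margin = beta * ((logp_w - ref_w) - (logp_l - ref_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Loss shrinks as the policy separates the preferred response (w) from
# the rejected one (l) relative to the reference:
print(dpo_loss(-1.0, -5.0, -2.0, -2.0))  # smaller than dpo_loss(-3.0, -3.0, -2.0, -2.0)
```

RLHF reaches the same goal indirectly via a learned reward model plus PPO, and RLVR replaces preferences with programmatically checkable rewards; DPO's appeal is collapsing that pipeline into one supervised-style loss.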
Large-Model Solutions for Autonomous Driving: An Overview of Vision-Language Model (VLM) Work for Production and Research
自动驾驶之心· 2025-08-06 23:34
Core Insights
- The article emphasizes the transformative potential of Vision-Language Models (VLMs) in enhancing the perception and cognitive capabilities of autonomous driving systems, enabling them to not only "see" but also "understand" complex driving environments [2][3].

Group 1: VLM Applications in Autonomous Driving
- VLMs can surpass traditional visual models by integrating camera images or video streams to comprehend semantic information in traffic scenes, such as recognizing complex scenarios like "a pedestrian waving to cross the street" [6].
- VLMs facilitate the conversion of intricate visual scenes into clear natural language descriptions, enhancing the interpretability of decisions made by autonomous systems, which aids in debugging and increases trust among passengers and regulators [6].
- VLMs are crucial for natural language interactions in future smart cabins, allowing passengers to communicate intentions to vehicles through spoken commands [6].

Group 2: Scenario Generation and Testing
- The article introduces CrashAgent, a multi-agent framework that utilizes multi-modal large language models to convert accident reports into structured scenarios for simulation environments, addressing the long-tail distribution issue in existing datasets [7].
- CurricuVLM is proposed as a personalized curriculum learning framework that leverages VLMs to analyze agent behavior and dynamically generate tailored training scenarios, improving safety in autonomous driving [13].
- TRACE is a framework that generates key test cases from real accident reports, significantly enhancing the efficiency of defect detection in autonomous driving systems [17].

Group 3: Out-of-Distribution (OOD) Scenario Generation
- A framework utilizing large language models is proposed to generate diverse OOD driving scenarios, addressing the challenges posed by the sparsity of such scenarios in urban driving datasets [21][22].
- The article discusses the development of a method to automatically convert real-world driving videos into detailed simulation scenarios, enhancing the testing of autonomous driving systems [26].

Group 4: Enhancing Safety and Robustness
- WEDGE is introduced as a synthetic dataset created from generative vision-language models, aimed at improving the robustness of perception systems in extreme weather conditions [39][40].
- LKAlert is a predictive alert system that utilizes VLMs to forecast potential lane-keeping assist (LKA) risks, enhancing driver situational awareness and trust [54][55].

Group 5: Advancements in Decision-Making Frameworks
- The CBR-LLM framework combines semantic scene understanding with case retrieval to enhance decision-making in complex driving scenarios, improving accuracy and reasoning consistency [44][45].
- ORION is presented as a holistic end-to-end autonomous driving framework that integrates visual-language instructed action generation, achieving superior performance in closed-loop evaluations [69][70].