Large Vision-Language Models
Large model deciphers oracle bone script and sets a new SOTA! Fudan team unveils a new framework
量子位· 2025-09-07 04:36
Core Viewpoint
- The article presents a novel explainable framework for deciphering oracle bone script based on radical and pictographic analysis, achieving state-of-the-art (SOTA) recognition accuracy and demonstrating strong zero-shot capabilities [1][5][71].

Group 1: Methodology and Framework
- The proposed method integrates radical recognition with pictographic semantic understanding to bridge the gap between the visual forms and meanings of oracle bone characters [5][71].
- A progressive training strategy guides the model from radical identification to pictographic analysis, culminating in a joint analysis that strengthens the deciphering process [6][15][22].
- The framework employs a dual matching mechanism that selects suitable candidates from a dictionary based on the analysis results, improving zero-shot performance (a sketch of the idea follows this summary) [28][71].

Group 2: Dataset and Training
- The research team built the PD-OBS dataset, which annotates 47,157 Chinese characters with oracle bone images and pictographic analysis texts, providing a valuable resource for future studies [9][73].
- Each character in the dataset is linked to oracle bone images, ancient script images, and modern standard script images, together with radical and pictographic analysis annotations [10][73].

Group 3: Experimental Results
- The new method was evaluated against existing methods on the HUST-OBC and EV-OBC datasets, showing competitive Top-1 and Top-10 accuracy and excelling in zero-shot scenarios [38][45].
- In the zero-shot setting, the proposed method outperformed all other methods, improving Top-10 accuracy by 26.2% on HUST-OBC and 13.6% on EV-OBC [45][46].
- The explainability of the model's outputs was quantitatively assessed with BERT-Score, showing higher reliability than other large vision-language models [47][50].

Group 4: Qualitative Analysis
- The model showed strong recognition ability in both the validation and zero-shot settings, generating semantically reasonable predictions for characters that human experts have not yet deciphered [66][68].
- The dual analysis of radicals and pictographs provides a comprehensive visual-semantic mapping, allowing the model to produce interpretable outputs even for undeciphered characters [68][70].
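The dual matching mechanism is only named in the summary, not specified. A minimal sketch of the idea, assuming the model emits free-text radical and pictographic analyses and that each dictionary entry carries reference analyses for both channels; the DictEntry schema, the character-overlap text_sim scorer, and the alpha weighting are hypothetical stand-ins, not the paper's actual method:

```python
from dataclasses import dataclass

@dataclass
class DictEntry:
    """A candidate character from the reference dictionary (hypothetical schema)."""
    char: str
    radical_desc: str       # reference radical analysis
    pictograph_desc: str    # reference pictographic analysis

def text_sim(a: str, b: str) -> float:
    """Toy character-overlap similarity; a stand-in for the paper's actual scorer."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / max(len(sa | sb), 1)

def dual_match(radical_out: str, pictograph_out: str,
               dictionary: list[DictEntry], alpha: float = 0.5, top_k: int = 10):
    """Rank dictionary candidates by combining radical and pictographic similarity."""
    scored = []
    for entry in dictionary:
        score = (alpha * text_sim(radical_out, entry.radical_desc)
                 + (1 - alpha) * text_sim(pictograph_out, entry.pictograph_desc))
        scored.append((score, entry.char))
    scored.sort(reverse=True)
    return [char for _, char in scored[:top_k]]

# Example: an undeciphered glyph whose generated analyses mention water and a vessel.
dictionary = [
    DictEntry("酒", "radical 氵 (water) beside 酉 (wine vessel)", "liquid poured from a jar"),
    DictEntry("河", "radical 氵 (water) beside 可", "a flowing river"),
]
print(dual_match("contains the water radical 氵", "depicts liquid and a vessel", dictionary))
```

The paper's scoring presumably operates on the model's actual analysis outputs and a much richer dictionary; the sketch is only meant to show the two-channel weighted matching that feeds the candidate ranking.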
Fudan's latest LMAD: toward an explainable end-to-end VLM
自动驾驶之心· 2025-08-19 23:32
Core Viewpoint
- The article discusses the LMAD framework, which significantly improves the reasoning performance of vision-language models (VLMs) in autonomous driving by addressing existing limitations in scene understanding and spatial perception [2][3].

Existing Method Limitations
- Current VLM-based autonomous driving methods face two key issues: fragmented scene understanding, which relies on intermediate results and fails to capture relationships between traffic elements, and weak spatial and motion perception, which leads to accumulated errors during inference [4].

Innovations of LMAD
- The LMAD framework introduces several core innovations:
  - A Preliminary Interaction (PI) mechanism that models initial relationships among traffic participants, reducing the learning burden on the VLM [6].
  - A task-specific expert structure built from parallel LoRA (P-LoRA) modules that focus the VLM on specific tasks such as perception, prediction, and planning (a sketch follows this summary) [6].
  - End-to-end system integration that injects prior knowledge from end-to-end driving systems, enriching spatial and motion information for stronger reasoning [6].

Overall Framework
- LMAD couples an end-to-end driving pipeline with a vision-language model and consists of three main components: the VLM for image and text token processing, a PI encoder for multi-view image handling, and P-LoRA modules for task-specific knowledge integration [8][10].

Key Module Design
- The PI encoder tackles redundancy in multi-view image processing through decoupled queries and an alternating attention mechanism [12][15].
- The P-LoRA design maintains multiple parallel branches, one per driving task, improving adaptability [16].

Training Strategy
- Training proceeds in two stages: single-branch fine-tuning, where only the language branch is adjusted, followed by joint training, which optimizes text generation and the end-to-end tasks simultaneously [18].

Experimental Results
- On the DriveLM benchmark, LMAD significantly improved baseline VLMs, with accuracy gains of 3.44% for LLaMA-Adapter and 3.89% for GPT [20].
- On the nuScenes-QA test, LMAD improved overall accuracy by 2.57% over the baseline [25].

Ablation Studies
- The contributions of the PI encoder, P-LoRA, and end-to-end tokens were confirmed, with the full configuration achieving the highest final score of 57.17 [28].
- The task-oriented P-LoRA design outperformed alternative configurations across the evaluated metrics [28].

Qualitative Analysis
- In perception tasks, LMAD accurately identified key targets, though it struggled with less conspicuous signs [34].
- In prediction tasks, LMAD usefully shaped subsequent planning despite discrepancies between predicted and actual targets [34].
- In planning tasks, LMAD produced driving behaviors consistent with the current environment by leveraging historical context [34].
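The parallel-LoRA expert structure lends itself to a compact sketch: a frozen base projection plus one low-rank branch per driving task, with the branch selected by a task id. The class below is an illustrative PyTorch sketch under those assumptions; the three-task split, rank, and scaling are placeholder choices rather than LMAD's actual configuration.

```python
import torch
import torch.nn as nn

class ParallelLoRALinear(nn.Module):
    """A frozen linear layer with one low-rank (LoRA) branch per driving task.

    Illustrative sketch of the parallel-LoRA idea, not the authors' code; task ids,
    rank, and scaling are placeholder choices.
    """
    def __init__(self, in_dim: int, out_dim: int, num_tasks: int = 3,
                 rank: int = 8, scale: float = 1.0):
        super().__init__()
        self.base = nn.Linear(in_dim, out_dim)
        self.base.weight.requires_grad_(False)   # keep the pretrained weight frozen
        self.base.bias.requires_grad_(False)
        self.scale = scale
        # One (A, B) low-rank pair per task: perception / prediction / planning.
        self.lora_A = nn.ParameterList(
            [nn.Parameter(torch.randn(rank, in_dim) * 0.01) for _ in range(num_tasks)])
        self.lora_B = nn.ParameterList(
            [nn.Parameter(torch.zeros(out_dim, rank)) for _ in range(num_tasks)])

    def forward(self, x: torch.Tensor, task_id: int) -> torch.Tensor:
        # Base path plus the task-specific low-rank update: W x + B_t A_t x.
        delta = x @ self.lora_A[task_id].T @ self.lora_B[task_id].T
        return self.base(x) + self.scale * delta

# Example: route the same hidden states through the planning branch (task_id=2).
layer = ParallelLoRALinear(in_dim=512, out_dim=512)
hidden = torch.randn(4, 16, 512)           # (batch, tokens, dim)
out = layer(hidden, task_id=2)
print(out.shape)                           # torch.Size([4, 16, 512])
```

In LMAD the equivalent branches would presumably sit inside the VLM's projection layers, with the task determined by the prompt or routing logic; the standalone linear layer here is only to keep the example self-contained.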