VisionTrap: VLM+LLM Teach Models to Better Leverage Visual Features for Trajectory Prediction
自动驾驶之心·2025-08-20 23:33

Core Insights
- The article presents a novel method for trajectory prediction in autonomous driving that integrates visual inputs from surround-view cameras with textual descriptions to improve prediction accuracy [3][4][5]
- The proposed approach addresses a limitation of traditional methods that rely solely on HD maps and historical trajectories, which often lack real-time adaptability to changing environments [5][6]
- A new dataset, nuScenes-Text, enriches the existing nuScenes dataset with textual annotations and demonstrates the positive impact of vision-language models (VLMs) on trajectory prediction [4][6][37]

Group 1: Methodology
- The proposed model consists of four key components: a Per-agent State Encoder, a Visual Semantic Encoder, a Text-driven Guidance Module, and a Trajectory Decoder [7][10]
- The Per-agent State Encoder captures each agent's temporal features and its spatial interactions with other agents, using relative displacements and attention mechanisms (see the encoder sketch after the summary) [10][11]
- The Visual Semantic Encoder extracts image features from the environment and fuses them with the agent features to improve prediction accuracy (see the fusion sketch after the summary) [14][16]

Group 2: Data and Training
- The nuScenes-Text dataset was created with a fine-tuned VLM and large language models (LLMs), which generate detailed textual descriptions for each agent across diverse scenarios [37][39]
- Training employs multi-modal contrastive learning to align visual features with textual descriptions, improving the model's ability to extract relevant information from images [19][25]
- The training objective maximizes the similarity between positive pairs (an agent's features and its corresponding text) while minimizing the similarity between negative pairs (features and texts from different agents); see the loss sketch after the summary [19][20]

Group 3: Experimental Results
- Experiments show significant improvements in trajectory prediction accuracy, with gains of over 20% attributed to the Visual Semantic Encoder and the Text-driven Guidance Module [46][47]
- The model was validated on the full nuScenes dataset, and ablations confirm that each component contributes to the improved prediction metrics [47][48]
- Integrating visual and textual information also produced better clustering of agent state embeddings, indicating an improved understanding of agent behaviors [49][50]

Group 4: Conclusion
- The key innovation of the proposed method is using textual descriptions to guide the model toward learning visual semantic features, thereby improving trajectory prediction accuracy [53][54]
- The article underscores the importance of image information in trajectory prediction and the effectiveness of leveraging both visual and textual data [54]
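To make the Per-agent State Encoder concrete, here is a minimal PyTorch sketch. The summary only states that the encoder uses relative displacements and attention over agents; the module name, shapes, and layer choices (a GRU temporal encoder, a single self-attention layer) are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of a per-agent state encoder (assumed architecture).
import torch
import torch.nn as nn


class AgentStateEncoder(nn.Module):
    """Encodes each agent's motion history and its interactions with neighbors."""

    def __init__(self, d_model: int = 128, n_heads: int = 4):
        super().__init__()
        # Embed per-step relative displacements (dx, dy) rather than raw
        # positions, so the encoding is translation-invariant.
        self.motion_embed = nn.Linear(2, d_model)
        # Temporal encoder over each agent's own history.
        self.temporal = nn.GRU(d_model, d_model, batch_first=True)
        # Agent-to-agent attention to capture spatial interactions.
        self.social_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, trajectories: torch.Tensor) -> torch.Tensor:
        # trajectories: (num_agents, T, 2) absolute positions in a shared frame.
        disp = trajectories[:, 1:] - trajectories[:, :-1]  # relative displacements
        x = self.motion_embed(disp)                        # (A, T-1, d)
        _, h = self.temporal(x)                            # h: (1, A, d)
        scene = h.squeeze(0).unsqueeze(0)                  # treat all agents as one "scene"
        # Each agent attends to every other agent in the scene.
        out, _ = self.social_attn(scene, scene, scene)
        return out.squeeze(0)                              # (A, d) per-agent state features


# Usage: 6 agents, 20 observed timesteps -> one 128-d feature per agent.
enc = AgentStateEncoder()
feats = enc(torch.randn(6, 20, 2))  # feats.shape == (6, 128)
```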
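The fusion step in the Visual Semantic Encoder can be sketched the same way. The summary does not specify the fusion mechanism; cross-attention where agent features query flattened image tokens is one plausible reading, shown here purely for illustration.

```python
# Hypothetical sketch of fusing image features into agent features.
import torch
import torch.nn as nn


class VisualFusion(nn.Module):
    def __init__(self, d_model: int = 128, n_heads: int = 4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, agent_feats: torch.Tensor, image_feats: torch.Tensor) -> torch.Tensor:
        # agent_feats: (1, A, d) per-agent states; image_feats: (1, P, d)
        # flattened image tokens from a surround-view image backbone.
        fused, _ = self.cross_attn(agent_feats, image_feats, image_feats)
        return agent_feats + fused  # residual fusion keeps the original state signal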
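The training objective described in Group 2 matches the shape of a symmetric InfoNCE (CLIP-style) contrastive loss, sketched below. This is one common implementation of "maximize positive-pair, minimize negative-pair similarity"; the paper's exact formulation may differ.

```python
# Sketch of a multi-modal contrastive objective between agent features and
# the text embeddings of their descriptions (assumed InfoNCE form).
import torch
import torch.nn.functional as F


def contrastive_loss(agent_feats: torch.Tensor,
                     text_feats: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    # agent_feats, text_feats: (N, d); row i of each belongs to the same agent.
    a = F.normalize(agent_feats, dim=-1)
    t = F.normalize(text_feats, dim=-1)
    logits = a @ t.T / temperature            # scaled cosine similarities
    targets = torch.arange(len(a), device=a.device)
    # Diagonal entries are positives; every other agent in the batch is a negative.
    loss_a2t = F.cross_entropy(logits, targets)
    loss_t2a = F.cross_entropy(logits.T, targets)
    return 0.5 * (loss_a2t + loss_t2a)
```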
