What Is Really at the Core of Image Goal Navigation?
具身智能之心·2025-07-04 12:07

Research Background and Core Issues

- Image goal navigation requires two key capabilities: core navigation skills and the extraction of directional information by comparing the current visual observation with the goal image [2]
- The research asks whether this task can be solved efficiently by training a complete agent end-to-end with reinforcement learning (RL) [2]

Core Research Content and Methods

- The study explores a range of architectural designs and their impact on task performance, with an emphasis on implicit correspondence computation between the observation and the goal image [3][4]
- Key architectures compared include Late Fusion, ChannelCat, SpaceToDepth + ChannelCat, and Cross-attention [4] (a minimal sketch of these fusion variants follows the summary)

Main Findings

- Early, patch-level fusion (ChannelCat, Cross-attention) supports implicit correspondence computation far better than late fusion (Late Fusion) [8]
- Performance of the different architectures varies sharply across simulator settings, in particular the "Sliding" setting [8][10]

Performance Metrics

- Success rate (SR) and Success weighted by Path Length (SPL) are used to evaluate the models [7] (an SPL sketch also follows the summary)
- For example, with Sliding=True, ChannelCat (ResNet9) reaches an SR of 83.6%, while Late Fusion reaches only 13.8% [8]

Transferability of Abilities

- Some learned capabilities transfer to more realistic environments, especially when the weights of the perception module are carried over [10]
- Training with Sliding=True and then fine-tuning in a Sliding=False environment improves SR from 31.7% to 38.5% [10]

Relationship Between Navigation and Relative Pose Estimation

- Navigation performance correlates with relative pose estimation accuracy, underlining the importance of directional information extraction in image goal navigation [12] (an illustrative probe sketch follows the summary)

Conclusion

- Architectural designs that support early, local fusion (Cross-attention, ChannelCat) are crucial for implicit correspondence computation [15]
- The simulator's Sliding setting strongly affects performance, but transferring the perception module's weights helps retain part of the learned capabilities in more realistic settings [15]
- Navigation performance is tied to relative pose estimation ability, confirming the central role of directional information extraction in image goal navigation [15]
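The fusion variants above differ mainly in where the observation and the goal image are combined. Below is a minimal PyTorch-style sketch of three of them, under stated assumptions: Late Fusion encodes the two images separately and concatenates feature vectors, ChannelCat stacks the images along the channel dimension before a shared encoder so patches can be compared early, and Cross-attention lets observation patch tokens attend to goal patch tokens (SpaceToDepth + ChannelCat additionally rearranges spatial blocks into channels, e.g. with nn.PixelUnshuffle, before concatenation). The module names, dimensions, and the ResNet-18 stand-in for the paper's ResNet9 are illustrative assumptions, not the paper's exact implementation.

```python
# Illustrative sketch of the fusion variants; not the paper's exact code.
import torch
import torch.nn as nn
import torchvision.models as models

def small_resnet(in_channels=3, out_dim=512):
    # ResNet-18 stand-in for the paper's ResNet9; first conv adapted to in_channels.
    net = models.resnet18(weights=None)
    net.conv1 = nn.Conv2d(in_channels, 64, kernel_size=7, stride=2, padding=3, bias=False)
    net.fc = nn.Linear(net.fc.in_features, out_dim)
    return net

class LateFusion(nn.Module):
    # Encode observation and goal separately; fuse only at the feature-vector level.
    def __init__(self, dim=512):
        super().__init__()
        self.enc_obs = small_resnet(out_dim=dim)
        self.enc_goal = small_resnet(out_dim=dim)
        self.head = nn.Linear(2 * dim, dim)

    def forward(self, obs, goal):                        # obs, goal: (B, 3, H, W)
        return self.head(torch.cat([self.enc_obs(obs), self.enc_goal(goal)], dim=-1))

class ChannelCat(nn.Module):
    # Stack observation and goal along channels so the encoder can compare patches early.
    def __init__(self, dim=512):
        super().__init__()
        self.enc = small_resnet(in_channels=6, out_dim=dim)

    def forward(self, obs, goal):
        return self.enc(torch.cat([obs, goal], dim=1))   # (B, 6, H, W)

class CrossAttentionFusion(nn.Module):
    # Observation patch tokens attend to goal patch tokens (early, local fusion).
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, obs_tokens, goal_tokens):          # (B, N_patches, dim)
        fused, _ = self.attn(query=obs_tokens, key=goal_tokens, value=goal_tokens)
        return self.proj(fused).mean(dim=1)              # pooled feature for the policy
```

In each case the resulting feature would feed the RL policy; the early-fusion variants give the encoder access to patch-level correspondences between observation and goal that Late Fusion discards, which matches the reported performance gap.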
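Both reported metrics are standard in embodied navigation: SR is the fraction of episodes in which the agent stops within the success threshold of the goal, and SPL (Success weighted by Path Length) additionally weights each success by the ratio of the shortest-path length to the length of the path actually taken. A minimal computation sketch, with illustrative variable names:

```python
def success_rate(successes):
    # SR: fraction of successful episodes (successes is a list of 0/1 flags).
    return sum(successes) / len(successes)

def spl(successes, shortest_lengths, taken_lengths):
    # SPL: each success is weighted by shortest_path / max(taken_path, shortest_path),
    # so detours reduce the score even when the episode succeeds.
    total = 0.0
    for s, l, p in zip(successes, shortest_lengths, taken_lengths):
        total += s * l / max(p, l)
    return total / len(successes)

# Example: 2 of 3 episodes succeed, but the second success took a long detour.
print(success_rate([1, 1, 0]))                           # ~0.667
print(spl([1, 1, 0], [5.0, 4.0, 6.0], [5.0, 8.0, 2.0]))  # 0.5
```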
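The reported correlation between navigation performance and relative pose estimation could be checked, for instance, with a small probe that regresses the relative pose between observation and goal from the agent's frozen fused features. The sketch below is an illustrative probe under that assumption, not the paper's evaluation protocol; RelPoseProbe and the 3-dimensional (dx, dy, heading) target are hypothetical names.

```python
import torch
import torch.nn as nn

class RelPoseProbe(nn.Module):
    # Hypothetical probe: regress relative pose (dx, dy, heading) from the frozen
    # fused observation/goal feature produced by the navigation agent.
    def __init__(self, feat_dim=512):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(feat_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 3),
        )

    def forward(self, fused_feat):                 # (B, feat_dim)
        return self.head(fused_feat)

# Usage sketch: the agent's encoder stays frozen, only the probe is trained.
probe = RelPoseProbe()
fused_feat = torch.randn(8, 512)                   # stand-in for frozen agent features
rel_pose_target = torch.randn(8, 3)                # stand-in for ground-truth poses
loss = nn.functional.mse_loss(probe(fused_feat), rel_pose_target)
loss.backward()
```

Under this kind of probe, lower pose-regression error coinciding with higher SR/SPL would support the summary's claim that directional information extraction is the core capability behind image goal navigation.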