FindingDory: A Benchmark for Evaluating Memory in Embodied Agents

Group 1
- The core issue in embodied intelligence is the lack of long-term memory, which limits the ability to process multimodal observational data across time and space [3]
- Current visual language models (VLMs) excel in planning and control tasks but struggle with integrating historical experiences in embodied environments [3][5]
- Existing video QA benchmarks fail to adequately assess tasks requiring fine-grained reasoning, such as object manipulation and navigation [5]

Group 2
- The proposed benchmark includes a task architecture that allows for dynamic environment interaction and memory reasoning validation [4][6]
- A total of 60 task categories are designed to cover spatiotemporal semantic memory challenges, including spatial relations, temporal reasoning, attribute memory, and multi-target recall [7]
- Key technical innovations include a programmatic expansion of task complexity through increased interaction counts and a strict separation of experience collection from interaction phases [9][6]

Group 3
- Experimental results reveal three major bottlenecks in VLM memory capabilities across 60 tasks, including failures in long-sequence reasoning, weak spatial representation, and collapse in multi-target processing [13][14][16]
- The performance of native VLMs declines as the number of frames increases, indicating ineffective utilization of long contexts [20]
- Supervised fine-tuning models show improved performance by leveraging longer historical data, suggesting a direction for VLM refinement [25]

Group 4
- The benchmark represents the first photorealistic embodied memory evaluation framework, covering complex household environments and allowing for scalable assessment [26]
- Future directions include memory compression techniques, end-to-end joint training to address the split between high-level reasoning and low-level execution, and the development of long-term video understanding [26]
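
To make the two design points in Group 2 concrete, here is a minimal sketch of how a benchmark might separate experience collection from the interaction phase and scale difficulty by the number of recorded interactions. It is not the FindingDory implementation: every name in it (`Interaction`, `collect_experience`, `make_task`, `evaluate`), the object and receptacle lists, and the task wording are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
import random


@dataclass(frozen=True)
class Interaction:
    """One pick-and-place event recorded during experience collection."""
    step: int        # order in which the interaction happened
    obj: str         # hypothetical object name
    receptacle: str  # hypothetical place where the object was put


def collect_experience(num_interactions, seed=0):
    """Phase 1: build an episode history; more interactions make recall harder."""
    rng = random.Random(seed)
    objects = ["mug", "book", "toy", "bowl", "remote"]    # illustrative only
    receptacles = ["sofa", "table", "shelf", "sink"]      # illustrative only
    return [
        Interaction(step=i, obj=rng.choice(objects), receptacle=rng.choice(receptacles))
        for i in range(num_interactions)
    ]


def make_task(history, k):
    """Phase 2: derive a temporal-recall task and its ground-truth answer."""
    if not 1 <= k <= len(history):
        raise ValueError("k must index an interaction in the history")
    target = history[k - 1]
    instruction = f"Navigate to the receptacle where interaction {k} placed its object."
    return instruction, target.receptacle


def evaluate(agent, history, k):
    """Score an agent that sees only the frozen history and the instruction."""
    instruction, answer = make_task(history, k)
    # The history is passed as an immutable tuple: the agent can read past
    # experience but cannot add to it once the interaction phase begins.
    return agent(tuple(history), instruction) == answer
```

Under these assumptions, calling `collect_experience` with a larger `num_interactions` yields a longer history for the same task template, which is one simple way to expand task complexity programmatically.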