HKUST(GZ) × Tencent pull off a Minecraft feat: just 400 screenshots let AI mine its way through the game, with cost cut to 5% | EMNLP 2025
量子位 (QbitAI) · 2025-09-04 04:41
Core Insights
- The article presents VistaWise, a framework developed by a joint team from Hong Kong University of Science and Technology (Guangzhou) and Tencent to improve how AI agents perform in complex open-world environments such as Minecraft [2][6].

Group 1: Framework Overview
- VistaWise combines "cross-modal knowledge graphs" with "lightweight visual fine-tuning" so that AI agents can operate effectively in open-world scenarios [3][6].
- The framework achieved a 33% success rate on the "diamond acquisition" task, surpassing previous state-of-the-art (SOTA) methods by 8 percentage points, with all nine sub-tasks exceeding a 73% success rate [4][18].

Group 2: Methodology
- The research team fine-tuned the visual model using only 471 game screenshots and a consumer-grade GPU with 24 GB of VRAM, significantly reducing training cost and complexity [6][17].
- A lightweight knowledge graph was built from text guides and encyclopedic knowledge and injected into the large model to reduce hallucinations [7][11].
- A "retrieval-based pooling" mechanism lets the model quickly access task-relevant information, improving efficiency [13].

Group 3: Performance Metrics
- VistaWise's training data is roughly five orders of magnitude smaller than prior work (471 screenshots vs. 160 million frames), and its GPU memory requirement is 87.5% lower (24 GB vs. 192 GB) [18].
- Compared with multimodal large models, VistaWise uses 30.7% fewer tokens while maintaining performance [18].

Group 4: Decision-Making Process
- The VistaWise decision cycle consists of four steps: perception, retrieval, reasoning, and execution [15][20].
- During inference, the system runs entirely on a local machine with an 8 GB GPU, which demonstrates its efficiency and accessibility [17].
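
The four-step decision cycle described above (perception, retrieval, reasoning, execution) can be illustrated with a toy loop. This is a minimal sketch under stated assumptions, not the VistaWise implementation: the triples in `TRIPLES`, the relation names, and the functions `perceive`, `retrieve`, `reason`, `execute`, and `run` are all hypothetical. Here a lookup stands in for the fine-tuned visual model, a simple filter stands in for retrieval over the knowledge graph, and a rule stands in for the large model's reasoning.

```python
# Hypothetical, simplified knowledge graph as (subject, relation, object) triples.
# Each triple reads: "subject is obtained via relation using object".
TRIPLES = [
    ("diamond", "mined_with", "iron_pickaxe"),
    ("iron_pickaxe", "crafted_from", "iron_ingot"),
    ("iron_ingot", "smelted_from", "iron_ore"),
    ("iron_ore", "mined_with", "stone_pickaxe"),
    ("stone_pickaxe", "crafted_from", "cobblestone"),
    ("cobblestone", "mined_with", "wooden_pickaxe"),
    ("wooden_pickaxe", "crafted_from", "planks"),
    ("planks", "crafted_from", "log"),
]


def perceive(frame):
    """Perception stub: a real agent would run a visual model on a screenshot."""
    return set(frame)


def retrieve(triples, keys):
    """Retrieval stub: keep only triples that mention a task-relevant entity."""
    return [t for t in triples if t[0] in keys or t[2] in keys]


def reason(facts, inventory, visible):
    """Reasoning stub: a rule in place of an LLM picks the next achievable action."""
    for subj, rel, obj in facts:
        if subj in inventory or obj not in inventory:
            continue  # already owned, or prerequisite missing
        if rel == "mined_with" and subj not in visible:
            continue  # cannot mine a block that is not in view
        return (rel, subj)
    return None


def execute(action, inventory):
    """Execution stub: assume the action always succeeds and yields the item."""
    _, item = action
    inventory.add(item)


def run(goal, frame, inventory, max_steps=20):
    """Repeat perception -> retrieval -> reasoning -> execution until the goal is held."""
    inventory = set(inventory)  # work on a copy
    history = []
    for _ in range(max_steps):
        if goal in inventory:
            break
        visible = perceive(frame)
        facts = retrieve(TRIPLES, visible | inventory | {goal})
        action = reason(facts, inventory, visible)
        if action is None:
            break  # nothing achievable with current knowledge and view
        execute(action, inventory)
        history.append(action)
    return history


if __name__ == "__main__":
    for step in run("diamond", ["cobblestone", "iron_ore", "diamond"], {"log"}):
        print(step)
```

In this sketch the retrieval step passes only the triples that touch the current view, inventory, or goal to the reasoning step, which mirrors the idea of giving the model task-relevant knowledge rather than the whole graph. If no ore is visible, the loop stops once the remaining actions all require mining.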