Multimodal Large Models (VLMs)
Six mini-games stump every top VLM! Angry Birds wipes them all out, with performance worse than random guessing
量子位 · 2025-11-16 04:45
Core Insights
- The article introduces DeepPHY, the first comprehensive benchmark designed to systematically evaluate the interactive physical reasoning capabilities of Vision-Language Models (VLMs) [1][5][10]
- Despite progress in applying VLMs to dynamic interactive environments, the models remain significantly limited in translating physical knowledge into precise, predictable control actions [4][7][29]

Group 1: DeepPHY Overview
- DeepPHY integrates six distinct physical challenge environments, ranging from fundamental physics to complex dynamics, to assess VLMs' interactive physical reasoning [12][19]
- The benchmark shows that existing VLMs struggle with physical interaction, planning, and environmental adaptation, often performing no better than random action execution [10][18][29]

Group 2: Benchmark Environments
- The six environments in DeepPHY are PHYRE, I-PHYRE, Kinetix, Pooltool, Angry Birds, and Cut the Rope, each focusing on a different aspect of physical reasoning [12][13][19]
- Each environment tests a different dimension of physical understanding, such as collision, gravity, and multi-body dynamics, through tasks that require strategic planning and real-time adaptation [14][19]

Group 3: Performance Evaluation
- A comprehensive evaluation of 17 mainstream VLMs, covering both open-source and closed-source models, demonstrated widespread limitations in their physical reasoning capabilities [16][17]
- Many models could not surpass a random-action baseline (see the evaluation sketch below), highlighting a fundamental disconnect between descriptive physical knowledge and actionable control signals [18][29]

Group 4: Key Findings
- VLMs often fail to learn from unsuccessful attempts, indicating an inability to construct accurate internal models of the physical world [22][29]
- Performance declines sharply as task complexity increases, revealing vulnerabilities in processing complex information and executing precise strategies [22][24]

Group 5: Implications for Future AI Development
- The findings suggest that current VLMs possess descriptive knowledge of physics but lack the predictive and procedural capabilities needed for effective interaction with the physical world [29][30]
- The authors hope that DeepPHY will serve as a rigorous benchmark that encourages the development of AI agents that truly understand and can act in physical environments [30]
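The random-action comparison referenced under Group 3 amounts to rolling each environment out twice, once with the VLM choosing actions and once with actions sampled uniformly at random, then comparing solve rates. Below is a minimal sketch in Python, assuming a gym-style reset/step interface; the `make_env` and `vlm_policy` callables, the step return values, and the environment name strings are illustrative assumptions, not DeepPHY's actual API.

```python
import random

# Hypothetical sketch of a random-action baseline comparison.
# The environment interface and agent policy below are assumed for
# illustration; they are not the DeepPHY codebase.

ENVIRONMENTS = ["PHYRE", "I-PHYRE", "Kinetix", "Pooltool", "AngryBirds", "CutTheRope"]

def run_episode(env, choose_action, max_steps=50):
    """Roll out one episode and return True if the task is solved."""
    obs = env.reset()
    for _ in range(max_steps):
        action = choose_action(obs, env.action_space)
        obs, solved, done = env.step(action)  # assumed return signature
        if done:
            return solved
    return False

def random_policy(obs, action_space):
    # Baseline: ignore the observation and sample a legal action uniformly.
    return random.choice(action_space)

def evaluate(make_env, vlm_policy, episodes=100):
    """Compare a VLM-driven policy against the random baseline per environment."""
    for name in ENVIRONMENTS:
        env = make_env(name)
        vlm_solved = sum(run_episode(env, vlm_policy) for _ in range(episodes))
        rnd_solved = sum(run_episode(env, random_policy) for _ in range(episodes))
        print(f"{name}: VLM {vlm_solved / episodes:.0%} vs random {rnd_solved / episodes:.0%}")
```

The article's headline finding corresponds to the VLM column of such a comparison landing at or below the random column across several of the six environments.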