Physical Reasoning
Six mini-games stump every top VLM! Angry Birds wipes them all out, with performance worse than random guessing
量子位· 2025-11-16 04:45
Core Insights
- The article introduces DeepPHY, the first comprehensive benchmark designed to systematically evaluate the interactive physical reasoning capabilities of Vision-Language Models (VLMs) [1][5][10]
- Despite advancements in VLMs for dynamic interaction environments, significant limitations remain in their ability to translate physical knowledge into precise and predictable control actions [4][7][29]

Group 1: DeepPHY Overview
- DeepPHY integrates six distinct physical challenge environments, ranging from fundamental physics to complex dynamics, to assess VLMs' interactive physical reasoning [12][19]
- The benchmark reveals that existing VLMs struggle with physical interaction, planning, and environmental adaptation, often performing no better than random action execution [10][18][29]

Group 2: Benchmark Environments
- The six environments included in DeepPHY are PHYRE, I-PHYRE, Kinetix, Pooltool, Angry Birds, and Cut the Rope, each focusing on a different aspect of physical reasoning [12][13][19]
- Each environment tests a distinct dimension of physical understanding, such as collision, gravity, or multi-body dynamics, with tasks that require strategic planning and real-time adaptation [14][19]

Group 3: Performance Evaluation
- A comprehensive evaluation of 17 mainstream VLMs, including both open-source and closed-source models, demonstrated widespread limitations in their physical reasoning capabilities [16][17]
- Many models could not surpass a random-action baseline, highlighting a fundamental disconnect between descriptive physical knowledge and actionable control signals [18][29]

Group 4: Key Findings
- VLMs often fail to learn from unsuccessful attempts, indicating an inability to construct accurate internal models of the physical world [22][29]
- VLM performance declines sharply as task complexity increases, revealing vulnerabilities in processing complex information and executing precise strategies [22][24]

Group 5: Implications for Future AI Development
- Current VLMs possess descriptive knowledge of physics but lack the predictive and procedural capabilities necessary for effective interaction with the physical world [29][30]
- The authors hope that DeepPHY will serve as a rigorous benchmark that encourages the development of AI agents that truly understand and can interact with physical environments [30]
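The "no better than random" finding amounts to comparing each model's episode success rate against a random-action baseline on the same tasks. A minimal sketch of that comparison, assuming a hypothetical environment interface (`reset`/`step`/`action_space`), not DeepPHY's actual API:

```python
import random

def run_episode(env, policy):
    """Run one episode with the given policy; return True on task success."""
    obs = env.reset()
    done, success = False, False
    while not done:
        action = policy(obs, env.action_space)
        obs, done, success = env.step(action)
    return success

def random_policy(obs, action_space):
    # Ignores the observation entirely: this is the baseline the VLMs
    # were compared against.
    return random.choice(action_space)

def success_rate(env, policy, episodes=100):
    """Fraction of episodes the policy solves."""
    wins = sum(run_episode(env, policy) for _ in range(episodes))
    return wins / episodes
```

A VLM agent would plug in as another `policy` (observation in, action out); a policy whose `success_rate` is statistically indistinguishable from `random_policy`'s is the failure mode the benchmark exposes.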
An open-source model wins IPhO physics-olympiad gold for the first time! Shanghai AI Lab's 235B model beats GPT-5 and Grok-4
量子位· 2025-10-25 06:23
Core Insights
- The open-source model P1-235B-A22B has won a gold medal at the International Physics Olympiad (IPhO), marking a significant achievement for open-source AI in complex physical reasoning [1][20]
- On the HiPhO benchmark, which covers 13 global physics competitions from 2024 to 2025, P1-235B-A22B earned 12 gold medals and 1 silver, tying for first place with Google's Gemini-2.5-Pro [3][19]
- P1-235B-A22B outperforms models such as GPT-5 and Grok-4, indicating that open-source models have reached or exceeded closed-source capabilities in physical reasoning [5][19]

Benchmark Testing
- The HiPhO benchmark was developed to evaluate models on physics-competition problems, aligning closely with human assessment standards [7][8]
- It spans 13 major physics competitions, enabling a comprehensive comparison of model performance against human competitors [7][8]

Training Methodology
- The P1 series models are trained with a multi-stage reinforcement learning process that includes strategies such as context window expansion and pass rate filtering to improve training efficiency [10][11][12]
- The training dataset consists of thousands of competition-level problems, each with complete context, verifiable answers, and standard solution processes [9]

Multi-Agent System
- The PhysicsMinions system, designed for collaborative evolution in physical reasoning, consists of three interactive modules that improve solution quality through self-verification and iterative reflection [13][14]
- The system yields significant improvements in reasoning quality and robustness on complex physics problems [13][14]

Performance Results
- P1-235B-A22B achieved an average score of 35.9 on the HiPhO benchmark, rising to 38.4 after integrating the PhysicsMinions system and outperforming other leading models [21]
- The model also shows significant advantages in other domains, including mathematics and coding, indicating strong generalization [22]
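The pass-rate filtering mentioned in the training methodology is a standard data-curation idea in RL for reasoning models: sample several rollouts per problem and drop problems that are either always solved (no learning signal) or never solved (no positive reward). A minimal sketch; the function name, thresholds, and `solve` interface are illustrative assumptions, not the paper's implementation:

```python
def pass_rate_filter(problems, solve, samples=8, low=0.1, high=0.9):
    """Keep problems whose empirical pass rate lies in (low, high).

    `solve(problem)` is assumed to sample one model attempt and return
    True if the verifiable answer matches. Problems the model always
    solves add no gradient signal; problems it never solves yield no
    positive reward, so both extremes are filtered out before training.
    """
    kept = []
    for problem in problems:
        passes = sum(solve(problem) for _ in range(samples))
        rate = passes / samples
        if low < rate < high:
            kept.append(problem)
    return kept
```

In practice the retained band would be tuned per training stage, so that the curriculum tracks the model's current frontier of solvable problems.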