Workflow
算法元定理
icon
Search documents
GPT-5、Grok 4、o3 Pro都零分,史上最难AI评测基准换它了
机器之心· 2025-08-15 04:17
Core Viewpoint - The recent performance of leading AI models in the FormulaOne benchmark indicates that they struggle significantly with complex reasoning tasks, raising questions about their capabilities in solving advanced scientific problems [2][10][12]. Group 1: AI Model Performance - Google and OpenAI's models achieved gold medal levels in the International Mathematical Olympiad (IMO), suggesting potential for high-level reasoning [2]. - The FormulaOne benchmark, developed by AAI, resulted in zero scores for several advanced models, including GPT-5 and Gemini 2.5 Pro, highlighting their limitations in tackling complex graph structure dynamic programming problems [2][3]. - The overall success rates for the models in the benchmark were notably low, with GPT-5 achieving only 3.33% success overall, and all models scoring 0% in the deepest difficulty category [3][10][12]. Group 2: Benchmark Structure - The FormulaOne benchmark consists of 220 novel graph structure dynamic programming problems categorized into three levels: shallow, deeper, and deepest [3][4]. - The shallow category includes 100 easier problems, while the deeper category contains 100 challenging problems, and the deepest category has 20 highly challenging problems [4]. Group 3: AAI Company Overview - AAI, founded by Amnon Shashua in August 2023, focuses on advancing Artificial Expert Intelligence (AEI), which combines domain knowledge with rigorous scientific reasoning [14][18]. - The company aims to overcome traditional AI limitations by enabling AI to solve complex scientific or engineering problems like top human experts [19]. - Within its first year, AAI attracted significant investment and was selected for the AWS 2024 Generative AI Accelerator program, receiving $1 million in computing resources [19].