Just In: Grok 4 Benchmark Scores Leaked — 45% on "Humanity's Last Exam," Double Gemini 2.5's, but Netizens Are Skeptical
机器之心·2025-07-05 02:46

Core Viewpoint
- The leaked benchmark results for Grok 4 and Grok 4 Code indicate significant performance improvements, suggesting the models may surpass competitors across a range of AI assessments [2][3][26].

Benchmark Results
- Grok 4 reportedly achieved a standard score of 35% on Humanity's Last Exam (HLE), rising to 45% with reasoning enabled — roughly double OpenAI's o3 and four to five times GPT-4o [3][5].
- On GPQA (Graduate-Level Google-Proof Q&A, a benchmark of graduate-level science questions), Grok 4 scored 87-88%, comparable to OpenAI's best result and well above Claude 4 Opus's roughly 75% [6].
- Grok 4 scored 95% on AIME '25 (the 2025 American Invitational Mathematics Examination), far above Claude 4 Opus's 34% and slightly better than OpenAI's o3, which scores 80-90% depending on reasoning mode [7].
- Grok 4 Code scored 72-75% on SWE-Bench, matching Claude Opus 4 and slightly surpassing OpenAI's o3 at 71.7% [8].

Model Development and Features
- Grok 4 is designed as a generalist model with capabilities in natural language, mathematics, and reasoning; its training reportedly completed on June 29 [17].
- The model supports a context window of approximately 130,000 tokens, suggesting a focus on optimizing reasoning speed rather than maximizing long-context performance [16].
- Grok 4 Code is tailored for programming tasks, allowing users to ask coding questions directly [18].

Development Process
- Elon Musk has been heavily involved in Grok 4's development, reportedly working overnight to push the model forward; he described progress as going well but said a final large-scale training run was still needed [20][23].
- The leaked benchmark scores have generated excitement and speculation about Grok 4's release, with expectations that it may be officially announced soon [25][26].