AI Model Evaluation
Gemini 3: a leaked release?
小熊跑的快· 2025-11-18 12:22
Core Insights
- Gemini 3 Pro significantly outperforms its predecessor, Gemini 2.5 Pro, across a wide range of benchmarks, showcasing enhanced reasoning and multimodal capabilities [2]
Benchmark Performance
- In the "Humanity's Last Exam" benchmark, Gemini 3 Pro achieved 37.5%, compared to 21.6% for Gemini 2.5 Pro [2]
- On visual reasoning puzzles (ARC-AGI-2), Gemini 3 Pro scored 31.1%, while Gemini 2.5 Pro managed only 4.9% [2]
- In scientific knowledge assessment (GPQA Diamond), Gemini 3 Pro scored 91.9%, outperforming Gemini 2.5 Pro's 86.4% [2]
- In mathematics (AIME 2025), Gemini 3 Pro achieved 95.0%, while Gemini 2.5 Pro scored 88.0% [2]
- The MathArena Apex benchmark showed Gemini 3 Pro at 23.4%, a significant improvement over Gemini 2.5 Pro's 0.5% [2]
- For multimodal understanding (MMMU-Pro), Gemini 3 Pro scored 81.0%, compared to 68.0% for Gemini 2.5 Pro [2]
- In screen understanding (ScreenSpot-Pro), Gemini 3 Pro achieved 72.7%, while Gemini 2.5 Pro scored only 11.4% [2]
- In OCR (OmniDocBench 1.5), Gemini 3 Pro reached an edit distance of 0.115 (lower is better), beating Gemini 2.5 Pro's 0.145 [2]
- For knowledge acquisition from videos (Video-MMMU), Gemini 3 Pro scored 87.6%, compared to 83.6% for Gemini 2.5 Pro [2]
- On competitive coding problems (LiveCodeBench Pro), Gemini 3 Pro reached an Elo rating of 2,439, significantly higher than Gemini 2.5 Pro's 1,775 [2]
- In agentic terminal coding (Terminal-Bench 2.0), Gemini 3 Pro scored 54.2%, while Gemini 2.5 Pro scored 32.6% [2]
- On the agentic coding benchmark SWE-Bench Verified, Gemini 3 Pro scored 76.2%, compared to 59.6% for Gemini 2.5 Pro [2]
- For long-horizon agent tasks (Vending-Bench 2), Gemini 3 Pro ended with a net worth of $5,478.16, vastly exceeding Gemini 2.5 Pro's $573.64 [2]
- In multilingual Q&A (MMMLU), Gemini 3 Pro scored 91.8%, slightly ahead of Gemini 2.5 Pro's 89.5% [2]
- The commonsense reasoning benchmark Global PIQA showed Gemini 3 Pro at 93.4%, compared to 91.5% for Gemini 2.5 Pro [2]
- Long-context performance (MRCR v2) put Gemini 3 Pro at 77.0% on 128k context, significantly better than Gemini 2.5 Pro's 58.0% [2]
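The percentage-based results above can be compared directly in percentage points. As an illustrative sketch (the scores are copied from the article; the ranking logic is my own, not part of the source), the following computes each benchmark's absolute gain and sorts largest first:

```python
# Scores reported above: benchmark -> (Gemini 3 Pro, Gemini 2.5 Pro), in percent.
# Percentage-only benchmarks; Elo, edit-distance, and dollar figures are omitted
# because they are not on a comparable percentage scale.
scores = {
    "Humanity's Last Exam": (37.5, 21.6),
    "ARC-AGI-2": (31.1, 4.9),
    "GPQA Diamond": (91.9, 86.4),
    "AIME 2025": (95.0, 88.0),
    "MathArena Apex": (23.4, 0.5),
    "MMMU-Pro": (81.0, 68.0),
    "ScreenSpot-Pro": (72.7, 11.4),
    "Video-MMMU": (87.6, 83.6),
    "Terminal-Bench 2.0": (54.2, 32.6),
    "SWE-Bench Verified": (76.2, 59.6),
    "MMMLU": (91.8, 89.5),
    "Global PIQA": (93.4, 91.5),
    "MRCR v2 (128k)": (77.0, 58.0),
}

# Absolute gain in percentage points per benchmark, largest gain first.
gains = sorted(
    ((name, g3 - g25) for name, (g3, g25) in scores.items()),
    key=lambda item: item[1],
    reverse=True,
)
for name, delta in gains:
    print(f"{name}: +{delta:.1f} pts")
```

Ranking by absolute gain shows where the generation gap is widest (screen understanding and long-context retrieval) versus where both models were already near saturation (MMMLU, Global PIQA).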
After revisiting GPT-5, I think it needs a funeral more than GPT-4o did
Hu Xiu· 2025-08-11 12:57
Core Insights
- The release of GPT-5 has not met user expectations, leading to disappointment compared to its predecessor, GPT-4o [1][14][106]
- OpenAI has reintroduced GPT-4o due to user demand, indicating dissatisfaction with GPT-5 [2][6][108]
Performance Comparison
- GPT-5 performs better on technical tasks but struggles with tasks requiring human-like understanding and emotional nuance, making it less effective for everyday productivity work [16][20][22]
- In creative tasks, GPT-5 has not shown significant improvement over GPT-4o, producing formulaic outputs that lack originality [18][80]
- The GPT-5 user experience is perceived as less empathetic and more robotic, limiting its ability to sustain meaningful conversations [19][91][98]
Testing Methodology
- A structured testing process was designed to compare GPT-5 and GPT-4o across various tasks, focusing on speed, accuracy, usability, and user experience [10][11][12]
- The tests included generating emails, data analysis, and creative writing, with results documented for direct comparison [9][21][33]
User Feedback
- Users have expressed frustration with GPT-5's performance, often stating it is less useful than GPT-4o, leading to a metaphorical "funeral" for the older model [4][5][107]
- The community's reaction has been overwhelmingly critical, with many users preferring the older model for its reliability and effectiveness [7][8][108]
Conclusion
- The overall sentiment is that GPT-5, while faster, does not provide a substantial upgrade over GPT-4o, prompting calls for a reassessment of its capabilities and user experience [14][106][110]