The hierarchy is upended! Gemini Flash outperforms Pro: "the Pareto frontier has flipped"
量子位·2025-12-22 08:01

Core Insights

- Gemini 3 Flash outperforms its predecessor Gemini 2.5 Pro, and even the flagship Gemini 3 Pro, on several benchmarks: it scores 78% on SWE-Bench Verified, ahead of Gemini 3 Pro's 76.2% [1][6][9]
- Gemini 3 Flash's result on the AIME 2025 mathematics competition benchmark is notable: 99.7% with code execution, indicating strong mathematical reasoning [7][8]
- The article argues for a shift in how flagship models are perceived: smaller, optimized models like Flash can outperform larger ones, challenging the traditional belief that bigger models are inherently better [19][20]

Benchmark Performance

- On Humanity's Last Exam, Flash scored 33.7% without tools, closely trailing Pro's 37.5% [7][8]
- Flash's other benchmark results include:
  - 90.4% on GPQA Diamond (scientific knowledge) [8]
  - 95.2% on AIME 2025 (mathematics, without tools) [8]
  - 81.2% on MMMU-Pro (multimodal understanding) [8]
- Flash runs three times faster than Gemini 2.5 Pro while consuming 30% fewer tokens, and is priced at $0.50 per million input tokens and $3.00 per million output tokens [9]; a back-of-envelope cost calculation appears in the first sketch at the end of this digest

Strategic Insights

- Google's team describes the Pro model's role as a teacher whose capabilities are "distilled" into Flash, with the focus on optimizing the performance-cost trade-off [10][12][13]; see the distillation sketch at the end of this digest
- The article discusses how scaling laws are evolving: the emphasis is shifting from simply adding parameters to strengthening reasoning through better training techniques [15][16]; the classic scaling-law form is recalled in the last sketch below
- Post-training is highlighted as the major frontier for future development, with substantial headroom remaining on open-ended tasks [17][18]

Paradigm Shift

- Flash's emergence has reopened debate over the "parameter supremacy" view, since it demonstrates that smaller, more efficient models can achieve superior performance [19][21]
- The advanced reinforcement-learning techniques used in Flash's training are cited as a key factor in its success, evidence that growing model size is not the only path to stronger capabilities [20][22]
- The article closes by urging readers to reconsider blind admiration for flagship models in favor of a more nuanced understanding of model performance [23]
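To make the quoted pricing concrete, here is a minimal back-of-envelope cost sketch using only the per-token rates reported above; the request counts and token sizes in the example are hypothetical placeholders, not figures from the article.

```python
# Back-of-envelope cost estimate for Gemini 3 Flash, using the article's
# quoted prices: $0.50 per million input tokens, $3.00 per million output tokens.
INPUT_PRICE_PER_M = 0.50   # USD per 1,000,000 input tokens (from the article)
OUTPUT_PRICE_PER_M = 3.00  # USD per 1,000,000 output tokens (from the article)

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of a single request at the quoted rates."""
    return (input_tokens / 1_000_000) * INPUT_PRICE_PER_M \
         + (output_tokens / 1_000_000) * OUTPUT_PRICE_PER_M

# Hypothetical workload: 10,000 requests of 2,000 input / 500 output tokens each.
per_request = request_cost(2_000, 500)
print(f"per request:  ${per_request:.6f}")            # $0.002500
print(f"10k requests: ${per_request * 10_000:.2f}")   # $25.00
```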
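The article does not describe Google's actual distillation pipeline. As a generic illustration of the teacher-student distillation the team alludes to, here is a minimal sketch of the standard soft-label loss (Hinton et al., 2015) in PyTorch; the tensor shapes and temperature are assumptions chosen purely for illustration.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 2.0) -> torch.Tensor:
    """Soft-label distillation loss: KL divergence between the
    temperature-softened teacher and student distributions, scaled
    by T^2 to keep gradient magnitudes stable across temperatures."""
    t = temperature
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    return F.kl_div(student_log_probs, teacher_probs,
                    reduction="batchmean") * (t * t)

# Toy usage: a batch of 4 positions over a 32-token vocabulary.
teacher_logits = torch.randn(4, 32)                        # stand-in for a large "Pro"-style teacher
student_logits = torch.randn(4, 32, requires_grad=True)    # small "Flash"-style student
loss = distillation_loss(student_logits, teacher_logits)
loss.backward()  # gradients flow to the student only
print(float(loss))
```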
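For reference, the parameter-centric view the article says is being revised is usually stated as the Chinchilla-style compute-optimal scaling law (Hoffmann et al., 2022); the functional form below is from that literature, not from this article.

```latex
% Pre-training loss as a function of parameter count N and training tokens D;
% E, A, B, \alpha, \beta are empirically fitted constants. The "bigger is better"
% view optimizes mainly over N, whereas the shift described in the article
% moves the remaining gains to post-training.
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
```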