Google's Gemini 3 Turns GPT-5.1 into a Unit of Measurement! Even Musk and Altman Are Impressed

Core Insights
- Google's Gemini 3 Pro is a significant advance over its predecessor, Gemini 2.5 Pro, and outperforms GPT-5.1 and Claude 4.5 on nearly all benchmarks, including academic reasoning and visual reasoning puzzles [1][2].

Benchmark Performance
- On "Humanity's Last Exam," Gemini 3 Pro scored 37.5% without tools and 45.8% with search and code execution, compared to 21.6% for Gemini 2.5 Pro [2].
- On the ARC-AGI-2 visual reasoning puzzles, Gemini 3 Pro achieved 31.1%, up from 4.9% for Gemini 2.5 Pro [2].
- In mathematics, Gemini 3 Pro scored 95.0% on AIME 2025 without tools and 100% with code execution [2].
- On LiveCodeBench Pro, Gemini 3 Pro reached an Elo rating of 2,439, well above Gemini 2.5 Pro's 1,775 [2].

Model Evolution
- Each generation of the Gemini series has addressed the shortcomings of the previous one. The first generation established multimodal capabilities, while the second focused on decision-making and planning [15][18].
- Gemini 2.5 introduced a reasoning engine for deeper reasoning and problem-solving. The current generation integrates multimodal, reasoning, and agent capabilities [19][20].

User Interaction and Usability
- Gemini 3 Pro is designed to understand user intent better, so users can interact with it more directly without writing complex prompts [21].
- The model can process text, images, video, audio, and code together, which broadens the range of applications it can serve [23].

Development Platform
- Google introduced the Antigravity platform alongside Gemini 3 Pro. It aims to simplify the development of AI agents so that developers can focus on higher-level tasks [29][33].
- Antigravity supports multiple models, including third-party options, and has attracted significant developer interest due to its generous rate limits [33].

Future Developments
- A more advanced version, Gemini 3 Deep Think, is in development, promising further enhancements in capabilities [13][14].