Gemini 3 Pro Preview
The Strongest Large Models' Visual Abilities Fall Short of a Six-Year-Old
36Kr · 2026-01-22 13:10
Core Insights
- The visual reasoning of current AI models, even Gemini 3 Pro Preview, remains far below human capability: roughly on par with a three-year-old child and about 20% behind six-year-olds [1][7][4]
- Gemini 3 Pro Preview is the leading model among existing AI systems, outperforming GPT-5.2 and Claude 4.5 Opus, which perform even worse than a three-year-old [5][10]
- The research highlights the limitations of current visual reasoning models, arguing that visual capabilities must be fundamentally rebuilt rather than approximated through language-based translations [7][19]

Performance Comparison
- Among closed-source models, Gemini 3 Pro Preview leads with a score of 49.7%, followed by GPT-5.2 at 34.4% and Doubao-Seed-1.8 at 30.2% [10]
- Qwen3-VL-Plus, Grok-4, and Claude-4.5 Opus scored significantly lower, at 19.2%, 16.2%, and 14.2% respectively [11]
- The best-performing open-source model, Qwen3VL-235B-Thinking, achieved 22.2%, indicating that even the largest open-source models cannot compete with top closed-source systems [12][13]

Challenges in Visual Reasoning
- The research identifies four core challenges faced by multi-modal large language models (MLLMs) in visual reasoning:
  1. **Fine-grained Discrimination**: difficulty detecting subtle visual differences [19]
  2. **Visual Tracking**: inability to maintain perceptual consistency over long distances [22]
  3. **Spatial Perception**: difficulty constructing stable three-dimensional representations from two-dimensional images [28]
  4. **Visual Pattern Recognition**: difficulty generalizing rules from limited visual examples [34]

Proposed Solutions
- The study suggests two potential directions for improving visual reasoning capabilities:
  1. **Reinforcement Learning with Verifiable Rewards (RLVR)**: fine-tuning with RLVR improved overall accuracy by approximately 4.8 percentage points, particularly on fine-grained discrimination and spatial perception tasks [36]
  2. **Generative Modeling**: the BabyVision-Gen benchmark evaluated three advanced visual generative models, with NanoBanana-Pro achieving the highest accuracy at 18.3% [38][39]

Future Trends
- The research indicates a shift toward unified architectures that bypass the "language bottleneck," retaining high-fidelity visual representations during reasoning [44]
- Models like Bagel, Sora 2, and Veo 3 demonstrate the potential for generative methods to serve as advanced forms of reasoning, underscoring the importance of maintaining visual integrity in AI systems [44]
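The RLVR direction mentioned above can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the article only says the paper used "verifiable rewards," so the exact-match reward and the group-relative advantage (in the style of GRPO-like training) are hypothetical stand-ins, not the paper's actual setup.

```python
# Minimal sketch of the "verifiable reward" idea behind RLVR.
# Assumption: benchmark items (e.g. multiple-choice visual puzzles) have a
# single checkable answer, so the reward needs no learned judge.

def verifiable_reward(model_answer: str, ground_truth: str) -> float:
    """Return 1.0 when the model's final answer matches the known-correct
    label after normalization, else 0.0."""
    normalize = lambda s: s.strip().lower()
    return 1.0 if normalize(model_answer) == normalize(ground_truth) else 0.0

def group_advantage(rewards: list[float]) -> list[float]:
    """GRPO-style group-relative advantage: reward minus the group mean,
    so sampled answers that beat the group's average get reinforced."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

# Example: four sampled answers to one visual question whose answer is "B".
rewards = [verifiable_reward(a, "B") for a in ["B", "C", "b ", "D"]]
print(rewards)                 # [1.0, 0.0, 1.0, 0.0]
print(group_advantage(rewards))  # [0.5, -0.5, 0.5, -0.5]
```

Because the reward is computed by a rule rather than a reward model, it cannot be gamed by fluent-but-wrong text, which is what makes it "verifiable."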
The Strongest Large Models' Visual Abilities Fall Short of a Six-Year-Old
QbitAI (量子位) · 2026-01-22 11:13
Core Insights
- The visual reasoning of current AI models remains significantly behind human capability: the best model, Gemini 3 Pro Preview, only slightly outperforms a three-year-old child and lags 20% behind a six-year-old [2][10]
- Gemini 3 Pro Preview scores highest among existing models at 49.7%, while other leading models like GPT-5.2 and Claude 4.5 Opus show even poorer results [6][14]
- The article emphasizes that future models must rebuild visual capabilities from the ground up rather than relying on language-based translations of visual problems [11]

Performance Comparison
- Among closed-source models, Gemini 3 Pro Preview leads with 49.7%, followed by GPT-5.2 at 34.4% and Doubao-Seed-1.8 at 30.2% [14]
- Qwen3-VL-Plus, Grok-4, and Claude-4.5-Opus scored significantly lower, indicating general underperformance in visual reasoning tasks [15]
- The best-performing open-source model, Qwen3VL-235B-Thinking, achieved 22.2%, still far behind the top closed-source systems [16]

Challenges in Visual Reasoning
- The article identifies four core challenges faced by multi-modal large language models (MLLMs) in visual reasoning:
  1. **Lack of Non-verbal Fine Details**: MLLMs struggle to accurately describe fine visual details that cannot easily be expressed in language [25]
  2. **Loss of Manifold Consistency**: MLLMs often fail to maintain perceptual consistency over long distances, leading to errors in tasks involving spatial relationships [31]
  3. **Spatial Imagination**: MLLMs have difficulty constructing stable three-dimensional representations from two-dimensional images, limiting their mental transformations [39]
  4. **Visual Pattern Induction**: MLLMs tend to count attributes rather than grasp the underlying changes in visual examples, limiting generalization from few examples [47]

Proposed Solutions
- The research suggests two potential directions for improving visual reasoning:
  1. **Reinforcement Learning with Verifiable Rewards (RLVR)**: fine-tuning with RLVR improved overall accuracy by 4.8 percentage points, particularly on fine-grained discrimination and spatial perception tasks [56][58]
  2. **Generative Model Approaches**: the study introduces BabyVision-Gen, which evaluates generative models like NanoBanana-Pro, GPT-Image-1.5, and Qwen-Image-Edit; while success rates remain low, some models exhibit explicit visual thinking capabilities [60][62]

Future Directions
- The article concludes that overcoming the "language bottleneck" in visual reasoning is crucial, advocating unified architectures that retain high-fidelity visual representations during reasoning [68][70]
- Models like Bagel and Sora 2 demonstrate the potential for generative methods to serve as advanced forms of reasoning, emphasizing the importance of robust visual semantic understanding [71]
Global Large Models See Intensive Upgrades, Reinforcing the AI Theme; Watch the Investment Value of E Fund Hang Seng Tech ETF (513010) and Other Products
Mei Ri Jing Ji Xin Wen (National Business Daily) · 2025-12-08 07:15
Core Insights
- Recent advancements in the overseas large model sector indicate a clearer direction for technological evolution, with significant improvements in complex-task handling and consumer-level applications [1]
- The current wave of large model updates shows three main trends: deepening reasoning, enhancing intelligent agents, and the proliferation of multimodal capabilities, yielding greater reliability and execution power [1]
- Clear commercialization paths for large models and the potential for ecosystem expansion highlight the long-term investment value of Hong Kong tech leaders, which may enhance profitability visibility and valuation recovery [1]

Industry Developments
- Gemini 3 Pro Preview has introduced deep reasoning modes, significantly boosting its ability to handle complex tasks [1]
- The launch of the Sora App and breakthroughs from Anthropic's Claude Opus 4.5 signal rapid advancement of AI toward consumer applications and high-performance forms [1]
- DeepSeek's release of V3.2 and V3.2-Speciale showcases industry-leading reasoning capabilities, with the combination of "thinking modes + tool invocation" enhancing execution and reasoning efficiency [1]

Investment Opportunities
- The Hang Seng Tech Index comprises the 30 largest tech-related stocks listed in Hong Kong, focused on high-growth sectors like AI and the internet, facilitating a "soft and hard collaboration" AI layout [2]
- The CSI Hong Kong Stock Connect Internet Index includes 30 stocks involved in internet-related businesses, covering key players across AI application fields [2]
- Recent inflows into the Hang Seng Tech ETF and the Hong Kong Stock Connect Internet ETF have reached record highs of 25.7 billion and 7.3 billion respectively, indicating strong investor interest in large models and AI applications [2]
X @Tesla Owners Silicon Valley
Market Position
- xAI's Grok 4.1 Fast claims the #1 position on OpenRouter's Trending Leaderboard [1]
- xAI is rapidly gaining market share in the AI industry [1]

Model Performance & Adoption
- Grok 4.1 Fast's 2-million-token context window, frontier performance, and free tier contribute to its widespread adoption [1]

Model Size Comparison
- Grok 4.1 Fast: 275 billion parameters [2]
- Gemini 3 Pro Preview: 129 billion parameters [2]
- Claude 4.5 Sonnet: 67 billion parameters [2]