Visual Reasoning
ICLR 2026 | Does this problem call for thinking with images? Let the model decide! Adaptive thinking-mode switching boosts general visual reasoning
机器之心· 2026-02-05 04:35
This work comes from Fudan University and the Alibaba Future Life Lab and has been accepted to ICLR 2026. Current visual reasoning methods have evolved into several thinking modes, chiefly a pure-text mode consistent with LLMs and a thinking-with-images mode that stays closer to the picture. The two modes each excel in different domains, but existing work focuses on a single mode and cannot exploit their complementarity. The paper therefore proposes mixture-of-visual-thoughts, an adaptive reasoning paradigm whose goal is to integrate different reasoning modes into one model and guide it toward adaptive mode selection. To teach the model this paradigm, the researchers introduce a two-stage learning framework, AdaVaR: SFT is used to learn the individual reasoning modes, and a dedicated AdaGRPO algorithm is designed to guide the model, under a reinforcement-learning setup, to pick the appropriate mode for each question (a minimal sketch of this kind of group-relative objective follows this entry).
Background: different thinking modes for visual reasoning
Visual reasoning for LVLMs (large vision-language models) has already been explored extensively, and the mainstream reasoning paradigms include the following two.
Paper title: Mixture-of-Visual-Thoughts: Exploring Context-Adaptive Reasoning Mode Selection for Ge ...
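To make the training recipe more concrete, the Python sketch below illustrates the general shape of GRPO-style training with an explicit mode choice: each sampled rollout first commits to a reasoning mode, and advantages are computed relative to the other rollouts for the same question. The reward terms, mode names, and data layout are illustrative assumptions, not the paper's actual AdaGRPO formulation.

```python
import numpy as np

def group_relative_advantages(rewards):
    """GRPO-style advantages: normalize each rollout's reward against the
    mean and std of its own group (all rollouts sampled for the same question)."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + 1e-6)

def reward_fn(rollout, gold_answer):
    """Hypothetical reward: 1.0 for a correct final answer, plus a small
    format bonus when the rollout declares a recognizable reasoning mode.
    Illustrative only; not the paper's actual AdaGRPO reward."""
    correct = 1.0 if rollout["answer"] == gold_answer else 0.0
    mode_ok = 0.1 if rollout.get("mode") in {"text", "visual"} else 0.0
    return correct + mode_ok

# One question, one group of sampled rollouts; each rollout commits to a mode first.
rollouts = [
    {"mode": "visual", "answer": "B"},
    {"mode": "text",   "answer": "A"},
    {"mode": "visual", "answer": "A"},
]
rewards = [reward_fn(r, gold_answer="A") for r in rollouts]
print(group_relative_advantages(rewards))  # correct rollouts get positive advantages
```

The intuition behind the group-relative normalization is that rollouts in an ill-suited mode tend to earn lower reward than their group mates on the same question, so the policy gradient gradually shifts probability toward whichever mode works for that kind of problem.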
Even the strongest large models' visual abilities lag behind a six-year-old's
36Ke· 2026-01-22 13:10
Core Insights
- The current state of visual reasoning in AI models, particularly Gemini 3 Pro Preview, is still significantly below human capabilities, with a performance level comparable to a three-year-old child and a 20% gap from six-year-olds [1][7][4]
- Gemini 3 Pro Preview is considered the leading model among existing AI systems, outperforming others like GPT-5.2 and Claude 4.5 Opus, which perform even worse than a three-year-old [5][10]
- The research highlights the limitations of current visual reasoning models, emphasizing the need for a fundamental reconstruction of visual capabilities rather than reliance on language-based translations [7][19]

Performance Comparison
- Among closed-source models, Gemini 3 Pro Preview leads with a score of 49.7%, followed by GPT-5.2 at 34.4% and Doubao-Seed-1.8 at 30.2% [10]
- Other models such as Qwen3-VL-Plus, Grok-4, and Claude-4.5 Opus scored significantly lower, at 19.2%, 16.2%, and 14.2% respectively [11]
- The best-performing open-source model, Qwen3VL-235B-Thinking, achieved a score of 22.2%, indicating that even the largest open-source models cannot compete with top closed-source systems [12][13]

Challenges in Visual Reasoning
- The research identifies four core challenges faced by multi-modal large language models (MLLMs) in visual reasoning:
  1. **Fine-grained Discrimination**: difficulty detecting subtle visual differences [19]
  2. **Visual Tracking**: inability to maintain perceptual consistency over long distances [22]
  3. **Spatial Perception**: difficulty constructing stable three-dimensional representations from two-dimensional images [28]
  4. **Visual Pattern Recognition**: difficulty generalizing rules from limited visual examples [34]

Proposed Solutions
- The study suggests two potential directions for improving visual reasoning capabilities (a minimal reward sketch for the first direction appears at the end of this entry):
  1. **Reinforcement Learning with Verifiable Rewards (RLVR)**: this approach showed an overall accuracy improvement of approximately 4.8 percentage points after fine-tuning, particularly on fine-grained discrimination and spatial perception tasks [36]
  2. **Generative Modeling**: the BabyVision-Gen benchmark evaluated three advanced visual generative models, with NanoBanana-Pro achieving the highest accuracy of 18.3% [38][39]

Future Trends
- The research indicates a shift towards unified architectures that bypass the "language bottleneck," allowing high-fidelity visual representations to be preserved during reasoning [44]
- Models like Bagel, Sora 2, and Veo 3 demonstrate the potential of generative methods to serve as advanced forms of reasoning, emphasizing the importance of maintaining visual integrity in AI systems [44]
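To make the RLVR direction concrete, here is a minimal Python sketch of a verifiable reward: the model's free-form output is parsed for a final answer and scored against a programmatically checkable label. The "Answer:" convention and the multiple-choice format are illustrative assumptions, not details of the BabyVision fine-tuning setup.

```python
import re

def verifiable_reward(model_output: str, gold: str) -> float:
    """Minimal RLVR-style reward: parse the model's final answer and compare
    it to a ground-truth label that can be checked automatically.
    The 'Answer: <letter>' convention is an assumption for illustration."""
    match = re.search(r"Answer:\s*([A-D])", model_output, flags=re.IGNORECASE)
    if match is None:
        return 0.0  # unparseable output earns no reward
    return 1.0 if match.group(1).upper() == gold.upper() else 0.0

print(verifiable_reward("...so the odd one out is C. Answer: C", gold="C"))  # 1.0
print(verifiable_reward("I think it is B", gold="C"))                        # 0.0
```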
Even the strongest large models' visual abilities lag behind a six-year-old's
量子位· 2026-01-22 11:13
Core Insights
- The current state of visual reasoning in AI models is still significantly behind human capabilities, with the best model, Gemini 3 Pro Preview, only slightly outperforming a three-year-old child and lagging 20% behind a six-year-old [2][10]
- Gemini 3 Pro Preview's performance is the highest among existing models, with a score of 49.7%, while other leading models like GPT-5.2 and Claude 4.5 Opus show even poorer results [6][14]
- The article emphasizes that future models need to rebuild visual capabilities from the ground up rather than relying on language-based translations of visual problems [11]

Performance Comparison
- Among closed-source models, Gemini 3 Pro Preview leads with 49.7%, followed by GPT-5.2 at 34.4% and Doubao-Seed-1.8 at 30.2% [14]
- Other models such as Qwen3-VL-Plus, Grok-4, and Claude-4.5-Opus scored significantly lower, indicating general underperformance on visual reasoning tasks [15]
- The best-performing open-source model, Qwen3VL-235B-Thinking, achieved a score of 22.2%, still far behind the top closed-source systems [16]

Challenges in Visual Reasoning
- The article identifies four core challenges faced by multi-modal large language models (MLLMs) in visual reasoning:
  1. **Lack of Non-verbal Fine Details**: MLLMs struggle to accurately describe fine visual details that cannot easily be expressed in language [25]
  2. **Loss of Manifold Consistency**: MLLMs often fail to maintain perceptual consistency over long distances, leading to errors in tasks involving spatial relationships [31]
  3. **Spatial Imagination**: MLLMs have difficulty constructing stable three-dimensional representations from two-dimensional images, which limits their ability to perform mental transformations [39]
  4. **Visual Pattern Induction**: MLLMs tend to count attributes rather than grasp the underlying changes across visual examples, limiting their ability to generalize from few examples [47]

Proposed Solutions
- The research suggests two potential directions for improving visual reasoning:
  1. **Reinforcement Learning with Verifiable Rewards (RLVR)**: this approach showed an overall accuracy improvement of 4.8 percentage points after fine-tuning, particularly on fine-grained discrimination and spatial perception tasks [56][58]
  2. **Generative Model Approaches**: the study introduces BabyVision-Gen, which evaluates generative models such as NanoBanana-Pro, GPT-Image-1.5, and Qwen-Image-Edit, noting that while success rates remain low, some models exhibit explicit visual-thinking capabilities [60][62]

Future Directions
- The article concludes that overcoming the "language bottleneck" in visual reasoning is crucial, advocating unified architectures that retain high-fidelity visual representations during reasoning [68][70]
- Models like Bagel and Sora 2 demonstrate the potential of generative methods to serve as advanced forms of reasoning, emphasizing the importance of robust visual-semantic understanding [71]
Quick take | A $50 million seed round out of the gate: former Google and Apple researchers launch an AI startup
Z Potentials· 2026-01-12 03:20
Core Insights
- Andrew Dai, a seasoned AI researcher with 14 years of experience, has left Google DeepMind to establish a startup named Elorian, focused on developing AI models that understand and process text, images, video, and audio simultaneously [1]
- Elorian is in discussions to raise approximately $50 million in seed funding, with Striker Venture Partners, founded by former VC partner Max Gazor, considering leading the round [1]
- Co-founder Yang Yinfeng, a former Apple researcher, contributed to the development of Elorian's AI models before leaving Apple in December [1]

Company Focus
- Elorian aims to create AI models that visualize and analyze the physical world by processing images, videos, and audio in concert [1]
- While robotics is one potential application of Elorian's AI, the company envisions a broader range of uses, though specific use cases have not been disclosed [1]

Industry Trends
- Early AI models, such as those from OpenAI, trained primarily on text but have shifted toward image and video training, part of a significant trend in the field known as visual reasoning [2]
- Visual reasoning models are designed for complex AI applications, integrating multiple functions and reducing the need for developers to stitch together different AI models [2]
- The technology is particularly valuable for AI agents that must interpret and understand images, supporting advanced tasks such as processing retail product returns and reviewing legal documents [2]

Research Background
- Andrew Dai has a strong background in model pre-training, having co-led data-centric pre-training work at Google DeepMind that laid the groundwork for the Gemini series of models [2]
- He is recognized as a pioneer in language models, with a focus on techniques for assessing the quality of training data for AI models [2]
NeurIPS 2025 Spotlight | FSDrive unifies VLA and world models, pushing autonomous driving toward visual reasoning
机器之心· 2025-09-30 08:45
Core Insights
- The article introduces FSDrive, a novel approach that uses a "Spatio-Temporal Chain-of-Thought" (CoT) to enhance visual reasoning in autonomous driving, moving away from traditional symbolic logic toward a more intuitive process of visual simulation and imagination [7][28]

Group 1: Methodology and Innovations
- FSDrive proposes a unified "visual intermediary" that replaces text or tabular intermediaries, effectively eliminating cross-modal semantic gaps [8]
- The method activates image generation capabilities on existing multi-modal large language models (MLLMs) at minimal cost by expanding the vocabulary with visual tokens, avoiding major architectural changes or extensive retraining (see the vocabulary-expansion sketch after this entry) [8][19]
- A progressive visual CoT is employed, starting with coarse-grained perception maps (lane lines and 3D boxes) and gradually generating detailed future frames, explicitly injecting physical realism [8][19]

Group 2: Performance and Metrics
- FSDrive demonstrates competitive performance in trajectory planning and scene understanding, achieving an average L2 error of 0.53 and a collision rate of 0.19, outperforming existing methods such as UniAD [29][22]
- The quality of future-frame generation is indicated by an FID score of 10.1 at a resolution of 128×192, surpassing many diffusion-based world models [22]
- On scene-understanding tasks, FSDrive achieves a final score of 0.57, exceeding other recent methods and showcasing the effectiveness of its unified pre-training approach [25]

Group 3: Practical Applications and Future Directions
- FSDrive maintains a simple end-to-end pipeline and interpretable visual reasoning while leveraging large amounts of unannotated video data to learn world-evolution patterns [9]
- The framework is adaptable to mainstream MLLMs, indicating its potential for broad application in the autonomous driving industry [20]
- Future developments may include extending the model to predict a unified panoramic view while addressing safety, privacy, and regulatory compliance as the technology matures [30]
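As an illustration of the vocabulary-expansion idea (not FSDrive's actual code), the sketch below shows how a codebook of discrete visual tokens might be appended to a Hugging Face-style causal LM so the same autoregressive head can emit image tokens alongside text. The checkpoint name and codebook size are placeholders.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2-0.5B"  # placeholder checkpoint for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

codebook_size = 8192  # assumed size of the visual tokenizer's codebook
visual_tokens = [f"<img_{i}>" for i in range(codebook_size)]
num_added = tokenizer.add_tokens(visual_tokens, special_tokens=False)

# Grow the embedding matrix (and tied LM head) to cover the new token ids.
model.resize_token_embeddings(len(tokenizer))
print(f"added {num_added} visual tokens; new vocab size = {len(tokenizer)}")
```

Because only the token table and embedding rows grow, the backbone stays untouched, which matches the article's point about avoiding major architectural changes or full retraining.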
NeurIPS'25 Spotlight! FSDrive, a new autonomous-driving paradigm: VLA + world model in tandem (Alibaba & Xi'an Jiaotong University)
自动驾驶之心· 2025-09-21 23:32
Core Insights
- The article discusses a spatio-temporal Chain-of-Thought (CoT) reasoning method for vision-language models (VLMs) in autonomous driving, emphasizing the need for visual reasoning over symbolic logic [1][4][24]
- It introduces a unified pre-training paradigm that enhances the visual generation capabilities of VLMs while preserving their semantic understanding [6][24]

Summary by Sections
Introduction
- Multi-modal large language models (MLLMs) have shown exceptional performance in knowledge and reasoning, leading to their application in autonomous driving [4]
- The end-to-end Vision-Language-Action (VLA) model simplifies the system architecture and minimizes information loss by generating vehicle control commands directly from visual observations and language instructions [4]
Methodology
- The spatio-temporal CoT method lets VLMs visualize and plan trajectories by generating unified image frames that predict future states, incorporating spatial and temporal relationships [5][11]
- The proposed method integrates visual cues and physical constraints to direct the model's attention toward drivable areas and key objects, improving trajectory planning [5][16]
Pre-training Paradigm
- A new pre-training approach combines visual understanding and generation, allowing VLMs to predict future frames while adhering to physical laws [6][12]
- The gradual image-generation scheme has the model first predict coarse-grained visual cues before generating detailed future frames, maintaining physical realism [15][24]
Experimental Results
- Extensive experiments validate the effectiveness of the FSDrive framework in trajectory planning, future-frame generation, and scene understanding, demonstrating its progress toward visual reasoning in autonomous driving [11][24]
Conclusion
- FSDrive establishes an end-to-end visual reasoning pipeline that unifies future-scene generation and perception results, effectively bridging the semantic gap caused by cross-modal conversion [24]
When AI becomes a "visual detective", how accurate is it, and how can privacy-exposure risks be guarded against?
21 Shi Ji Jing Ji Bao Dao· 2025-08-21 07:18
Core Insights
- The article discusses the launch of the GLM-4.5V visual reasoning model by Zhipu AI, claimed to be the world's best-performing model at the 100-billion-parameter scale, capable of accurately identifying image details and inferring background information without relying on search tools [1][5]
- The competition in visual reasoning capabilities among major AI companies, including OpenAI, Google, and domestic players like Doubao and Tongyi Qianwen, is highlighted, emphasizing the growing importance of multimodal capabilities in AI models [1][5]
- Concerns are raised about the privacy risks of AI's ability to pinpoint locations from images, especially in light of earlier models that sparked worries about "opening the box" (doxxing) [1][5][6]

Model Performance Summary
- In a practical test, Doubao achieved a 100% accuracy rate in identifying locations from images, while Zhipu's GLM-4.5V reached 60% and Tongyi Qianwen's QVQ-Max only 20% [2][3]
- The models were tested on five images with varying levels of identifiable landmarks; typical landmark photos were easier to identify, while more ambiguous images produced varied performance across the models [3][4]
- Doubao's superior performance is attributed to its ability to connect to the internet for real-time data retrieval, enhancing its accuracy in location identification [4][5]

Technical Developments
- Visual reasoning has become a competitive focus for AI models, with several new models released this year, including OpenAI's o3 and o4-mini and Google's Gemini 2.5 Pro, all showcasing advanced visual reasoning capabilities [5][6]
- Zhipu AI's GLM-4.5V reportedly outperformed 99% of human players in a global competition, demonstrating its ability to infer geographic coordinates from images [6]

Privacy Concerns
- A cited study indicates that advanced multimodal models, including those from OpenAI and Google, pose significant privacy risks by lowering the barrier for non-experts to extract location data from social media images [6][7]
- Experts suggest that AI companies implement safety boundaries for image analysis, such as restricting access to sensitive data and limiting the analysis of potentially risky requests, to mitigate privacy exposure [7][8]
When AI becomes a "visual detective", how accurate is it, and how can privacy-exposure risks be guarded against?
21 Shi Ji Jing Ji Bao Dao· 2025-08-21 07:09
Core Insights
- The article discusses the launch of the GLM-4.5V visual reasoning model by Zhipu AI, which claims to be the best in its class at the 100-billion-parameter scale, capable of accurately identifying image details and inferring background information without relying on search tools [1][6]
- The competition in visual reasoning capabilities among major AI players, including OpenAI, Google, and domestic companies like Doubao and Tongyi Qianwen, is highlighted, emphasizing the growing importance of multimodal capabilities in AI models [1][6]
- Concerns are raised about the privacy risks of AI's ability to pinpoint locations from images, particularly in light of earlier models that sparked "open box" (doxxing) worries [1][6][7]

Model Performance
- In a practical test, Doubao achieved a 100% accuracy rate in identifying locations from images, while Zhipu's GLM-4.5V reached 60% and Tongyi Qianwen's QVQ-Max only 20% [2][3]
- The models performed differently depending on the clarity and type of image, with landmark photos being the easiest to identify accurately [3][4]
- Doubao's superior performance is attributed to its ability to connect to the internet for real-time comparison, enhancing its accuracy [5]

Technical Developments
- The article notes rapid advances in visual reasoning technology, with several new models released this year, including OpenAI's o3 and o4-mini and Google's Gemini 2.5 Pro, all showcasing strong visual reasoning capabilities [6][7]
- Zhipu AI's GLM-4.5V has been tested in a global competition against top human players, demonstrating its competitive edge on visual reasoning tasks [7]

Privacy Concerns
- The ability of AI models to infer geographic locations from images raises significant privacy concerns; a cited study indicates that advanced multimodal models can lower the barrier for extracting users' location data from social media images [7][8]
- Experts recommend that AI companies implement safety boundaries for image analysis, such as restricting access to sensitive data like Exif information, to mitigate privacy risks (a minimal Exif-stripping sketch follows this entry) [8]
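As a concrete example of the Exif-related mitigation mentioned above, here is a minimal Pillow sketch that copies only pixel data into a fresh image, dropping Exif metadata (including any GPS tags) before a photo is shared; the file names are placeholders.

```python
from PIL import Image

def strip_exif(src_path: str, dst_path: str) -> None:
    """Save a copy of an image without Exif or other metadata."""
    with Image.open(src_path) as img:
        data = list(img.getdata())             # copy pixel data only
        clean = Image.new(img.mode, img.size)   # a fresh image carries no metadata
        clean.putdata(data)
        clean.save(dst_path)

strip_exif("photo.jpg", "photo_no_exif.jpg")
```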
Part "Sherlock Holmes", part "Leeuwenhoek": Zhipu open-sources the visual reasoning capability OpenAI has kept under wraps
机器之心· 2025-08-12 03:10
Core Viewpoint
- The article discusses the capabilities and applications of the open-source visual reasoning model GLM-4.5V, highlighting its advanced image recognition, reasoning abilities, and potential use cases across various fields [6][11][131]

Group 1: Model Capabilities
- GLM-4.5V demonstrated strong visual reasoning by accurately identifying locations from images, outperforming 99.99% of human players in a global game [9][10]
- The model can analyze complex images and videos and provide detailed insights and summaries, indicating its potential as a GUI-agent application [10][11]
- It excels at recognizing and interpreting visual elements, even in challenging scenarios such as visual illusions and occlusions [19][20][54]

Group 2: Practical Applications
- GLM-4.5V can accurately predict geographic locations from images, returning detailed location data in JSON format [21][27]
- Its ability to read and interpret complex documents, including charts and graphs, makes it useful for users who need local processing without cloud dependence [101][109]
- It can assist with tasks such as coding, video summarization, and document analysis, making it a versatile tool for developers and researchers [58][71][128]

Group 3: Technical Specifications
- GLM-4.5V has 106 billion total parameters and supports 64K multi-modal long contexts, enhancing its processing capabilities [127][128]
- The model employs techniques such as 2D-RoPE and 3D-RoPE for improved image and video processing (a generic 2D-RoPE sketch follows this entry) [127][128]
- Its training followed a three-phase strategy of pre-training, supervised fine-tuning, and reinforcement learning, contributing to state-of-the-art performance on various benchmarks [128][130]

Group 4: Industry Impact
- The open-source nature of GLM-4.5V allows greater transparency and customization, enabling developers to tailor the model to specific business needs [131][132]
- The shift from performance benchmarks to real-world applications signals a growing emphasis on practical utility in AI development, with GLM-4.5V positioned as a foundation model for various industries [131][132]
- The model represents an opportunity for developers to collaboratively shape the future of AI, moving beyond mere competition toward creating real-world value [133]
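For readers unfamiliar with 2D-RoPE, the NumPy sketch below shows a common way 2D rotary position embeddings extend standard 1D RoPE to image patches: half of each head's dimensions are rotated by the patch's row index, the other half by its column index. This is a generic illustration under those assumptions, not the GLM-4.5V implementation.

```python
import numpy as np

def rope_1d(x, pos, base=10000.0):
    """Standard 1D rotary embedding applied to the last dim of x (even length)."""
    d = x.shape[-1]
    inv_freq = base ** (-np.arange(0, d, 2) / d)   # (d/2,)
    angles = pos[..., None] * inv_freq              # (..., d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def rope_2d(x, rows, cols):
    """2D variant: first half of the head dim is rotated by the row index,
    the second half by the column index (head dim divisible by 4)."""
    half = x.shape[-1] // 2
    return np.concatenate(
        [rope_1d(x[..., :half], rows), rope_1d(x[..., half:], cols)], axis=-1
    )

# 6 image patches on a 2x3 grid, head dim 8.
rows = np.repeat(np.arange(2), 3).astype(float)  # [0, 0, 0, 1, 1, 1]
cols = np.tile(np.arange(3), 2).astype(float)    # [0, 1, 2, 0, 1, 2]
q = np.random.randn(6, 8)
print(rope_2d(q, rows, cols).shape)  # (6, 8)
```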
This new feature Doubao quietly rolled out can also reason about the whole world with its eyes.
数字生命卡兹克· 2025-08-07 01:05
Core Viewpoint
- The article discusses advances in AI products, focusing on the visual reasoning feature of the 豆包 (Doubao) app compared with OpenAI o3, and highlighting its practical everyday applications and user-friendliness [1][22][64]

Group 1: AI Product Comparison
- 豆包 has introduced a visual reasoning feature that lets users upload images and receive detailed analyses, showcasing its advanced capabilities [21][5]
- Unlike OpenAI o3, which requires payment, 豆包 offers the feature for free, making it more accessible to users [22][64]
- The article emphasizes the convenience of using 豆包 in everyday situations, such as identifying characters or locations from images, demonstrating its practical utility [24][68]

Group 2: Practical Applications
- The author recounts instances where 豆包 successfully identified a restaurant from a video screenshot and recognized popular-culture references, showing its effectiveness in real-world use [29][41]
- 豆包 can analyze complex images and provide accurate information even when details are not fully visible, indicating robust analytical capabilities [37][57]
- The app also performs well at answering trivia and identifying characters from various media, reflecting its broad knowledge base [49][51]

Group 3: User Experience
- Users experience a seamless interaction with 豆包, with knowledge and insights retrieved quickly, enhancing the overall experience [76][77]
- The article conveys excitement about AI's potential to speed up knowledge acquisition and understanding [76][77]
- The integration of AI into daily life is portrayed as a future norm in which users can expect immediate answers to their questions [76][77]