Visual Reasoning
NeurIPS 2025 Spotlight | FSDrive Unifies VLA and World Models, Pushing Autonomous Driving Toward Visual Reasoning
机器之心· 2025-09-30 08:45
Multimodal large models for autonomous driving mostly use text or symbols as the medium of their "reasoning chain," which tends to blur spatial-temporal relationships and lose fine-grained information. FSDrive (FutureSightDrive) proposes a "spatio-temporal visual CoT" (Spatio-Temporal Chain-of-Thought) that lets the model "think in images" directly: unified future image frames serve as intermediate reasoning steps, and the model reasons visually over future scenes and perception results together.

Without modifying the original MLLM architecture, the method activates image-generation capability through "vocabulary expansion + autoregressive visual generation" and injects physical priors through an easy-to-hard progressive visual CoT. The model acts both as a "world model" that predicts the future and as an "inverse dynamics model" that plans trajectories.

Multimodal large language models (MLLMs), with their world knowledge and interpretable reasoning, are rapidly entering the end-to-end "vision-language-action" (VLA) paradigm for autonomous driving. But existing approaches mostly rely on discrete text CoT (e.g., rule descriptions, coordinates), which is essentially a heavy symbolic compression of visual information and suffers from a cross-modal semantic gap and weak representation of spatio-temporal relationships.

Project page: https://miv-xjtu.github.io/FSDrive.github.io/
Paper: https://arxiv.org/abs/25 ...
NeurIPS'25 Spotlight! FSDrive, a New Self-Driving Paradigm: VLA + World Model in a Two-Pronged Approach (Alibaba & XJTU)
自动驾驶之心· 2025-09-21 23:32
Vision-language models (VLMs) are drawing growing attention in autonomous driving for their strong reasoning ability. However, existing VLMs typically adopt discrete text chains-of-thought (CoT) designed for specific scenarios; this representation is essentially a highly abstract, symbolic compression of visual information and can blur spatial-temporal relationships and lose fine-grained detail. Is autonomous driving better served by simulating and imagining the real world than by purely symbolic logic? This paper proposes a spatio-temporal CoT reasoning method that lets the model think visually.

First, the VLM acts as a world model, generating a unified image frame to predict future world states: perception results (e.g., lane dividers and 3D detections) represent future spatial relationships, while an ordinary future frame represents the dynamics of temporal evolution. This spatio-temporal CoT then serves as an intermediate reasoning step, letting the VLM act as an inverse dynamics model that plans trajectories from current observations and future predictions. To give the VLM visual generation capability, the paper proposes a pretraining paradigm that unifies visual generation and understanding, and designs a progressive generation process to strengthen autoregressive image generation. Extensive experiments validate the method's effectiveness, advancing autonomous driving toward visual reasoning.

Project: https://miv-xjtu.github.io/FSDrive.github.io/
Paper: https://arxiv.org/abs/ ...
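Neither write-up includes code, but the two-stage pipeline both describe, first generating a future frame as a visual chain of thought, then planning a trajectory conditioned on it, can be sketched roughly as below. This is a minimal illustration under stated assumptions: `mllm` is a hypothetical stand-in for an autoregressive MLLM whose vocabulary has been extended with discrete image tokens; FSDrive's actual token layout, progressive curriculum, and decoding are specified in the paper.

```python
import torch

# `mllm` is a hypothetical callable standing in for FSDrive's actual model:
# an autoregressive MLLM whose vocabulary was extended with discrete image
# tokens, returning next-token logits of shape (batch, seq, vocab).

def spatio_temporal_cot(mllm, current_obs_tokens, n_img_tokens=256):
    """Stage 1 (world model): autoregressively emit the tokens of a unified
    future frame, which serves as the intermediate visual reasoning step."""
    seq = current_obs_tokens
    future_tokens = []
    for _ in range(n_img_tokens):
        logits = mllm(seq)                      # next-token prediction
        tok = logits[:, -1].argmax(dim=-1)      # greedy, for illustration
        future_tokens.append(tok)
        seq = torch.cat([seq, tok[:, None]], dim=1)
    return torch.stack(future_tokens, dim=1)

def plan_trajectory(mllm, current_obs_tokens, future_tokens, horizon=6):
    """Stage 2 (inverse dynamics): condition on current observation plus the
    visual CoT and decode waypoint tokens for the planned trajectory."""
    seq = torch.cat([current_obs_tokens, future_tokens], dim=1)
    waypoints = []
    for _ in range(horizon):
        logits = mllm(seq)
        tok = logits[:, -1].argmax(dim=-1)
        waypoints.append(tok)
        seq = torch.cat([seq, tok[:, None]], dim=1)
    return waypoints  # decoded downstream into (x, y) waypoints
```

The design point both articles emphasize is that a single unchanged MLLM serves both stages: the same next-token loop emits image tokens when acting as a world model and waypoint tokens when acting as an inverse dynamics model.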
When AI Becomes a "Visual Detective": How Accurate Is It, and How Can Privacy Exposure Risks Be Contained?
Core Insights
- The article discusses the launch of Zhipu AI's GLM-4.5V visual reasoning model, claimed to be the world's best performer at the 100-billion-parameter scale, able to identify image details and infer background information accurately without relying on search tools [1][5]
- Competition over visual reasoning among major AI companies, including OpenAI, Google, and domestic players such as Doubao and Tongyi Qianwen, is highlighted, underscoring the growing importance of multimodal capabilities in AI models [1][5]
- Concerns are raised about the privacy risks of AI pinpointing locations from images, especially given earlier models that stoked "open-box" doxxing worries [1][5][6]

Model Performance Summary
- In a hands-on test, Doubao identified locations from images with 100% accuracy, Zhipu's GLM-4.5V reached 60%, and Tongyi Qianwen's QVQ-Max only 20% [2][3]
- The models were tested on five images with varying levels of identifiable landmarks; typical landmark photos were easy to identify, while more ambiguous images produced widely varied results [3][4]
- Doubao's lead is attributed to its ability to search the internet for real-time data, boosting its accuracy in location identification [4][5]

Technical Developments
- Visual reasoning has become a competitive focus for AI models, with several new releases this year, including OpenAI's o3 and o4-mini and Google's Gemini 2.5 Pro, all showcasing advanced visual reasoning capabilities [5][6]
- Zhipu AI's GLM-4.5V reportedly outperformed 99% of human players in a global geolocation competition, demonstrating advanced inference of geographic coordinates from images [6]

Privacy Concerns
- The article cites a study finding that advanced multimodal models, including those from OpenAI and Google, pose significant privacy risks by lowering the barrier for non-experts to extract location data from social-media images [6][7]
- Experts suggest AI companies set safety boundaries on image analysis, such as restricting access to sensitive data like Exif metadata (a user-side version of this is sketched below) and declining to analyze potentially dangerous requests [7][8]
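One concrete, user-side counterpart to the Exif point above is stripping metadata before a photo is shared. Below is a minimal Pillow sketch of the general technique (an illustration, not any vendor's actual safeguard): re-saving only the pixel data discards Exif tags, including embedded GPS coordinates.

```python
from PIL import Image

def strip_metadata(src_path: str, dst_path: str) -> None:
    """Re-save an image with pixel data only, discarding Exif metadata
    (camera model, timestamps, and embedded GPS coordinates)."""
    with Image.open(src_path) as img:
        clean = Image.new(img.mode, img.size)   # fresh image, no metadata
        clean.putdata(list(img.getdata()))      # copy pixels only
        clean.save(dst_path)

strip_metadata("photo.jpg", "photo_clean.jpg")
```

Note that this removes only metadata; the cited study's deeper concern is that models infer location from image content itself, which no amount of metadata stripping prevents.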
Part Sherlock Holmes, Part Leeuwenhoek: Zhipu Open-Sources the Visual Reasoning Capability OpenAI Has Kept Under Wraps
机器之心· 2025-08-12 03:10
Core Viewpoint
- The article discusses the capabilities and applications of the open-source visual reasoning model GLM-4.5V, highlighting its advanced image recognition, reasoning abilities, and potential use cases across fields [6][11][131]

Group 1: Model Capabilities
- GLM-4.5V demonstrated strong visual reasoning by accurately identifying locations from images, outperforming 99.99% of human players in a global geolocation game [9][10]
- The model can analyze complex images and videos and provide detailed insights and summaries, pointing to its potential as a GUI-agent application [10][11]
- It excels at recognizing and interpreting visual elements even in challenging scenarios such as visual illusions and occlusions [19][20][54]

Group 2: Practical Applications
- GLM-4.5V can accurately predict geographic locations from images and return detailed location data in JSON format [21][27]
- Its ability to read and interpret complex documents, including charts and graphs, makes it useful for users who need local processing without cloud dependency [101][109]
- It can assist with tasks such as coding, video summarization, and document analysis, making it a versatile tool for developers and researchers [58][71][128]

Group 3: Technical Specifications
- GLM-4.5V has 106 billion total parameters and supports a 64K-token multimodal context [127][128]
- The model employs techniques such as 2D-RoPE and 3D-RoPE for improved image and video processing (a generic sketch follows this list) [127][128]
- Training followed a three-phase strategy of pre-training, supervised fine-tuning, and reinforcement learning, yielding state-of-the-art results across a range of benchmarks [128][130]

Group 4: Industry Impact
- The open-source release allows greater transparency and customization, letting developers tailor the model to specific business needs [131][132]
- The shift from benchmark scores to real-world applications signals a growing emphasis on practical utility, with GLM-4.5V positioned as a foundation model for many industries [131][132]
- The model gives developers an opportunity to shape the future of AI collaboratively, moving beyond competition toward real-world value [133]
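The spec list names 2D-RoPE and 3D-RoPE without detail. As rough intuition only, here is a generic 2D rotary position embedding sketch, not GLM-4.5V's exact formulation: half of each patch vector's channels are rotated by the patch's row index and half by its column index, so attention scores become sensitive to relative 2D offsets between image patches.

```python
import torch

def rope_1d(x: torch.Tensor, pos: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate channel pairs of x by position-dependent angles (standard RoPE)."""
    d = x.shape[-1]
    inv_freq = 1.0 / (base ** (torch.arange(0, d, 2, dtype=torch.float32) / d))
    angles = pos[:, None].float() * inv_freq      # (seq, d/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def rope_2d(x: torch.Tensor, rows: torch.Tensor, cols: torch.Tensor) -> torch.Tensor:
    """2D RoPE: first half of channels encodes the row index, second half the column."""
    half = x.shape[-1] // 2
    return torch.cat([rope_1d(x[..., :half], rows),
                      rope_1d(x[..., half:], cols)], dim=-1)

# Toy usage: queries for a 4x4 grid of image patches, 8 channels per patch.
patches = torch.randn(16, 8)
rows, cols = torch.arange(16) // 4, torch.arange(16) % 4
q = rope_2d(patches, rows, cols)
```

A 3D variant extends the same idea with a third channel group rotated by the frame (time) index, which is what makes the scheme natural for video.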
The New Feature Doubao Quietly Launched Can Also Reason About the Whole World with Its Eyes
数字生命卡兹克· 2025-08-07 01:05
Core Viewpoint
- The article discusses advances in AI products, focusing on the visual reasoning capabilities of the Doubao (豆包) app compared with OpenAI's o3, and highlighting its practical everyday applications and user-friendliness [1][22][64]

Group 1: AI Product Comparison
- Doubao has introduced a visual reasoning feature that lets users upload images and receive detailed analyses, showcasing its advanced capabilities [21][5]
- Unlike OpenAI's o3, which sits behind a paywall, Doubao offers the feature for free, making it far more accessible [22][64]
- The article emphasizes Doubao's convenience in everyday situations, such as identifying characters or locations from images [24][68]

Group 2: Practical Applications
- The author recounts Doubao correctly identifying a restaurant from a video screenshot and recognizing pop-culture references, showing its effectiveness in real-world use [29][41]
- Doubao can analyze complex images and return accurate information even when details are partially obscured, indicating robust analytical capability [37][57]
- The app also does well at trivia and at identifying characters across various media, reflecting a broad knowledge base [49][51]

Group 3: User Experience
- Interaction with Doubao is seamless, with knowledge and insights retrieved quickly, enhancing the overall experience [76][77]
- The article conveys excitement about AI's potential to speed up knowledge acquisition and understanding [76][77]
- Integrating AI into daily life is portrayed as a coming norm in which users expect immediate answers to their questions [76][77]
o3's Viral "Guess the Location from a Photo" Game Comes to Doubao, and It's Free for Everyone
量子位· 2025-07-30 06:06
Core Viewpoint
- The article discusses the Doubao app's new visual reasoning feature, which strengthens its ability to analyze images and supply contextual information, making it a versatile tool for users [1][4][66]

Group 1: Doubao App Features
- Doubao has upgraded its visual reasoning to analyze images and provide detailed context, such as identifying locations and historical timelines [4][8]
- The app can run image searches and apply image-analysis tools (zooming, cropping, rotating) to reach conclusions, as sketched after this list [7][50]
- Users engage by uploading images or taking photos to receive instant analysis and information [5][26]

Group 2: Practical Applications
- Doubao can identify objects or details within images, including distinguishing AI-generated images from real ones [11][20]
- It can help with educational tasks such as solving complex math problems, validated against human solutions [40][43]
- It can extract structured data from financial reports and other documents, boosting personal and professional productivity [46][49]

Group 3: Industry Trends
- The article points to a broader industry push toward visual reasoning, led by models such as OpenAI's o3 and o4-mini [68][70]
- Multimodal technology enables integrating visual reasoning into many applications, meeting both industry needs and user demand [72][75]
- The growing prevalence of mixed-media information calls for stronger visual reasoning to improve information processing and understanding [76]
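The zoom/crop/rotate workflow described above is essentially a tool loop: the model proposes an image operation, inspects the result, and iterates. Here is a minimal sketch of such a tool registry using Pillow (hypothetical names; Doubao's internal tooling is not public):

```python
from PIL import Image

# Hypothetical tool registry an image-reasoning agent could call between
# reasoning steps; each tool takes an image and returns a transformed one.
TOOLS = {
    "crop":   lambda img, box: img.crop(box),                     # box = (l, t, r, b)
    "rotate": lambda img, deg: img.rotate(deg, expand=True),
    "zoom":   lambda img, f: img.resize((int(img.width * f), int(img.height * f))),
}

def apply_tool(img: Image.Image, name: str, *args) -> Image.Image:
    return TOOLS[name](img, *args)

# e.g. enlarge a street-sign region before re-running recognition on it
img = Image.open("scene.jpg")
sign = apply_tool(apply_tool(img, "crop", (400, 120, 640, 240)), "zoom", 3.0)
```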
Zhipu Lands Another 1 Billion Yuan in Funding and Releases an Open-Source Model That Can Watch "Su Super League" Matches
Guan Cha Zhe Wang· 2025-07-03 10:30
Core Insights
- The article highlights Zhipu AI's recent advances, particularly the launch of the GLM-4.1V-Thinking visual language model, which strengthens reasoning and supports multimodal inputs including images and videos [1][7][10]
- Zhipu AI has secured a strategic investment of 1 billion yuan to bolster its Shanghai operations and contribute to a supercomputing resource pool known as the "Ten Thousand Card Cluster" [3][5]
- The company is pushing to commercialize its AI models, with sharp growth in daily token usage and revenue signaling rising demand for AI applications across industries [12][14]

Group 1: Product Development
- The GLM-4.1V-Thinking model supports complex cognitive tasks and outperforms larger models on several benchmarks [7][8][10]
- Its capabilities include understanding dynamic video content and performing reasoning tasks, widening its real-world applicability [9][11]
- The lightweight GLM-4.1V-9B-Thinking posts outstanding benchmark scores, showing that smaller models can perform at a high level [8][10]

Group 2: Strategic Investments and Collaborations
- Zhipu AI completed its 16th financing round, raising a total of 1 billion yuan from strategic investors to support its growth in the AI sector [3][5]
- The company is working with Shanghai state-owned enterprises on new AI infrastructure integrating energy, computing power, and AI models [5][6]
- The "Ten Thousand Card Cluster" aims to build a supercomputing resource pool to meet rising demand for AI compute across industries [5][6]

Group 3: Commercialization Efforts
- Daily token usage has grown nearly 30x year on year, with a 52% rise in daily spending, reflecting growing adoption of Zhipu's AI solutions [12][14]
- API prices have been cut substantially, by up to 90% for some models, making AI services more accessible [14][15]
- Zhipu AI is focusing on supplying agent capabilities to businesses, letting them integrate AI without extensive in-house development [15][16]
Large Models Vie Over Visual Reasoning as a New Era of Reasoning AI Dawns
Core Insights
- The article discusses advances in AI visual reasoning, particularly Zhipu's launch of the GLM-4.1V-Thinking model, which combines visual understanding with reasoning ability [1][3][4]
- Competition in the AI industry is intensifying as companies including OpenAI and ByteDance also develop models with visual reasoning capabilities [1][3]
- Potential applications of visual reasoning span education, healthcare, and enterprise services, signaling a shift toward commercial viability [6][7]

Group 1: Model Capabilities
- GLM-4.1V-Thinking supports multimodal inputs, processing images, videos, and documents for complex cognitive tasks [1][3]
- Visual reasoning lets the model understand and extract information from visual elements in documents such as PDFs, improving structured information extraction [3][4]
- The model can handle tasks requiring both visual and textual understanding, such as solving geometry problems and analyzing video content [3][4]

Group 2: Commercialization and Applications
- AI companies aim to turn visual reasoning into digital productivity, targeting B2B clients with agent applications that simplify access to AI capabilities [6][7]
- Combining visual reasoning with tools such as Python data analysis and image generation can solve complex problems and improve user experience [4][6]
- Autonomous intelligent agents are expected to create new business models as AI moves from merely executing commands to actively planning and completing complex tasks [7][8]

Group 3: Future Developments
- The article highlights the potential to embed AI capabilities in smart hardware, moving from cloud-based solutions to edge computing [8][9]
- Future applications are expected to reach robots, cars, smart glasses, and other devices, pointing to broader adoption of AI technologies [9]
OpenAI Launches Full-Strength o3 and o4-mini in the Dead of Night - Still in the Lead.
数字生命卡兹克· 2025-04-16 20:34
At 1 a.m., OpenAI's livestream arrived right on schedule. The teaser had all but spelled it out in advance.

A quick note on this part: don't be put off by the pile of models below, messy as it looks, with all the variants. From the earliest o1 through today's o3 and o4-mini, the core differences come down to model scale, reasoning ability, and access to plugins and tools.

No padding: today's release is o3 and o4-mini.

But Altman, that old trickster, had said earlier that o3 wouldn't ship on its own and would be folded into GPT-5, and yet here it is today...

Starting today, ChatGPT Plus, Pro, and Team users will see o3, o4-mini, and o4-mini-high in the model picker, replacing o1, o3-mini, and o3-mini-high.

Mine has already switched over, but the one I want most, o3 pro, is still a few weeks away, which is a shame; o1 pro has now been folded under "More models."

Honestly, pure progress in model parameters isn't worth much discussion anymore. The two things that strike me as the biggest advances this time:

1. The full-strength o3 can finally use tools.
2. o3 and o4-mini are the newest visual reasoning models in the o series, the first able to think about images within the chain of thought.

As usual, I'll go through these one by one and try to give as complete and thorough a summary as I can.

I. o3 and o4-mini performance

Honestly, there isn't much to it; just like today's gadget scene ...
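For readers who want to try the image-in-the-chain-of-thought behavior themselves, a request with an image input through the official OpenAI Python SDK looks roughly like this. A sketch, not a definitive recipe: model naming and availability depend on your account, and the reasoning over the image happens server-side, not in this client code.

```python
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open("photo.jpg", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

# "o4-mini" is one of the models announced here; swap in whatever
# vision-capable model your account actually exposes.
resp = client.chat.completions.create(
    model="o4-mini",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Where might this photo have been taken? Explain your reasoning."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
        ],
    }],
)
print(resp.choices[0].message.content)
```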