SenseTime Open-Sources SenseNova-MARS, Breaking Through the Ceiling of Multimodal Search Reasoning
Cai Jing Wang · 2026-02-03 09:19
Core Insights
- SenseNova-MARS, a multimodal autonomous reasoning model developed and open-sourced by SenseTime, achieved a state-of-the-art (SOTA) score of 69.74 on core benchmarks, surpassing Gemini-3-Pro (69.06) and GPT-5.2 (67.64) [1][2]
- The model is the first to support dynamic visual reasoning deeply integrated with text-image search, enabling it to autonomously plan steps and invoke tools for complex tasks, demonstrating genuine "execution capability" [1][6]

Performance Metrics
- On the MMSearch benchmark, SenseNova-MARS scored 74.27, outperforming GPT-5.2 (66.08) [4]
- On the HR-MMSearch evaluation, it scored 54.43, significantly widening its lead over proprietary models [4]
- Average scores across benchmarks [5]:
  - MMSearch: 74.27
  - HR-MMSearch: 54.43
  - FVQA-test: 72.61
  - SimpleVQA: 65.25
  - LiveVQA: 74.14

Application and Functionality
- SenseNova-MARS autonomously handles complex tasks that require multi-step reasoning combined with multi-tool collaboration, such as identifying minute details in images and retrieving related information [7][15]
- It can, for example, recognize a small logo on a racing suit, query that company's founding year, and match it against driver birthdates, all without human intervention [9]
- The model can also analyze images from industry events to gather information about companies and products, supporting industry analysis [10][12]

Training Methodology
- Training proceeds in two phases:
  - Phase one builds foundational skills, using a data synthesis engine to construct complex multi-hop reasoning chains while ensuring logical consistency [16]
  - Phase two applies reinforcement learning so the model accumulates experience and develops an intuitive sense of tool usage [17]
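The "multi-step reasoning + multi-tool collaboration" workflow described above (spot a visual detail, retrieve a fact about it, cross-match the result) can be sketched as a simple tool-chaining loop. This is a minimal illustrative sketch only: the tool names, task, and returned values below are invented stand-ins, not the actual SenseNova-MARS tools or outputs.

```python
# Hypothetical sketch of a multi-step, multi-tool agent workflow.
# All tools here are stubs standing in for real image/text search back-ends.

def recognize_logo(image: str) -> str:
    """Stub for the visual step: spot a small logo in an image."""
    return "AcmeRacing"  # invented example value

def search_founding_year(company: str) -> int:
    """Stub for the retrieval step: look up a company's founding year."""
    return 1987  # invented example value

def match_driver_birthdate(company: str, year: int) -> str:
    """Stub for the cross-matching step: relate the retrieved year back
    to the original question."""
    return f"driver born in {year}, sponsored by {company}"

def run_agent(image: str) -> str:
    """Chain tool calls the way the article describes: identify a detail,
    retrieve a fact about it, then combine both into a final answer."""
    steps: list[str] = []
    company = recognize_logo(image)                 # step 1: visual detail
    steps.append(f"logo -> {company}")
    year = search_founding_year(company)            # step 2: text retrieval
    steps.append(f"founded -> {year}")
    answer = match_driver_birthdate(company, year)  # step 3: cross-match
    steps.append(f"answer -> {answer}")
    return "; ".join(steps)

print(run_agent("racing_suit.jpg"))
# → logo -> AcmeRacing; founded -> 1987; answer -> driver born in 1987, sponsored by AcmeRacing
```

In the real model, each stub would be replaced by an actual tool call (image search, web search), and the fixed step order would be chosen dynamically by the model's planner rather than hard-coded.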