CICC | AI智道 (9): Multimodal Reasoning Breakthroughs Extend to In-Vehicle Scenarios
CICC Research (中金点睛) · 2025-06-02 23:45

Core Insights
- The article highlights multimodal reasoning as a key direction for large model technology iteration in 2025, with Google leading the way and multiple domestic breakthroughs reported [2]
- Maturing multimodal reasoning capabilities are expected to expand application scenarios, particularly intelligent driving and human-like reasoning processes [3]

Summary by Sections

Multimodal Reasoning Developments
- Google released the Gemini 2.5 model in March 2025, supporting text, image, audio, video, and code inputs and enabling multimodal fusion reasoning [2]
- Domestic players have also made significant advances: StepFun (阶跃星辰) released Step-R1-V-Mini, and SenseTime's SenseNova V6 achieved a breakthrough in long-video understanding [2]

Technical Innovations
- MiniMax introduced the V-Triune framework, which unifies visual reasoning and perception tasks within a single reinforcement learning framework and shows initial validation of scalability and generalization [3]
- V-Triune comprises three components: sample-level multimodal data formatting, an asynchronous client-server architecture for reward computation, and data source-level monitoring for training stability [3]

Applications in Intelligent Driving
- Multimodal reasoning is becoming a focal point for leading intelligent driving companies, improving capabilities such as road sign recognition and generalization to complex scenes [3]
- NIO's world model NVM, launched on May 30, 2025, delivers notable gains in real-time environment understanding and in decision-making for optimal lane selection and autonomous navigation [3]
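To make the V-Triune design concrete: the article describes an asynchronous client-server split in which rollout workers submit samples and a separate service computes task-specific rewards (rule-based answer checking for reasoning tasks, overlap metrics such as IoU for perception tasks). MiniMax's actual implementation is not given in the source; the following is a minimal asyncio sketch of that decoupled reward-server pattern, with all function and field names (`reasoning_reward`, `perception_reward`, `sample["task"]`, etc.) being illustrative assumptions rather than V-Triune's API.

```python
import asyncio

# Hypothetical reward functions for the two task families V-Triune is said to
# unify: reasoning rewards check the final answer; perception rewards score
# bounding-box overlap (IoU). Names and schemas are illustrative only.
def reasoning_reward(sample):
    return 1.0 if sample["prediction"] == sample["answer"] else 0.0

def perception_reward(sample):
    # IoU between predicted and ground-truth boxes, each (x1, y1, x2, y2).
    (ax1, ay1, ax2, ay2) = sample["pred_box"]
    (bx1, by1, bx2, by2) = sample["gt_box"]
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

REWARD_FNS = {"reasoning": reasoning_reward, "perception": perception_reward}

async def reward_server(requests: asyncio.Queue, results: asyncio.Queue):
    # Server loop: pull rollout samples, dispatch to the task-specific reward
    # function, and push (sample_id, reward) back to the training client.
    while True:
        sample = await requests.get()
        if sample is None:  # shutdown signal
            break
        reward = REWARD_FNS[sample["task"]](sample)
        await results.put((sample["id"], reward))

async def main():
    requests, results = asyncio.Queue(), asyncio.Queue()
    server = asyncio.create_task(reward_server(requests, results))
    # Client side: a mixed batch spanning both task types.
    batch = [
        {"id": 0, "task": "reasoning", "prediction": "42", "answer": "42"},
        {"id": 1, "task": "perception",
         "pred_box": (0, 0, 2, 2), "gt_box": (1, 1, 3, 3)},
    ]
    for s in batch:
        await requests.put(s)
    await requests.put(None)
    await server
    rewards = {}
    while not results.empty():
        sid, r = results.get_nowait()
        rewards[sid] = r
    return rewards

rewards = asyncio.run(main())
```

The point of the split is that reward computation (which may involve heavy verifiers or vision metrics) runs off the trainer's critical path and can be scaled or monitored per data source independently, which is consistent with the third component the article lists.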
