Multimodal Large Models
SenseTime (00020) and Cambricon (688256.SH) Achieve Successful Day-0 Adaptation of a Multimodal Large Model, Energizing Innovation in Frontier AI Applications
智通财经网· 2025-12-16 11:22
Core Insights
- The collaboration between SenseTime and Cambricon marks a significant milestone in the development of "domestic chips + domestic models," with the Seko series model successfully adapted on the same day it was released [1][2]
- The partnership aims to strengthen the domestic AI application ecosystem, making advanced multimodal AI capabilities more accessible and cost-effective for developers and enterprises [1][3]

Group 1
- Cambricon's Day-0 adaptation of SenseTime's "Riri Xin" model demonstrates the rapid response capability of domestic chips in supporting local AI vendors, indicating strong collaboration within the domestic AI ecosystem [2]
- The Seko series models, including SekoIDX and SekoTalk, form the core technology base of the Seko 2.0 intelligent agent, showing the domestic AI ecosystem extending into more complex multimodal generation fields [2]

Group 2
- The partnership focuses on building a more efficient and user-friendly tiered product system, leveraging SenseTime's LightX2V framework, designed for rapid adaptation to a variety of domestic hardware [3]
- Innovations such as low-bit quantization, compressed communication, and sparse attention mechanisms have improved inference performance by more than three times, raising the models' overall efficiency and resource utilization [3]
- Future collaboration will involve deeper optimization to lower the barriers to using multimodal AI and improve the overall user experience [3]
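Low-bit quantization, one of the optimizations credited above with the 3x inference speedup, can be illustrated with a minimal sketch. The symmetric per-tensor int8 scheme below is a generic textbook illustration of the idea, not the LightX2V framework's actual method.

```python
# Minimal sketch of low-bit weight quantization: symmetric int8 with a
# single per-tensor scale. Illustrative only, not LightX2V's real scheme.

def quantize_int8(weights):
    """Map float weights to int8-range codes plus a scale for dequantization."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from codes and scale."""
    return [v * scale for v in q]

w = [0.5, -1.27, 0.02, 1.0]
q, s = quantize_int8(w)
w_hat = dequantize(q, s)
print(q)
# Round-trip error is bounded by one quantization step.
print(max(abs(a - b) for a, b in zip(w, w_hat)) <= s)
```

Storing 8-bit codes instead of 32-bit floats cuts weight memory by 4x, which is where much of the inference speedup of such schemes typically comes from.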
SenseTime Goes on the Offensive, on the Front Line of "AI Localization"
远川研究所· 2025-12-15 13:08
If the hottest AI topic at the start of 2025 was DeepSeek R1, then the recent blockbuster Moore Threads IPO has capped off the end of 2025. Together, the two bookend a year of Chinese AI innovation and send a forceful signal to the industry and the world: China's artificial intelligence, from underlying compute to upper-layer models, is capable of shedding its dependence on overseas technology and achieving fully domestic, independent development if it needs to.

This is still only taking shape. Take AI chips as an example: the practice of Huawei, SenseTime, and many other companies has shown that although domestic chips still trail oligopolies such as NVIDIA in performance, they can nevertheless be used for training and inference of large AI models through an ecosystem built on hardware-software co-design.

Moore Threads listed on December 5. As the "first domestic GPU stock," its share price surged more than 400% on its first trading day, and within five days its market capitalization had soared to roughly RMB 450 billion, more than seven times its valuation at issuance. Even though its market share and technical sophistication still lag NVIDIA's, this striking IPO performance shows the market voting with its feet in favor of China's independent technological innovation.

That said, DeepSeek represents domestic AI's independent innovation at the model layer, not systemic localization spanning both the compute-cluster layer and the model layer. Earlier media reports indicated that the compute clusters DeepSeek-R1 used internally for training and inference were mainly ...
Hands-On Testing of Qwen3-VL in Autonomous Driving Scenarios ...
自动驾驶之心· 2025-12-12 07:35
In recent years, the potential of multimodal large models in autonomous driving has become increasingly apparent. Whether they can truly "read" road conditions, understand traffic behavior, and even predict risk has become a focus of attention inside and outside the industry.

The author ran a series of autonomous-driving scenario tests on Alibaba Tongyi's latest Qwen3-VL model, covering scene understanding, spatial reasoning, behavior judgment, risk prediction, and other dimensions.

In the author's view, Qwen3-VL is not only solid on basic perception tasks, but also shows surprising "veteran driver" instincts in open-ended reasoning and dynamic scene understanding.

More importantly, it has undergone no autonomous-driving-specific instruction fine-tuning (SFT), yet it makes reasonable, coherent, even "safety-conscious" judgments about complex traffic scenes, pointing to broader possibilities for deploying general vision-language models in vertical domains.

The tests used a subset of images from the CoVLA benchmark, some of the benchmark's questions translated into Chinese, and a number of open-ended questions composed by the author. For more technical analysis, industry news, and peer discussion on autonomous driving, readers are invited to join the Autonomous Driving Heart (自动驾驶之心) Knowledge Planet, a self-driving community of more than 4,000 members ...

Scene understanding and spatial reasoning. Example 1: Briefly describe this image. What is the weather in the image? Where is the vehicle driving ...
Automakers Collectively Cross Into Smart Terminals, Seeking an Ecosystem Breakthrough in the Battle for AI Entry Points
Core Insights
- The launch of Li Auto's AI glasses, Livis, signals a shift in the competitive landscape of the new energy vehicle (NEV) industry, expanding from standalone automotive products to a comprehensive smart ecosystem [1][2]
- Major Chinese NEV companies, including NIO and Xpeng, are diversifying into smart wearables and digital products to break out of homogenized competition and capture user engagement in the AI era [1][2]

Group 1: Product Features and Market Strategy
- Livis features a lightweight 36-gram design, 18.8 hours of battery life, and deep integration with vehicle systems, enhancing user interaction and experience [1][2]
- The glasses are not standalone hardware but part of a "car + glasses" ecosystem, which increases user stickiness by turning customers from car buyers into participants in a smart lifestyle [2][3]

Group 2: Industry Trends and Competitive Landscape
- The global smart glasses market is growing rapidly, with shipments up 64.2% year on year and NEV-related products contributing 15% of that growth [3]
- Li Auto's Livis took more than 12,000 orders on its first day of sales, 80% of them from existing Li Auto owners, indicating a shift toward ecosystem competition spanning multiple smart terminals [3]

Group 3: Business Model and Challenges
- Domestic NEV companies face longer profit cycles in their smart device ventures than international competitors such as Tesla, which has successfully integrated robotics into its business model [4]
- Smart device development at companies like Xpeng and Li Auto requires substantial investment and innovative business models to secure long-term cash flow and profitability [4][5]

Group 4: Consumer Behavior and Market Potential
- A significant 72% of Chinese smart device users are willing to pay a premium for cross-terminal services, versus only 45% in Western markets, indicating strong potential for innovative cross-device solutions [4]
- Extending smart cockpit ecosystems to personal devices is seen as a necessary response to industry homogenization and a way to secure a foothold in the AI-driven market [5]
Nanjing University, LibLib.ai, and the CAS Institute of Automation Jointly Propose PosterCopilot, a "Poster Design Large Model" for Layout Reasoning and Precise Editing
机器之心· 2025-12-10 08:13
Core Viewpoint
- The article discusses the development of PosterCopilot, a professional-grade poster design and editing model that addresses major challenges in graphic design automation, particularly layout reasoning and controllable editing [2][6][40]

Industry Pain Points
- Graphic design remains hard to truly automate; existing models such as Stable Diffusion struggle with layered structures, leading to material distortion and a lack of fine control [6]
- Current multimodal models exhibit four critical shortcomings: severe element overlap, lack of visual feedback, regression to a single ground truth, and inability to perform layer-specific edits [8][10]

Core Achievements
- PosterCopilot aims to bridge the gap between single-step generation and professional workflows through a systematic solution built on a three-stage training strategy [13][14]
- The three stages are:
  1. Perturbation Supervised Fine-Tuning (PSFT) to address geometric distortions [15]
  2. Visual-Reality Alignment Reinforcement Learning (RL-VRA) to correct overlaps and proportion issues [15]
  3. Aesthetic Feedback Reinforcement Learning (RLAF) to encourage exploration beyond ground-truth layouts [15]

Generative Agent
- PosterCopilot functions as a comprehensive design assistant, moving smoothly from abstract design concepts to concrete materials through a reception model and a T2I model [16][17]
- The model supports a range of professional scenarios, including full poster generation from provided assets, intelligent completion of missing materials, global theme transitions, intelligent size reconstruction, and multi-round fine-grained editing [21][23][28][29][31]

Experimental Results
- PosterCopilot outperforms existing commercial competitors and state-of-the-art models across multiple metrics, achieving an average win rate above 74% in human evaluations [34][35]
- In assessments of layout rationality, text legibility, and element preservation, it outperforms models such as Microsoft Designer and CreatiPoster [35][37]

Conclusion and Outlook
- By decoupling layout reasoning from generative editing and using reinforcement learning to align with human aesthetics, PosterCopilot sets a new benchmark for intelligent design tools and offers a new paradigm for AI-assisted creative workflows [40]
Zhipu Launches and Open-Sources the GLM-4.6V Series of Multimodal Large Models, Building Native Multimodal Tool-Calling Capability
Zheng Quan Ri Bao Wang· 2025-12-09 10:46
(Reporter Liang Aonan) On December 8, Beijing Zhipu Huazhang Technology Co., Ltd. ("Zhipu") officially launched and open-sourced the GLM-4.6V series of multimodal large models, comprising the base GLM-4.6V (106B-A12B) for cloud and high-performance cluster scenarios and the lightweight GLM-4.6V-Flash (9B) for local deployment and low-latency applications.

Zhipu said: "Zhipu's multimodal open-source week has begun, and we will keep open-sourcing more frontier models. Embrace the new paradigm of multimodal interaction, starting with GLM-4.6V."

Traditional tool calling is mostly text-based. When faced with images, video, or complex documents, it requires multiple intermediate conversions, which introduce information loss and engineering complexity.

GLM-4.6V was designed from the outset around "images as parameters, results as context," building native multimodal tool-calling capability: images, screenshots, and document pages can be passed directly as tool parameters, with no need to first convert them into text descriptions and parse them back, reducing losses along the pipeline. For tool results such as statistical charts, rendered webpage screenshots, or retrieved product images, the model can apply visual understanding again and fold them into the subsequent reasoning chain.

The model natively supports tool calls driven by visual input, closing the loop from perception to understanding to execution. This lets GLM-4.6V handle more complex visual tasks such as mixed image-and-text output, product recognition with deal recommendation, and assistant-style agent scenarios.

According to the company, GLM-4.6 ...
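The "images as parameters, results as context" loop described above can be sketched in a few lines. Everything here is hypothetical: the message schema, the `crop_tool` function, and the simulated call are illustrative stand-ins, not GLM-4.6V's real API.

```python
# Hypothetical sketch of native multimodal tool calling: an image is passed
# directly as a tool parameter, and the tool's image result goes straight
# back into context. Schema and tool names are illustrative assumptions.
import base64

def encode_image(raw_bytes: bytes) -> dict:
    """Wrap raw image bytes as a multimodal message part."""
    return {"type": "image", "data": base64.b64encode(raw_bytes).decode()}

def crop_tool(image_part: dict, box: tuple) -> dict:
    """A toy tool that takes an image part as a parameter and returns
    another image part as its result (here the pixels are just echoed)."""
    return {"type": "image", "data": image_part["data"], "crop": box}

# A user turn mixing text and an image, with no text round-trip for the image.
history = [{"role": "user", "content": [
    {"type": "text", "text": "What does the top-left corner show?"},
    encode_image(b"\x89PNG...fake bytes"),
]}]

# The model would emit a structured call; we simulate one here.
call = {"tool": "crop_tool", "args": {"image_part": history[0]["content"][1],
                                      "box": (0, 0, 100, 100)}}
result = crop_tool(**call["args"])

# The tool's image result re-enters context, ready for further visual
# understanding on the next turn.
history.append({"role": "tool", "content": [result]})
print(len(history), result["crop"])
```

The point of the pattern is that no step converts the image to a text description and back, which is exactly the pipeline loss the article says native multimodal tool calling avoids.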
Are Full Images and Slices Not Equivalent? LLaVA-UHD-v3 Reveals the Difference and Offers an Efficient Whole-Image Modeling Scheme
机器之心· 2025-12-09 03:17
As multimodal large models (MLLMs) demonstrate powerful understanding and interaction across vision-language tasks, efficiently processing native high-resolution images to capture fine-grained visual information has become a key direction for improving model performance.

Mainstream visual encoding paradigms, however, struggle to balance performance and efficiency: slice-based encoding lowers compute cost but sacrifices global context awareness, while global native-resolution encoding improves overall performance at an enormous computational cost. Meanwhile, existing visual compression strategies are largely decoupled from feature extraction, making it difficult to control information redundancy early in encoding; a unified architecture that combines fine-grained modeling with computational efficiency has been missing.

Targeting the core question of how to preserve global image understanding at native high resolution while still enabling fast inference, a research team from Tsinghua University and the Chinese Academy of Sciences has officially released LLaVA-UHD v3.

LLaVA-UHD-v3 proposes a new progressive visual compression framework, Progressive Visual Compression (PVC), built from two core components: Refined Patch Embedding (RPE) and Windowed Token Compression (WTC). While preserving global semantic consistency, the framework significantly reduces the number of visual tokens, fundamentally improving the efficiency of native high-resolution visual encoding. ...
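The windowed token compression idea can be illustrated with a toy sketch: pool visual tokens within non-overlapping local windows so the token count shrinks while each output token still summarizes a local neighborhood. The window size and mean pooling below are illustrative assumptions, not LLaVA-UHD-v3's exact WTC design.

```python
# Toy sketch of windowed token compression: average-pool tokens within
# non-overlapping win x win windows on the token grid. Illustrative only.

def window_compress(tokens, grid_h, grid_w, win=2):
    """tokens: feature vectors laid out row-major on a grid_h x grid_w grid.
    Returns one mean-pooled token per win x win window."""
    dim = len(tokens[0])
    out = []
    for wy in range(0, grid_h, win):
        for wx in range(0, grid_w, win):
            acc = [0.0] * dim
            n = 0
            for y in range(wy, min(wy + win, grid_h)):
                for x in range(wx, min(wx + win, grid_w)):
                    vec = tokens[y * grid_w + x]
                    acc = [a + v for a, v in zip(acc, vec)]
                    n += 1
            out.append([a / n for a in acc])
    return out

# A 4x4 grid of 1-D tokens compresses to 2x2 = 4 tokens (4x fewer).
grid = [[float(i)] for i in range(16)]
compressed = window_compress(grid, 4, 4, win=2)
print(len(compressed))
```

With a 2x2 window the visual token count drops by 4x, which is the kind of reduction that makes native high-resolution encoding tractable.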
Zhipu Launches and Open-Sources the GLM-4.6V Series of Multimodal Large Models
Bei Jing Shang Bao· 2025-12-08 12:34
Beijing Business Daily (reporter Wei Wei): On December 8, Zhipu officially launched and open-sourced the GLM-4.6V series of multimodal large models, comprising the base GLM-4.6V (106B-A12B) for cloud and high-performance cluster scenarios and the lightweight GLM-4.6V-Flash (9B) for local deployment and low-latency applications.

According to the company, GLM-4.6V raises the training-time context window to 128k tokens, achieves SOTA visual-understanding accuracy at its parameter scale, and for the first time natively integrates Function Call (tool calling) into the vision model architecture, linking "visual perception" to "executable action" and providing a unified technical foundation for multimodal agents (智能体) in real business scenarios. The series is priced 50% lower than GLM-4.5V, with API calls at RMB 1 per million input tokens and RMB 3 per million output tokens; GLM-4.6V-Flash is free to use. GLM-4.6V is also included in the GLM Coding Plan, with dedicated MCP (Model Context Protocol) tools developed for 8 classes of user scenarios. ...
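The published prices above make per-call cost a one-line calculation. The helper below just encodes those two rates; the example token counts are arbitrary.

```python
# Cost check using the GLM-4.6V API prices quoted above:
# RMB 1 per million input tokens, RMB 3 per million output tokens.

def api_cost_rmb(input_tokens: int, output_tokens: int) -> float:
    """Cost in RMB for one call at the quoted per-million-token rates."""
    return input_tokens / 1e6 * 1.0 + output_tokens / 1e6 * 3.0

# e.g. a 100k-token input with a 10k-token response:
print(api_cost_rmb(100_000, 10_000))
```

So even a long-context call (100k tokens in, 10k out) costs on the order of RMB 0.13 at these rates, which is the accessibility point the 50% price cut is aimed at.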
The Tech-Obsessed "Whampoa Academy" of Autonomous Driving Has Shipped More Technical Updates ...
自动驾驶之心· 2025-12-07 02:05
Over the past month, the Autonomous Driving Heart (自动驾驶之心) Knowledge Planet has again added a great deal of technical content, summarized here for readers.

The Knowledge Planet is a community we maintain and update continuously. If you would like to exchange ideas with leading figures in autonomous-driving academia and industry, you are welcome to join: we discuss technology, trends, and change, and will keep inviting academic and industry peers to talk with members. A generous new-member discount is available ...

A community with enough substance to beat the grind: for many newcomers hoping to enter the field, the cost of trial and error is high. A lack of time and the absence of a complete learning system are the biggest problems, which also drives industry barriers ever higher and makes it even harder to stay competitive. So we joined forces with many leaders in academia and industry to build the "Autonomous Driving Heart Knowledge Planet," which we have maintained for three years. The planet combines video, articles, learning roadmaps, Q&A, and job-hunting exchange in one comprehensive self-driving community of more than 4,000 members, which we hope to grow to nearly 10,000 within two years: a gathering place for exchange and technical sharing frequented by many beginners and advancing learners alike. If you also want to help push the self-driving field forward, you are welcome to join our community team. We have prepared ...
Right After Ilya's Prediction, NEO Arrives: The World's First Native Multimodal Architecture Welds Vision and Language Together
36Kr · 2025-12-05 07:06
Core Insights
- The AI industry is undergoing a paradigm shift as experts such as Ilya Sutskever declare that the era of merely scaling models is over, emphasizing smarter architectures rather than simply larger models [1][26]
- A new native multimodal architecture called NEO has emerged from a Chinese research team, aiming to fundamentally disrupt the current modular approach to AI models [1][5]

Group 1: Current State of Multimodal Models
- Traditional multimodal models, such as GPT-4V and Claude 3.5, mainly take a modular approach that bolts pre-trained visual encoders onto language models, so visual and language processing are never deeply integrated [3][6]
- These modular models face three significant technical gaps, in efficiency, capability, and fusion, which hold back their performance on complex tasks [6][7][8]

Group 2: NEO's Innovations
- NEO introduces a unified model that integrates visual and language processing from the ground up, eliminating the separation between visual and language modules [8][24]
- The architecture features three core innovations: Native Patch Embedding, Native-RoPE for spatial encoding, and Native Multi-Head Attention, which strengthen the model's ability to understand and process multimodal information [11][14][16]

Group 3: Performance Metrics
- NEO is remarkably data-efficient, matching or exceeding leading models while training on only 3.9 million image-text pairs, one-tenth of what other top models require [19][20]
- Across benchmark tests, NEO has outperformed other native vision-language models on multiple tasks [21][22]

Group 4: Implications for the Industry
- NEO's architecture not only improves performance but also lowers the barrier to deploying multimodal AI on edge devices, making advanced visual understanding accessible beyond cloud-based models [23][24]
- The open-sourcing of NEO's architecture signals a shift in the AI community toward more efficient, unified models, potentially setting a new standard for multimodal technology [24][25]
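The "native patch embedding" idea that distinguishes unified architectures from modular ones can be sketched simply: raw image patches are linearly projected into the same embedding space as text tokens, so one sequence (and one attention stack) serves both modalities. All shapes, the projection, and the stand-in text tokens below are illustrative assumptions, not NEO's actual implementation.

```python
# Minimal sketch of native patch embedding: image patches are projected
# directly into the text-token embedding space, forming one mixed sequence
# with no separate vision encoder. Illustrative only.
import random

random.seed(0)
DIM = 8  # shared embedding dimension for text and vision

def patchify(image, patch=2):
    """Split an HxW grayscale image (list of lists) into flattened patches."""
    h, w = len(image), len(image[0])
    patches = []
    for py in range(0, h, patch):
        for px in range(0, w, patch):
            patches.append([image[y][x]
                            for y in range(py, py + patch)
                            for x in range(px, px + patch)])
    return patches

# One shared linear projection: 4 patch pixels -> DIM-dim embedding.
proj = [[random.uniform(-0.1, 0.1) for _ in range(DIM)] for _ in range(4)]

def embed_patch(p):
    return [sum(p[i] * proj[i][d] for i in range(len(p))) for d in range(DIM)]

image = [[float(x + y) for x in range(4)] for y in range(4)]
text_embeddings = [[0.0] * DIM for _ in range(3)]   # stand-in text tokens
vision_embeddings = [embed_patch(p) for p in patchify(image)]

# One mixed sequence: the same transformer would attend over all of it.
sequence = text_embeddings + vision_embeddings
print(len(sequence))
```

Because patches enter the model as ordinary sequence elements rather than through a frozen external encoder, the attention layers can fuse visual and textual information from the very first layer, which is the integration gap the modular approach leaves open.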