Visual Language Models
CICC "Research Reports in Seconds" | Intelligent Driving: A New Era of Mobility Transformation
中金点睛· 2025-05-24 08:32
Group 1: Core Viewpoints
- The article discusses the rapid development and potential of intelligent driving technology, highlighting its transformative impact on urban mobility and the automotive industry [1][2][3].

Group 2: Technology Engine Behind Intelligent Driving
- The end-to-end architecture is a significant innovation in intelligent driving, reducing data annotation difficulty and optimizing data processing through unique algorithms, which enhances vehicle responsiveness to road conditions [2][3].
- The introduction of visual language models and cloud models improves the system's ability to handle complex scenarios, akin to equipping vehicles with sharper "eyes" [3].

Group 3: Current Development of Intelligent Driving
- The highway Navigation on Autopilot (NOA) feature is expected to scale up in 2024, becoming standard on intelligent driving vehicles priced above 200,000 yuan [5].
- The penetration rate of urban NOA is projected to reach 6.5% in 2024, driven by increased consumer acceptance and reduced costs, expanding its availability to more consumers [7].

Group 4: Business Model of Intelligent Driving
- L2++ intelligent driving software faces challenges in charging fees due to low consumer willingness to pay, leading most automakers to make the systems standard in order to accumulate users and data [11].
- Some leading automakers are exploring buyout or subscription payment models, with promotional activities to attract customers [11][12].

Group 5: Benefits of Urban NOA
- Urban NOA is expected to drive sales of high-spec, high-margin models, as consumers are likely to prefer higher-end vehicles once the technology gains market acceptance [13][14].
- The overlap in technology requirements between Robotaxi and urban NOA is anticipated to enhance intelligent driving system capabilities, potentially leading to a shift toward mobility services by 2025 [15].

Group 6: Globalization of the Intelligent Driving Industry
- China's late start in intelligent driving is offset by rapid development, with domestic companies gaining advantages in technology and production experience that position them favorably in the global market [16].
- Collaborations between joint-venture automakers and domestic intelligent driving companies are expected to open access to international projects and opportunities for global expansion [16][17].
The Race and Hidden Battle in Intelligent Assisted Driving: Self-Development vs Partnership Camps, with Widening Divergence in Capability
Bei Ke Cai Jing· 2025-05-22 10:37
Core Insights
- The article discusses the advancements and competitive landscape of the assisted driving industry, highlighting various companies' self-developed systems and strategies [1][4].

Group 1: Company Developments
- Li Auto has launched its new-generation dual-system intelligent driving solution, focusing on upgrading driving capabilities and synchronizing updates across its smart electric vehicles [3].
- NIO's intelligent assisted driving system has reportedly avoided over 3.5 million collision risks, accumulating roughly 4.94 billion kilometers of total driving mileage as of May 15, 2025 [3].
- Chery's Hawk 500 has brought assisted driving features to a wide range of models, with the Hawk 700 targeting mid-to-high-end models and the Hawk 900 positioned as the flagship [3].
- GAC Group's GSD intelligent driving assistance system has accumulated 5 million user driving scenarios and over 40 million kilometers of high-level autonomous driving data [3].

Group 2: Industry Trends
- BYD and XPeng are recognized as leaders in self-developed intelligent driving systems, with BYD's high-end system named "Tianshen Eye" [4].
- Bosch's China president has expressed skepticism about the self-development model, suggesting that mid-level intelligent driving should become standard and that costs could be better managed through supply chain partnerships [4].
- Huawei is positioned as a top player in the intelligent driving system market, with 10 brands from 7 automakers planning to adopt its solutions, potentially exceeding 500,000 vehicles [4][5].
- Huawei's collaboration models include component supply, Huawei Inside (HI) partnerships, and deep cooperation with automakers, the last being the most integrated approach [5].

Group 3: Strategic Partnerships
- SAIC Group has publicly stated its intention to maintain control over core technologies while also choosing to collaborate with Huawei [6].
- The partnerships with Huawei have boosted sales for collaborating automakers, but questions remain about their ability to independently develop high-quality vehicles [6].
85x Faster: Apple Open-Sources FastVLM, a Visual Language Model That Runs Directly on iPhone
机器之心· 2025-05-16 16:31
FastVLM gives the iPhone near-instant visual understanding. When you snap a photo on your iPhone and ask an AI "What is this?", a model like FastVLM is quietly decoding the image behind the scenes. Apple recently open-sourced FastVLM (Fast Vision Language Model), an efficient visual language model that runs directly on iPhone.

Code: https://github.com/apple/ml-fastvlm

The repository also includes an iOS/macOS demo app built on the MLX framework, with performance optimized for Apple devices. Judging from the demo, the response speed really is "Fast"; that is exactly what sets FastVLM apart.

Compared with conventional models, FastVLM concentrates on two problems, model size and speed: relative to comparable models, its time to first token is up to 85x faster.

The model introduces a new hybrid vision encoder, FastViTHD, which combines convolutional layers with Transformer blocks and, through multi-scale pooling and downsampling, cuts the number of "visual tokens" needed for image processing to a very low level: 16x fewer than a conventional ViT and 4x fewer than FastViT. With its outstanding speed and ...
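The token-count claim can be made concrete with simple grid arithmetic: a vision encoder emits one token per patch, so downsampling the patch grid shrinks the token count quadratically. The sketch below is illustrative only; the image resolution, patch size, and per-side downsample factor are assumptions chosen so the numbers land on the reported 16x reduction, not FastVLM's published configuration.

```python
# Illustrative sketch: how pooling/downsampling shrinks the visual-token count.
# All concrete numbers (resolution, patch size, factor) are hypothetical.

def vit_token_count(image_size: int, patch_size: int) -> int:
    """Tokens a plain ViT emits: one per non-overlapping patch."""
    per_side = image_size // patch_size
    return per_side * per_side

def downsampled_token_count(image_size: int, patch_size: int, factor: int) -> int:
    """Tokens left after downsampling the patch grid by `factor` per side."""
    per_side = (image_size // patch_size) // factor
    return per_side * per_side

base = vit_token_count(1024, 16)                 # 64 x 64 = 4096 tokens
reduced = downsampled_token_count(1024, 16, 4)   # 16 x 16 = 256 tokens
print(base, reduced, base // reduced)            # 4096 256 16
```

A factor of 4 per side yields 16x fewer tokens overall, which is the shape of the reduction the article describes; fewer visual tokens means less work for the language model and a faster first token.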
365 Days of a Hundred Competing Models: Hugging Face's Annual Review Charts the VLM Capability Curve and Its Inflection Points | Jinqiu Select
锦秋集· 2025-05-16 15:42
Core Insights
- The article discusses the rapid evolution of visual language models (VLMs) and highlights the emergence of smaller yet powerful multimodal architectures, showcasing advancements in capabilities such as multimodal reasoning and long-video understanding [1][3].

Group 1: New Model Trends
- The article introduces the concept of "any-to-any" models, which can input and output various modalities (images, text, audio) by aligning different modalities [5][6].
- New models like Qwen 2.5 Omni and DeepSeek Janus-Pro-7B exemplify the latest advancements in multimodal capabilities, enabling seamless input and output across different modalities [6][10].
- The trend of smaller, high-performance models ("smol yet capable") is gaining traction, promoting local deployment and lightweight applications [7][15].

Group 2: Reasoning Models
- Reasoning models are emerging in the VLM space, capable of solving complex problems, with notable examples including Qwen's QVQ-72B-preview and Moonshot AI's Kimi-VL-A3B-Thinking [11][12].
- These models are designed to handle long videos and various document types, showcasing their advanced reasoning capabilities [14].

Group 3: Multimodal Safety Models
- The need for multimodal safety models is emphasized: they filter inputs and outputs to prevent harmful content, with Google launching ShieldGemma 2 as a notable example [31][32].
- Meta's Llama Guard 4 is highlighted as a dense multimodal safety model that can filter outputs from visual language models [34].

Group 4: Multimodal Retrieval-Augmented Generation (RAG)
- The development of multimodal RAG is discussed, which enhances the retrieval process for complex documents, allowing for better integration of visual and textual data [35][38].
- Two main architectures for multimodal retrieval are introduced: DSE models and ColBERT-like models, each with a distinct approach to processing and returning relevant information [42][44].
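The ColBERT-like retrievers mentioned under Group 4 score a query against a document by late interaction: each query token embedding is matched to its most similar document token embedding, and those per-token maxima are summed (the MaxSim operator). A minimal sketch of that scoring rule with made-up toy vectors (the embedding values are hypothetical, for illustration only):

```python
def dot(a, b):
    """Plain dot product between two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def maxsim_score(query_embs, doc_embs):
    """ColBERT-style late interaction: for each query token embedding,
    take its best dot-product match among the document's token
    embeddings, then sum those maxima into a single relevance score."""
    return sum(max(dot(q, d) for d in doc_embs) for q in query_embs)

# Toy two-dimensional embeddings (hypothetical values).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[0.9, 0.1], [0.2, 0.8]]   # matches both query tokens well
doc_b = [[0.1, 0.1], [0.0, 0.2]]   # matches neither well

print(round(maxsim_score(query, doc_a), 2))  # 1.7
print(round(maxsim_score(query, doc_b), 2))  # 0.3
```

Because document token embeddings can be precomputed offline, only the cheap MaxSim step runs at query time, which is what makes the late-interaction design attractive for retrieval.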
Group 5: Multimodal Intelligent Agents
- The article highlights the emergence of vision-language-action (VLA) models that can interact with physical environments, with examples like π0 and GR00T N1 showcasing their capabilities [21][22].
- Recent advancements in intelligent agents, such as ByteDance's UI-TARS-1.5, demonstrate the ability to navigate user interfaces and perform tasks in real time [47][54].

Group 6: Video Language Models
- The challenges of video understanding are addressed, with models like Meta's LongVU and Qwen2.5VL demonstrating advanced capabilities in processing video frames and understanding temporal relationships [55][57].

Group 7: New Benchmark Testing
- The article discusses the emergence of new benchmarks like MMT-Bench and MMMU-Pro, aimed at evaluating VLMs across a variety of multimodal tasks [66][67][68].
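One common way video language models keep long videos within a limited context budget is to encode only a fixed number of frames sampled across the clip, rather than every frame. The sketch below shows generic uniform frame sampling; it illustrates the idea only and is not the pipeline of any specific model named above.

```python
def sample_frame_indices(total_frames: int, num_samples: int) -> list[int]:
    """Pick `num_samples` frame indices spread uniformly across a video,
    taking the midpoint of each equal-length segment. A generic scheme
    for fitting long videos into a VLM's bounded context window."""
    if num_samples >= total_frames:
        return list(range(total_frames))
    step = total_frames / num_samples
    return [int(step * i + step / 2) for i in range(num_samples)]

# A 300-frame clip reduced to 6 representative frames:
print(sample_frame_indices(300, 6))  # [25, 75, 125, 175, 225, 275]
```

Segment midpoints avoid biasing the sample toward the start of the clip; models that reason about temporal relationships typically also feed the chosen timestamps to the language model alongside the frames.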
Apple Releases FastVLM, an Ultra-Fast Visual Language Model That Runs on iPhone; Kunlun Tech Announces the Open-Sourcing of the Matrix-Game Large Model | AIGC Daily
创业邦· 2025-05-13 23:52
1. [Kunlun Tech officially open-sources the Matrix-Game large model] On May 13, Kunlun Tech announced that it has officially open-sourced the (17B+) Matrix-Game large model, the interactive video generation model within the Matrix-Zero world model. Matrix-Game is the Matrix series' first concrete release in interactive world generation and the industry's first open-source 10B+ spatial intelligence model: an interactive world foundation model for game-world modeling, designed for high-quality generation and precise control in open-ended environments. (Yicai)

2. [Baixing Intelligent launches China's first vertical Agent for the foreign-trade industry] Baixing Intelligent has launched Zoe, an AI foreign-trade sales agent and China's first vertical Agent for the foreign-trade industry. Zoe reportedly decomposes tasks from a company's goals and independently completes the entire foreign-trade customer-acquisition pipeline, from market analysis, prospecting, and precise screening to outreach and conversion follow-up, with a conversion rate more than 10x that of traditional manual work. (Cailian Press)

3. [Volcengine releases the Doubao video generation model Seedance 1.0 lite] Volcengine has released the Doubao Seedance 1.0 lite video generation model and the Doubao 1.5 visual deep-thinking model, and upgraded the Doubao music model, aiming to help enterprises connect their business to agent applications with a fuller model matrix and richer agent tools. According to the official statement, the newly released Doubao video generation model ...
32B, Deployable Locally! Alibaba Open-Sources Its Latest Multimodal Model: Vision-Language Focused, with Strong Mathematical Reasoning
量子位· 2025-03-25 00:59
Xifeng | QbitAI

On the very night of the DeepSeek-V3 update, Alibaba's Tongyi Qianwen (Qwen) team pulled off yet another well-timed release: Qwen2.5-VL-32B-Instruct.

The open-source Qwen2.5-VL family of visual language models previously came in three sizes: 3B, 7B, and 72B. The new 32B version balances size against performance and can run locally. Optimized with reinforcement learning, it shows marked improvements in three areas:

- responses better aligned with human preferences;
- stronger mathematical reasoning;
- higher accuracy and finer-grained analysis in image parsing, content recognition, and visual logical inference.

Compared with recently open-sourced peers such as Mistral-Small-3.1-24B and Gemma-3-27B-IT, Qwen2.5-VL-32B also achieves state-of-the-art text-only performance at its scale, and on several benchmarks it even surpasses the 72B model.

For example, given a photo of a traffic sign, Qwen2.5-VL-32B can perform fine-grained image understanding and reasoning such as:

"I am driving a large truck on this road, and it is now 12:00. Can I reach a place 110 km away before 13:00?"

Qwen2.5-VL-32B first analyzes the time, the distance, and the truck's speed limit, then works out the correct answer step by step ...
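The truck question reduces to simple rate arithmetic: covering 110 km between 12:00 and 13:00 requires an average speed of at least 110 km/h, so the answer turns on the speed limit the model reads off the sign. A minimal sketch of that check (the 100 km/h limit below is a hypothetical placeholder; the excerpt does not state the sign's actual value):

```python
def can_arrive(distance_km: float, hours_available: float,
               speed_limit_kmh: float) -> bool:
    """True if the distance can be covered in the available time
    without exceeding the speed limit."""
    required_speed = distance_km / hours_available  # average km/h needed
    return required_speed <= speed_limit_kmh

# 110 km between 12:00 and 13:00 -> at least 110 km/h required on average.
# The truck's limit here is a hypothetical 100 km/h for illustration.
print(can_arrive(110, 1.0, 100))  # False: 110 km/h needed exceeds the limit
```

The point of the benchmark example is not the arithmetic itself but that the model extracts the distance and limit from the image and chains them into this check.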
Li Auto (02015) - Voluntary Announcement: December 2024 Delivery Update
2025-01-01 10:03
Hong Kong Exchanges and Clearing Limited and The Stock Exchange of Hong Kong Limited take no responsibility for the contents of this announcement, make no representation as to its accuracy or completeness, and expressly disclaim any liability whatsoever for any loss howsoever arising from or in reliance upon the whole or any part of the contents of this announcement.

Li Auto Inc.
理想汽車
(A company incorporated in the Cayman Islands with limited liability and controlled through weighted voting rights)
(Stock code: 2015)

Voluntary Announcement
December 2024 Delivery Update

In December, Li Auto's deliveries reached a new monthly high. Since deliveries began, the Company has, within five years, set the fastest record for a luxury automotive brand to surpass 500,000 annual deliveries in the China market. The Li Xiang Tong Xue app is now open for download to mobile users, creating value for more users. Li Auto's OTA 7.0 in-vehicle system will begin rolling out to users in January; highway NOA is upgraded to an end-to-end architecture, so Li Auto's fully in-house end-to-end (E2E) + visual language model (VLM) dual system will bridge city NOA and highway NOA, achieving full-scenario end-to-end capability. In addition, Li Auto will introduce its first intelligent-reasoning visualization feature, letting drivers understand how the intelligent system thinks and acts so they can use assisted driving with greater confidence.

As of December 2024 ...

Shareholders and potential investors of the Company are advised to exercise caution when dealing in the securities of the Company.

By order of the Board
Li Auto Inc.
Li Xiang
Chairman

Hong Kong, January 1, 2025