Vision-Language Models
85x Faster: Apple Open-Sources FastVLM, a Vision-Language Model That Runs Directly on iPhone
机器之心· 2025-05-16 16:31
Core Viewpoint
- Apple has open-sourced FastVLM, an efficient vision-language model that can run directly on iPhone, significantly enhancing on-device visual understanding [2][6].

Group 1: Model Features and Performance
- FastVLM targets model size and speed, delivering time-to-first-token output up to 85x faster than comparable models [6].
- The model uses a new hybrid visual encoder, FastViTHD, which combines convolutional layers and transformer modules, cutting the number of visual tokens needed per image by 16x compared to a traditional ViT and 4x compared to FastViT (a back-of-the-envelope sketch follows this summary) [6][16].
- FastVLM is available in three parameter sizes, 0.5B, 1.5B, and 7B, each with stage-2 and stage-3 fine-tuned weights [7].

Group 2: Technical Innovations
- The research stresses the importance of image resolution for VLM performance, particularly on text- and data-dense tasks, while addressing the cost of high-resolution image processing [12][13].
- FastViTHD is designed specifically to make VLMs efficient on high-resolution images, with significant gains in both accuracy and latency over existing methods [16][33].
- The encoder has five stages and a total parameter count of 125.1M, smaller than most mainstream ViT architectures while remaining competitive in performance [36][37].

Group 3: Efficiency and Optimization
- FastVLM shows a superior accuracy-latency trade-off, outperforming configurations built on ViT and FastViT encoders across a range of conditions [46][47].
- The design allows dynamic input-resolution adjustment, tuning performance to the specific task and hardware [48][49].
- FastVLM also surpasses traditional token-pruning methods, achieving lower visual token counts while maintaining higher accuracy [50][51].
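To make the token-budget arithmetic concrete, the short sketch below estimates how many visual tokens a patch-based encoder emits at a given resolution and what the quoted 16x/4x reductions imply. The patch size and reduction factors are illustrative assumptions mirroring the ratios cited above, not FastVLM's exact tokenization.

```python
# Illustrative back-of-the-envelope: visual token counts at a given input
# resolution for a plain patch-based ViT versus encoders that emit fewer
# tokens. Patch size (14) and the 4x/16x factors mirror the ratios quoted
# in the article; they are assumptions for illustration, not FastVLM's
# actual implementation.

def vit_token_count(resolution: int, patch: int = 14) -> int:
    """Tokens a vanilla ViT produces for a square image of side `resolution`."""
    return (resolution // patch) ** 2

if __name__ == "__main__":
    res = 1024  # a high-resolution input, e.g. a document photo
    vit_tokens = vit_token_count(res)
    fastvit_tokens = vit_tokens // 4      # ~4x fewer than ViT (assumed ratio)
    fastvithd_tokens = vit_tokens // 16   # ~16x fewer than ViT (assumed ratio)
    print(f"ViT       : {vit_tokens} visual tokens")
    print(f"FastViT   : ~{fastvit_tokens} visual tokens")
    print(f"FastViTHD : ~{fastvithd_tokens} visual tokens")
```

Fewer visual tokens entering the language model is what drives the faster time-to-first-token, since prefill cost grows with the total token count.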
365 Days of a Hundred Models Racing Ahead: Hugging Face's Year in Review Traces the VLM Capability Curve and Its Inflection Points | Jinqiu Select
锦秋集· 2025-05-16 15:42
Core Insights
- The article reviews the rapid evolution of vision-language models (VLMs) over the past year and highlights the emergence of smaller yet powerful multimodal architectures, with advances in capabilities such as multimodal reasoning and long-video understanding [1][3].

Group 1: New Model Trends
- "Any-to-any" models can take and produce multiple modalities (images, text, audio) by aligning the different modalities [5][6].
- New models such as Qwen 2.5 Omni and DeepSeek Janus-Pro-7B exemplify the latest multimodal capabilities, enabling seamless input and output across modalities [6][10].
- Smaller, high-performance models ("smol yet capable") are gaining traction, enabling local deployment and lightweight applications [7][15].

Group 2: Reasoning Models
- Reasoning models capable of solving complex problems are emerging in the VLM space; notable examples include Qwen's QVQ-72B-preview and Moonshot AI's Kimi-VL-A3B-Thinking [11][12].
- These models can handle long videos and diverse document types, showcasing advanced reasoning capabilities [14].

Group 3: Multimodal Safety Models
- Multimodal safety models filter inputs and outputs to block harmful content; Google's ShieldGemma 2 is a notable example [31][32].
- Meta's Llama Guard 4 is a dense multimodal safety model that can filter the outputs of vision-language models [34].

Group 4: Multimodal Retrieval-Augmented Generation (RAG)
- Multimodal RAG improves retrieval over complex documents, allowing better integration of visual and textual data [35][38].
- Two main retriever architectures are introduced: DSE-style models and ColBERT-like models, each with a distinct approach to processing documents and returning relevant information (see the sketch after this summary) [42][44].

Group 5: Multimodal Intelligent Agents
- Vision-language-action (VLA) models that interact with physical environments are emerging, with π0 and GR00T N1 showcasing these capabilities [21][22].
- Recent agents such as ByteDance's UI-TARS-1.5 can navigate user interfaces and perform tasks in real time [47][54].

Group 6: Video Language Models
- Video understanding remains challenging; models such as Meta's LongVU and Qwen2.5VL demonstrate advanced capabilities in processing video frames and modeling temporal relationships [55][57].

Group 7: New Benchmarks
- New benchmarks such as MMT-Bench and MMMU-Pro aim to evaluate VLMs across a wide variety of multimodal tasks [66][67][68].
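To contrast the two retrieval styles mentioned under Group 4, here is a minimal scoring sketch: a single-vector (DSE-style) retriever compares one pooled embedding per query and page, while a ColBERT-like retriever keeps per-token embeddings and scores with a MaxSim late-interaction sum. The embeddings are random placeholders; in a real system they would come from the respective retriever models, and the dimensions shown are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(x: np.ndarray) -> np.ndarray:
    # Normalize along the embedding dimension so dot products are cosine scores.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# --- Single-vector (DSE-style): one embedding per query and per document page ---
query_vec = l2_normalize(rng.standard_normal(768))
page_vec = l2_normalize(rng.standard_normal(768))
dse_score = float(query_vec @ page_vec)  # a single cosine similarity

# --- ColBERT-like late interaction: many token-level embeddings per side ---
query_tokens = l2_normalize(rng.standard_normal((16, 128)))    # 16 query tokens
page_patches = l2_normalize(rng.standard_normal((1024, 128)))  # 1024 page patches
sim = query_tokens @ page_patches.T           # (16, 1024) token-to-patch similarities
colbert_score = float(sim.max(axis=1).sum())  # MaxSim per query token, then sum

print(f"Single-vector score   : {dse_score:.3f}")
print(f"Late-interaction score: {colbert_score:.3f}")
```

The design trade-off is the usual one: single-vector retrieval is cheap to index and search, while late interaction keeps fine-grained token-level matching at the cost of storing many vectors per page.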
Apple Releases FastVLM, an Ultra-Fast Vision-Language Model That Runs on iPhone; Kunlun Wanwei Announces the Open-Sourcing of the Matrix-Game Large Model | AIGC Daily
创业邦· 2025-05-13 23:52
1. [Kunlun Wanwei officially open-sources the Matrix-Game large model] On May 13, according to Kunlun Wanwei, the company officially open-sourced the (17B+) Matrix-Game model, the interactive video generation model within its Matrix-Zero world model. Matrix-Game is the Matrix series' first concrete release in interactive world generation and the industry's first open-source 10B+ spatial intelligence model. It is an interactive world foundation model for game-world modeling, designed for high-quality generation and precise control in open-ended environments. (第一财经)
2. [Baixing Intelligence launches China's first vertical agent for the foreign-trade industry] Baixing Intelligence (百型智能) released Zoe, an AI foreign-trade specialist and the first domestic vertical agent for the industry. Zoe can break a company's goals into tasks and independently complete the full customer-acquisition pipeline for foreign trade, from market analysis, prospecting, and precise screening to outreach, conversion, and follow-up, with a reported conversion rate more than 10x that of traditional manual work. (财联社)
3. [Volcano Engine releases the Doubao video generation model Seedance 1.0 lite] Volcano Engine released the Doubao Seedance 1.0 lite video generation model and the Doubao 1.5 visual deep-thinking model, and upgraded the Doubao music model, offering a fuller model matrix and richer agent tools to help enterprises connect their business to agent applications. Officials said the newly released Doubao video generation model ...
ICML 2025 | A New SOTA for Long-Video Understanding: Ant Group & Renmin University Open-Source ViLAMP-7B, Able to Process 3-Hour Videos on a Single GPU
机器之心· 2025-05-12 09:06
The first author of this work is Cheng Chuanqi, a master's student at the Gaoling School of Artificial Intelligence, Renmin University of China, currently interning at Ant Group's technology research institute; his main research area is multimodal large models. Guan Jian, an associate researcher at the institute, is co-first author.

As vision-language models (VLMs) achieve breakthrough progress, the challenge of long-video understanding is becoming ever more important. Taking standard-definition video at 24 fps as an example, just a few minutes of footage produces over a million visual tokens, far beyond the 4K-128K context limits of mainstream large language models. For film-length videos, the shortcomings of conventional solutions are even more pronounced: coarse frame-sampling strategies often drop key frames, while feature-fusion methods reduce data dimensionality but inevitably damage semantic completeness.

Recently, a research team from Ant Group and Renmin University proposed an innovative solution: ViLAMP (Video-Language Model with Mixed Precision), a vision-language model that processes extremely long videos efficiently. The core of the method is its "mixed precision" strategy: key content in the video is analyzed at high precision while secondary content is aggressively compressed, much as a human viewer focuses on key scenes and only skims transitional material (a toy sketch of the idea follows below).

Paper title: Scaling Vi ...
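The token arithmetic behind the "mixed precision" idea can be sketched as follows. This is not ViLAMP's actual algorithm (the excerpt does not detail how keyframes are chosen or how compression is done); it is a toy illustration assuming one keyframe per second at full patch detail and a single pooled token for every other frame.

```python
import numpy as np

def mixed_precision_tokens(num_frames: int,
                           key_stride: int = 24,
                           patch_tokens: int = 256,
                           compressed_tokens: int = 1) -> int:
    """
    Toy token budget for a 'mixed precision' scheme: keyframes (here simply one
    per `key_stride` frames, a stand-in for a real saliency-based selector) keep
    full patch-level detail, while all remaining frames are squeezed to a few
    pooled tokens. Illustration only, not ViLAMP's actual method.
    """
    num_key = int(np.ceil(num_frames / key_stride))
    num_rest = num_frames - num_key
    return num_key * patch_tokens + num_rest * compressed_tokens

if __name__ == "__main__":
    fps, minutes = 24, 180                 # a 3-hour video at 24 fps
    num_frames = fps * 60 * minutes
    uniform = num_frames * 256             # every frame kept at full precision
    mixed = mixed_precision_tokens(num_frames)
    print(f"Full-precision token count : {uniform:,}")
    print(f"Mixed-precision token count: {mixed:,}")
```

Even with these crude assumptions, the budget drops from tens of millions of visual tokens to a few million, which is the kind of reduction needed to bring hours-long video within a language model's context window.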
32B for Local Deployment: Alibaba Open-Sources Its Latest Multimodal Model, Focused on Vision-Language with Strong Mathematical Reasoning
量子位· 2025-03-25 00:59
Core Viewpoint
- The article covers the release of the Qwen2.5-VL-32B-Instruct model by Alibaba's Tongyi Qianwen (Qwen) team, highlighting its performance and capability gains over previous models and competitors.

Group 1: Model Specifications
- The Qwen2.5-VL family previously came in three sizes, 3B, 7B, and 72B; the new 32B version balances size and performance for local deployment [2][3].
- The 32B version has been optimized with reinforcement learning and reaches state-of-the-art (SOTA) pure-text performance at its scale, surpassing even the 72B model on several benchmarks [4].

Group 2: Performance Improvements
- Qwen2.5-VL-32B shows stronger mathematical reasoning, image analysis, content recognition, and visual logic deduction, and gives clearer, more human-like responses [5].
- For example, it can analyze a photo of a traffic sign and correctly compute travel time from distance and speed, laying out its reasoning step by step [5][6].

Group 3: Open Source and Community Engagement
- The model is open-sourced and available for testing on platforms such as Hugging Face, letting users try its capabilities directly (a local-inference sketch follows this summary) [14][15].
- Community uptake has been rapid, with users already running the model and discussing it in various forums, indicating strong interest in its applications [16][17].
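For readers who want to try the 32B checkpoint locally, below is a minimal inference sketch using Hugging Face transformers. It assumes a recent transformers release that ships the Qwen2_5_VLForConditionalGeneration class plus the companion qwen-vl-utils helper package; the image URL and generation settings are placeholders, and a 32B model needs substantial GPU memory (or quantization) to run.

```python
# Minimal local-inference sketch for Qwen2.5-VL-32B-Instruct.
# Assumes: a transformers version with Qwen2.5-VL support and the qwen-vl-utils
# helper package (pip install qwen-vl-utils); adjust dtype/device_map to your hardware.
import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info

model_id = "Qwen/Qwen2.5-VL-32B-Instruct"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# One image plus one question, in the chat format the processor expects.
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "https://example.com/traffic_sign.jpg"},  # placeholder URL
        {"type": "text", "text": "How long will it take to reach the destination at the posted speed?"},
    ],
}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
                   padding=True, return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=256)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```

The same pattern applies to the smaller 3B and 7B checkpoints by swapping the model id, which is the more practical route on consumer GPUs.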
Li Auto Inc. (02015) - Voluntary Announcement: December 2024 Delivery Update
2025-01-01 10:03
Hong Kong Exchanges and Clearing Limited and The Stock Exchange of Hong Kong Limited take no responsibility for the contents of this announcement, make no representation as to its accuracy or completeness, and expressly disclaim any liability whatsoever for any loss howsoever arising from or in reliance upon the whole or any part of the contents of this announcement.

Li Auto Inc. 理想汽車
(incorporated in the Cayman Islands with limited liability and with weighted voting rights)
(Stock Code: 2015)

Voluntary Announcement: December 2024 Delivery Update

Shareholders and potential investors are advised to exercise caution when dealing in the securities of the Company.

By order of the Board
Li Auto Inc.
Li Xiang
Chairman

Hong Kong, January 1, 2025

In December, Li Auto's deliveries reached a new monthly high. Since deliveries began, the Company has, within five years, set the fastest record for a luxury automotive brand in the China market to exceed 500,000 annual deliveries. The Li Xiang Tong Xue app is now open for download by mobile users, creating value for more users. Li Auto's OTA 7.0 in-vehicle system will begin rolling out to users in January, upgrading highway NOA to an end-to-end architecture; with this, Li Auto's fully in-house end-to-end (E2E) + vision-language model (VLM) dual system will bridge city NOA and highway NOA, achieving end-to-end capability across all scenarios. In addition, Li Auto will launch its first intelligent-reasoning visualization feature, letting drivers see how the intelligent system thinks and acts so they can use intelligent driving features with greater confidence.

As of December 2024 ...