MobileCLIP2

Apple's On-Device AI Double Release: Model Size Halved, First-Token Latency Cut 85x, Instant Offline Use on iPhone
36Kr · 2025-09-08 02:42
Apple has dropped a bombshell on Hugging Face. This time it is shipping two multimodal lines at once: FastVLM is all about speed, producing captions almost instantly, while MobileCLIP2 is all about being lightweight, running comfortably even on an iPhone. Better still, the models and demos are fully open, and you can try them straight from a Safari web page. Large models are genuinely running on phones now.

Just now, Apple opened the floodgates on Hugging Face: this is not a piecemeal update but a joint debut of the FastVLM and MobileCLIP2 multimodal lines. One is built for speed, cutting time-to-first-token to 1/85 that of competing models; the other is built for lightness, halving model size while keeping accuracy on par with SigLIP. Real-time camera captions, offline recognition and translation, and semantic photo-library search are all there to try. More importantly, the models and demos are already open, covering everything from research to application to deployment in one step.

Real-time captions: multimodality without the lag. Why is FastVLM so fast? Because it switched to Apple's own FastViTHD encoder. Traditional multimodal models either sacrifice resolution or get bogged down at inference time by thousands of visual tokens. FastViTHD, through dynamic scaling and a hybrid design, lets the model see high-resolution images clearly while keeping latency extremely low.

FastViT vs. FastViTHD performance comparison: the green curve sits further toward the upper left overall, meaning that at the same scale it is both faster and ...
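To make the visual-token argument concrete, here is a back-of-the-envelope sketch. The patch size and downsampling factor are illustrative assumptions of mine, not FastViTHD's published configuration; the point is only that a plain ViT-style encoder emits thousands of tokens for a high-resolution image, while an encoder that downsamples its feature map before handing tokens to the language model emits far fewer.

```python
# Illustrative arithmetic only: the patch size and downsampling factor below are
# assumptions, not FastViTHD's published configuration.

def vit_token_count(image_size: int, patch_size: int) -> int:
    """Number of patch tokens a plain ViT-style encoder produces."""
    return (image_size // patch_size) ** 2

def downsampled_token_count(image_size: int, patch_size: int, downsample: int) -> int:
    """Token count if the final feature map is pooled by `downsample` per axis
    before tokens are passed to the language model (the hybrid-encoder idea)."""
    side = image_size // patch_size // downsample
    return side * side

if __name__ == "__main__":
    for size in (336, 672, 1024):
        plain = vit_token_count(size, patch_size=14)
        reduced = downsampled_token_count(size, patch_size=14, downsample=4)
        print(f"{size}x{size}px: plain ViT ~{plain} tokens, downsampled hybrid ~{reduced} tokens")
```

Since the language model's prefill cost, and hence time-to-first-token, grows with the number of visual tokens it must attend over, cutting that count by an order of magnitude is where most of the claimed latency win would come from.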
After a Year of Silence, Apple Finally Shows Its AI Hand
Huxiu APP · 2025-09-05 13:56
The following article comes from 直面AI (author: 涯角). 直面AI focuses on frontier technology and getting an early look at the future. This article is from the WeChat public account 直面AI, author: 涯角, editor: 胡润; original title: "While the Whole World Races Toward Cloud-Based Large Models, Apple Chooses to Return to the Device"; header image: AI-generated.

A few days ago, Apple fully open-sourced the vision-language models FastVLM and MobileCLIP2 on HuggingFace, once again sending shockwaves through the AI community. The most striking characteristic of these two models comes down to one word: fast. FastVLM responds up to 85 times faster than comparable models on some tasks, and it runs smoothly on personal devices such as the iPhone. But this is not an isolated technical showcase. Together with open models such as MobileCLIP2, FastVLM forms the core of Apple's "Plan B": an on-device small-model AI strategy.

Apple draws its sword with small models. To explain FastVLM in the plainest terms: it is a multimodal model that can understand images and read text, and there are two key points, one being the "Fast" in its name, the other being the "VLM". As the name suggests, FastVLM's most eye-catching trait is speed. This is not a simple performance bump but an order-of-magnitude leap, letting phones and computers handle in real time tasks that previously required cloud servers. Thus, as a VLM (vision-language ...
After a Year of Silence, Apple Finally Shows Its AI Hand
Hu Xiu · 2025-09-04 14:21
Core Viewpoint

Apple has open-sourced its visual language models FastVLM and MobileCLIP2 on HuggingFace, a significant move in the AI community that centers on an edge-AI small-model strategy.

Group 1: FastVLM Features and Performance

- FastVLM is characterized by its speed, being 85 times faster than similar models in certain tasks and capable of running smoothly on personal devices like iPhones [2][6].
- The model's architecture includes a new hybrid visual encoder, FastViTHD, which reduces the number of tokens generated from high-resolution images, improving processing speed without sacrificing accuracy [7][9].
- FastVLM is available in multiple sizes, including 0.5B, 1.5B, and 7B, and can perform real-time tasks without cloud services, such as live browser subtitles (a hedged loading sketch follows this summary) [13][14].

Group 2: Apple's AI Strategy

- Apple's "Plan B" focuses on small models for edge AI, contrasting with the industry trend toward large cloud-based models [3][40].
- The company has faced criticism for its slow progress in AI compared to competitors like Google and Microsoft, but it is now responding with significant investments and a clear strategy [36][39].
- Apple's approach emphasizes user privacy and seamless integration of hardware and software, which aligns with its core business model [43][49].

Group 3: Market Context and Implications

- Interest in small models is rising across the industry, with various companies exploring their potential for specific vertical markets [54].
- Apple's focus on small models is seen as a strategic necessity to maintain its competitive edge and preserve user trust in privacy [50][56].
- The company's efforts in developing small models are positioned as a way to leverage its hardware capabilities while addressing the challenges posed by larger AI models [51][56].
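For readers who want to experiment with a FastVLM checkpoint, the sketch below shows the generic Hugging Face transformers pattern for vision-language captioning. The model ID "apple/FastVLM-0.5B" and the assumption that the checkpoint works with AutoProcessor / AutoModelForVision2Seq (possibly requiring trust_remote_code) are mine, not stated in the article; check the model card for the actual loading recipe.

```python
# Hedged sketch: the model ID and the AutoModelForVision2Seq interface are assumptions;
# consult the Hugging Face model card for the supported loading code.
from PIL import Image
import torch
from transformers import AutoProcessor, AutoModelForVision2Seq

model_id = "apple/FastVLM-0.5B"  # assumed checkpoint name
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForVision2Seq.from_pretrained(model_id, trust_remote_code=True)

image = Image.open("photo.jpg")
prompt = "Describe this image in one sentence."
inputs = processor(images=image, text=prompt, return_tensors="pt")

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=64)

print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```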
Apple's Latest Model Runs on a Five-Year-Old iPhone
36Kr · 2025-09-01 11:37
Core Insights

- Apple has made a significant advance in model development with the new multimodal foundation model MobileCLIP2, which features a multi-modal reinforced training mechanism [1][12].
- The model is designed for zero-shot classification and retrieval tasks, with inference latency ranging from 3 to 15 milliseconds and parameter counts between 50 million and 1.5 billion [1][3].

Model Performance

- MobileCLIP2-B achieves a 2.2% improvement in zero-shot accuracy on the ImageNet-1k dataset compared to its predecessor [1][11].
- The MobileCLIP2-S4 variant matches the zero-shot accuracy of the larger SigLIP-SO400M/14 model with only half the parameter count [4][6].

Training Mechanism

- The improved training mechanism integrates stronger teacher supervision and caption data to boost zero-shot performance [2][9].
- This mechanism allows multimodal models to be deployed directly on mobile and edge devices with low latency and memory usage [2][8].

Open Source and Developer Support

- Pre-trained weights for all model variants and the data generation code have been made publicly available, enabling developers to deploy and benchmark directly [2][12]; a hedged usage sketch follows this summary.
- The data generation code supports distributed, scalable processing, so developers can create customized datasets for further research and rapid prototyping [8][12].

Technical Details

- The training mechanism distills knowledge from multiple sources into a smaller model, broadening semantic coverage while reducing computational overhead during training and inference [9][10].
- The integration of teacher models and caption generation is organized in a two-phase protocol, significantly improving the model's ability to describe image content [11][12].
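The article describes MobileCLIP2 as a zero-shot classification and retrieval model with published weights. As an illustration of how such a CLIP-style checkpoint is typically used, here is a minimal sketch following the open_clip interface; the model name "MobileCLIP2-S4" and its pretrained tag are assumptions on my part, so check the open_clip registry or Apple's ml-mobileclip repository for the exact identifiers.

```python
# Hedged sketch of CLIP-style zero-shot classification. The model/pretrained names
# below are assumptions; substitute the identifiers the released weights actually use.
import torch
import open_clip
from PIL import Image

model_name, pretrained = "MobileCLIP2-S4", "dfndr2b"  # assumed identifiers
model, _, preprocess = open_clip.create_model_and_transforms(model_name, pretrained=pretrained)
tokenizer = open_clip.get_tokenizer(model_name)
model.eval()

labels = ["a photo of a cat", "a photo of a dog", "a photo of a bicycle"]
image = preprocess(Image.open("photo.jpg")).unsqueeze(0)   # [1, 3, H, W]
text = tokenizer(labels)                                   # [num_labels, context_len]

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # Cosine similarity between the image and each candidate caption
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.3f}")
```

Retrieval works the same way: embed a gallery of images (or captions) once, then rank them by cosine similarity against the query embedding.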