Apple Fires Off Two On-Device AI Models: Model Size Halved, First-Token Latency Cut 85x, Instant Offline Use on iPhone
Apple (US:AAPL) · 36Kr · 2025-09-08 02:42

Core Insights
- Apple has released two new multimodal models, FastVLM and MobileCLIP2, on Hugging Face, both focused on speed and efficiency when processing visual and textual data [1][24]
- FastVLM cuts time-to-first-token to 1/85 that of comparable models, thanks to Apple's in-house FastViTHD vision encoder [2][4]
- MobileCLIP2 is a lightweight model that runs inference directly on an iPhone while maintaining high accuracy and low latency [9][14]

Group 1: FastVLM Model
- FastVLM is engineered for speed, reducing first-token latency to 1/85 that of competing models, which makes real-time subtitle generation practical [1][4]
- The model uses far fewer visual tokens to represent high-resolution inputs, significantly lowering the computational burden while preserving output quality [4][6]
- FastVLM's accuracy holds up across a range of visual-language tasks while keeping latency low [6][8]; a sketch of how first-token latency is measured appears after this summary

Group 2: MobileCLIP2 Model
- MobileCLIP2 is an upgrade of the earlier MobileCLIP, shrinking the model without sacrificing its understanding capabilities [9][14]
- It supports on-device inference on iPhones, improving privacy and speed by removing the need for cloud processing [14]
- On zero-shot ImageNet-1k classification it matches the accuracy of larger models at lower latency [14][24]; a zero-shot classification sketch appears below

Group 3: Developer Integration and Community Engagement
- Apple has published both models and their demos on Hugging Face, so users can try them immediately and developers can integrate them directly [15][19]
- Developers can bring the models into iOS or macOS applications using Core ML and Swift Transformers [17][19]; a Core ML export sketch appears at the end of this note
- The release marks a shift toward practical, on-device use of large models, putting advanced AI capabilities within easy reach [24][25]
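
The 85x figure refers to time-to-first-token (TTFT): how long the model takes to emit its first output token after receiving the prompt and image. The snippet below is a minimal sketch of how TTFT can be measured for any Hugging Face causal model; the tiny GPT-2 checkpoint is only a stand-in so the example runs anywhere, not one of Apple's models.

```python
# Minimal sketch: measuring time-to-first-token (TTFT) for a Hugging Face causal model.
# "sshleifer/tiny-gpt2" is a stand-in checkpoint, not FastVLM.
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "sshleifer/tiny-gpt2"  # tiny stand-in model so the sketch runs anywhere
tok = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID).eval()

prompt = "Describe the photo:"
inputs = tok(prompt, return_tensors="pt")

with torch.no_grad():
    start = time.perf_counter()
    # Generating exactly one new token isolates the prefill plus the first decode step,
    # which is where a long stream of visual tokens would dominate the cost for a VLM.
    model.generate(**inputs, max_new_tokens=1, do_sample=False)
    ttft = time.perf_counter() - start

print(f"time to first token: {ttft * 1000:.1f} ms")
```

Because prefill cost grows with the number of input tokens, shrinking the visual token stream (as FastVLM's encoder is described as doing) directly shrinks this measured latency.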
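Zero-shot classification with a CLIP-family model follows a simple pattern: encode the image and a set of candidate text labels, then pick the label with the highest image-text similarity. The sketch below uses the standard transformers CLIP API with an OpenAI checkpoint as a stand-in; the MobileCLIP2 checkpoints on the Hub would slot into the same workflow, though their exact repository names and loaders should be taken from Apple's model cards.

```python
# Minimal zero-shot classification sketch in the CLIP style described above.
# "openai/clip-vit-base-patch32" is a stand-in checkpoint, not MobileCLIP2 itself.
import requests
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Any RGB image works; this one is a commonly used COCO test picture.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Candidate labels are phrased as short captions, the usual zero-shot prompt format.
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarity scores; softmax turns them into label probabilities.
probs = outputs.logits_per_image.softmax(dim=-1)[0]
for label, p in zip(labels, probs.tolist()):
    print(f"{label}: {p:.3f}")
```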
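For on-device deployment, a common route is to export an encoder to Core ML and call it from Swift, with Swift Transformers handling the text side. Below is a minimal Python sketch of that export step using coremltools, again with an OpenAI CLIP vision tower as a stand-in; the wrapper class, input size, and deployment target are illustrative assumptions, not Apple's published recipe.

```python
# Minimal sketch: exporting an image encoder to Core ML for on-device inference.
# The checkpoint, wrapper, and settings are illustrative, not Apple's conversion recipe.
import coremltools as ct
import torch
from transformers import CLIPVisionModelWithProjection


class ImageEncoder(torch.nn.Module):
    """Wrap the vision tower so tracing yields a plain tensor of image embeddings."""

    def __init__(self, vision_model):
        super().__init__()
        self.vision_model = vision_model

    def forward(self, pixel_values):
        out = self.vision_model(pixel_values=pixel_values)
        return out.image_embeds  # (batch, embed_dim) image embeddings


# Stand-in checkpoint; swap in the checkpoint you actually intend to ship.
vision = CLIPVisionModelWithProjection.from_pretrained("openai/clip-vit-base-patch32").eval()
encoder = ImageEncoder(vision).eval()

example = torch.rand(1, 3, 224, 224)        # this ViT-B/32 tower expects 224x224 RGB input
traced = torch.jit.trace(encoder, example)  # TorchScript graph for conversion

mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(name="pixel_values", shape=example.shape)],
    minimum_deployment_target=ct.target.iOS16,
)
mlmodel.save("ImageEncoder.mlpackage")      # add to an Xcode project for on-device inference
```

The saved .mlpackage can then be dropped into an Xcode project and invoked through the Core ML class Xcode generates for it, which is the kind of Core ML plus Swift Transformers integration the article describes.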