LeCun's JEPA evolves into a vision-language model: 1.6B parameters rival the 72B Qwen-VL
机器之心 · 2025-12-20 07:00
Core Insights
- The article covers advances in the Joint Embedding Predictive Architecture (JEPA) with the introduction of VL-JEPA, a vision-language model developed by a collaborative team from Meta, Hong Kong University of Science and Technology, Sorbonne University, and New York University [2][3].

Group 1: Model Overview
- VL-JEPA is the first non-generative model built on the joint embedding predictive architecture that can perform general-domain vision-language tasks in real time [3].
- Unlike traditional vision-language models (VLMs), which generate tokens autoregressively, VL-JEPA predicts continuous embeddings of the target text, focusing on task-relevant semantics while ignoring superficial variations in wording [4][13].

Group 2: Model Efficiency
- The model replaces expensive token-generation learning with more efficient semantic prediction in latent space, which simplifies the target distribution and eases learning [11][16].
- Because it is non-autoregressive, VL-JEPA can produce a continuous stream of target semantic embeddings at very low latency, which is particularly valuable for real-time applications such as action tracking and scene recognition [17].

Group 3: Performance Comparison
- In a comparative study, VL-JEPA consistently outperformed traditional token-generating VLMs on zero-shot description generation and classification while using roughly half the trainable parameters, indicating higher learning efficiency [20].
- VL-JEPA's selective decoding strategy cut the number of decoding operations by a factor of about 2.85 while preserving overall output quality as measured by average CIDEr scores [22].
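The article does not spell out VL-JEPA's actual training objective (it could be L2, cosine, or a contrastive loss), but the idea of "latent-space semantic prediction" in Group 2 can be illustrated with a minimal toy sketch: instead of a cross-entropy over a large token vocabulary, the loss directly penalizes the distance between a predicted embedding and the target-text embedding. The function name and the cosine-distance choice below are illustrative assumptions, not the paper's method.

```python
import numpy as np

def embedding_prediction_loss(pred, target):
    """Toy latent-space objective (assumed form, not VL-JEPA's actual
    loss): cosine distance between the predicted embedding and the
    target-text embedding, instead of token-level cross-entropy."""
    pred = pred / (np.linalg.norm(pred) + 1e-8)
    target = target / (np.linalg.norm(target) + 1e-8)
    return 1.0 - float(np.dot(pred, target))

# Paraphrases of the same caption map to nearby text embeddings, so
# they incur similar losses -- superficial wording is ignored.
target = np.array([1.0, 0.0])
print(embedding_prediction_loss(np.array([1.0, 0.0]), target))  # ~0.0 (match)
print(embedding_prediction_loss(np.array([0.0, 1.0]), target))  # ~1.0 (orthogonal)
```

One intuition for why this simplifies the target distribution: many token sequences describe the same scene, but they collapse to a small neighborhood in embedding space, so the model predicts one point rather than a distribution over sequences.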
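The summary does not specify how selective decoding decides when to invoke the text decoder. A minimal sketch of one plausible criterion, assuming a cosine-similarity threshold on the embedding stream (the threshold rule, function names, and `decode_fn` placeholder are all hypothetical):

```python
import numpy as np

def cosine_sim(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def selective_decode(embedding_stream, decode_fn, threshold=0.9):
    """Invoke the (expensive) text decoder only when the current
    embedding drifts away from the last one decoded; otherwise reuse
    the previous caption. Assumed criterion, not the paper's."""
    texts, last_decoded, last_text, n_decodes = [], None, None, 0
    for emb in embedding_stream:
        if last_decoded is None or cosine_sim(emb, last_decoded) < threshold:
            last_text = decode_fn(emb)  # expensive decoder call
            last_decoded = emb
            n_decodes += 1
        texts.append(last_text)
    return texts, n_decodes

# Toy stream of 5 frame embeddings; the semantics change only once.
base1, base2 = np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])
stream = [base1, 1.01 * base1, base2, 0.99 * base2, base2]
texts, n = selective_decode(stream, decode_fn=lambda e: f"state-{int(np.argmax(e))}")
print(n)  # 2 decoder calls for 5 frames
```

With near-duplicate embeddings skipped, the decoder runs far less often than once per frame, which is the kind of saving behind the reported ~2.85x reduction in decoding operations.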
Group 4: Training Phases and Results
- VL-JEPA is trained in two phases. The first phase produces VL-JEPA_BASE, which outperformed models such as CLIP and SigLIP2 in average classification accuracy and retrieval recall across eight datasets [23][24].
- The second phase, which uses domain-specific training data, substantially improves classification performance, yielding VL-JEPA_SFT, which approaches the performance of specialized models [25][28].

Group 5: Application and Demonstration
- The article includes demonstrations of VL-JEPA's capabilities, such as real-time robot state tracking, showcasing practical applications across fields [29].
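The comparison with CLIP and SigLIP2 suggests VL-JEPA is evaluated the way embedding models usually are: zero-shot classification by matching a predicted embedding against candidate label embeddings, with no token generation. A minimal sketch of that standard evaluation recipe (the label embeddings below are stand-in vectors, not outputs of any real encoder):

```python
import numpy as np

def normalize(x):
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)

def zero_shot_classify(pred_embedding, label_embeddings, labels):
    """Pick the label whose text embedding has the highest cosine
    similarity with the model's predicted embedding."""
    sims = normalize(label_embeddings) @ normalize(pred_embedding)
    return labels[int(np.argmax(sims))]

labels = ["a cat", "a dog", "a car"]
label_embs = np.eye(3)  # stand-ins for real text-encoder outputs
pred = np.array([0.1, 0.9, 0.2])  # model's predicted target-text embedding
print(zero_shot_classify(pred, label_embs, labels))  # "a dog"
```

Retrieval recall is measured the same way, ranking a gallery of embeddings by similarity instead of picking a single label.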