CLIP
Over 1,100 models arrive at the same place, pointing to a "universal subspace": has Plato won again?
机器之心· 2025-12-14 04:53
Model architecture may matter far more than we previously realized. A recent study from Johns Hopkins University found that more than 1,100 different neural networks, even when trained on entirely different datasets with different initializations and hyperparameters, end up with weights that converge to a shared low-dimensional subspace. This seems to suggest that there is an "a priori" mathematical structure that all neural networks are approximating: training does not "create" anything, it "discovers" a geometric form that already exists. In other words, what neural networks "want to learn" appears to be highly consistent, and the architecture determines what they can learn, with a greater influence than the data. (Reported by 机器之心; editor: 张倩.) The finding helps explain many "mysterious" phenomena: why do over-parameterized models (with far more parameters than training samples) still generalize? Why do different initializations end up learning similar representations? Why do techniques such as LoRA and weight sharing work at all? If neural networks really do learn within a shared subspace, this would provide supporting explanations for implicit regularization, transferability, and the effectiveness of sparse training methods, while also opening the door to applications such as efficient model merging, new optimization techniques, and faster, more efficient learning and inference. The paper has drawn wide attention on platforms such as Alphaxiv and X, at one point climbing to the top spot on Alphaxiv. Some say Plato has won again. (Note: Plato's theory of Forms holds that the concrete things we see (tables, horses, circles) are merely instances of the "Forms" ...
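To make the "shared low-dimensional subspace" claim concrete, here is a minimal, hypothetical sketch (not the paper's code) of how one might measure whether a collection of flattened weight vectors is well captured by a handful of principal directions; the `subspace_energy` helper and the toy data are assumptions for illustration.

```python
# Illustrative sketch (not the paper's code): estimate how much of the variance in a
# collection of flattened weight vectors is captured by a low-dimensional subspace.
import numpy as np

def subspace_energy(weight_vectors, k):
    """weight_vectors: (num_models, num_params) array of flattened model weights.
    Returns the fraction of total variance captured by the top-k principal directions."""
    W = np.asarray(weight_vectors, dtype=np.float64)
    W = W - W.mean(axis=0, keepdims=True)              # center across models
    s = np.linalg.svd(W, compute_uv=False)             # singular values = energy per direction
    return float((s[:k] ** 2).sum() / (s ** 2).sum())

# Toy check: 50 "models" whose 10,000-dim weights secretly lie near a 5-dim subspace.
rng = np.random.default_rng(0)
basis = rng.normal(size=(5, 10_000))
weights = rng.normal(size=(50, 5)) @ basis + 0.01 * rng.normal(size=(50, 10_000))
print(subspace_energy(weights, k=5))                   # close to 1.0 -> a shared low-dim subspace
```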
A major breakthrough in long-text retrieval: a new model developed by a China Unicom team boosts accuracy by nearly 20%
Sou Hu Cai Jing· 2025-12-02 20:15
By 姑苏九歌 | Editor: 姑苏九歌. Have you ever tried to search online for a "white Ford F250 pickup with tinted windows and oversized tires", only to get back a pile of ordinary white sedans? The problem is not that your description was unclear; it is that AI struggles with long text descriptions. Current image retrieval models, such as the familiar CLIP, handle simple descriptions reasonably well, but once a query contains multiple attributes they tend to "miss the point". Sometimes the more detailed the description, the lower the matching accuracy, much like losing marks on an exam for writing too much irrelevant material. This is where HiMo-CLIP comes in. Developed by a team at China Unicom's Data Science and Artificial Intelligence Research Institute and presented as an oral paper at AAAI, the new model tackles this long-standing "the more you say, the more it gets wrong" problem. The trick is teaching the AI to "grasp the main point": HiMo-CLIP can automatically identify the key information in a description, much as a person would. The team gave this capability a technical name, the HiDe module, essentially a dynamic semantic fingerprint extraction technique. Concretely, it uses statistical methods to find the most discriminative features among a set of similar descriptions. For a Ford pickup, for example, it automatically discovers that "oversized tires" is more useful for pinpointing the target than "tinted windows". This approach is far more efficient than fixed-template tokenization or manually annotated hierarchies, reaching an accuracy of 89.3%. Even more impressively, ...
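The article does not spell out HiDe's exact statistics, so the following is only a hedged sketch of the general idea of ranking attributes by how discriminative they are within a pool of similar descriptions; an inverse-document-frequency score is used as a stand-in, and `rank_discriminative_terms` and the caption pool are invented for illustration.

```python
# Hedged illustration only: the article does not specify HiDe's exact statistics, so an
# inverse-document-frequency score stands in for "how discriminative is each term among
# a pool of similar descriptions".
import math
from collections import Counter

def rank_discriminative_terms(captions):
    """captions: list of lowercase caption strings describing similar items.
    Returns terms sorted from most to least discriminative (rarer across captions = higher)."""
    docs = [set(c.split()) for c in captions]
    df = Counter(term for doc in docs for term in doc)              # document frequency
    n = len(docs)
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    return sorted(idf, key=idf.get, reverse=True)

pool = [
    "white ford f250 pickup with tinted windows and oversized tires",
    "white ford f250 pickup with tinted windows",
    "white ford pickup with tinted windows",
]
# Terms appearing in fewer captions (e.g. "oversized", "tires") rank ahead of shared ones
# such as "white" or "ford", mirroring the "grasp the main point" behavior described above.
print(rank_discriminative_terms(pool)[:5])
```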
NeurIPS 2025 | In-context meta-learning enables cross-subject brain activity prediction without fine-tuning
机器之心· 2025-11-19 04:07
Core Insights
- The article discusses the development of BraInCoRL, a novel brain encoding model that utilizes meta-learning and in-context learning to predict brain responses from visual stimuli with minimal data requirements [3][32].
- This model addresses the limitations of traditional visual encoding models, which require extensive data collection for each individual, making them costly and difficult to implement in clinical settings [6][32].

Background and Innovation
- The research highlights significant functional differences in the human higher visual cortex among individuals, necessitating the creation of brain encoding models that can effectively represent these differences [2][6].
- BraInCoRL allows for the prediction of brain responses using only a small number of example images and their corresponding brain activity data, eliminating the need for model fine-tuning [3][32].

Methodology
- The BraInCoRL framework treats each voxel as an independent function mapping visual stimuli to neural responses, leveraging meta-learning and in-context learning to enhance data efficiency and generalization [7][10].
- During training, the model learns shared structures of visual cortex responses from multiple subjects, and during testing, it can generate a subject-specific voxel encoder using just a few image-brain response pairs [11][20].

Experimental Results
- BraInCoRL demonstrates high data efficiency, achieving explained variance comparable to models trained on thousands of images while only using 100 context images [20][22].
- The model shows robust performance across different datasets and scanning protocols, confirming its cross-device and cross-protocol generalization capabilities [22][23].
- Semantic clustering visualizations reveal clear functional organization within the visual cortex, with distinct areas for faces, scenes, and other categories [26][27].

Conclusion
- BraInCoRL introduces in-context learning to computational neuroscience, creating a data-efficient, interpretable, and language-interactive framework for visual cortex encoding [32].
- This innovation significantly lowers the barriers for constructing individualized brain encoding models, paving the way for applications in clinical neuroscience and other data-limited scenarios [32].
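As a rough illustration of the test-time interface described above (a handful of image-response pairs in, a subject-specific voxel predictor out, with no gradient-based fine-tuning), here is a hedged sketch in which a closed-form ridge regression stands in for BraInCoRL's meta-trained transformer; all names and the toy data are assumptions.

```python
# Conceptual sketch only: BraInCoRL meta-trains a transformer, but the test-time interface
# is "a few (image feature, voxel response) pairs in, a subject-specific predictor out".
# A closed-form ridge fit stands in here for the learned in-context adaptation.
import numpy as np

def in_context_voxel_predictor(support_feats, support_resps, lam=1.0):
    """support_feats: (n, d) image features; support_resps: (n,) voxel responses.
    Returns a function mapping new image features to predicted responses."""
    X, y = np.asarray(support_feats), np.asarray(support_resps)
    d = X.shape[1]
    w = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)   # ridge regression weights
    return lambda feats: np.asarray(feats) @ w

# Usage with ~100 "context" images for a new subject, no gradient-based fine-tuning.
rng = np.random.default_rng(1)
feats, true_w = rng.normal(size=(100, 512)), rng.normal(size=512)
resps = feats @ true_w + 0.1 * rng.normal(size=100)
predict = in_context_voxel_predictor(feats, resps, lam=10.0)
print(predict(rng.normal(size=(5, 512))).shape)               # (5,) predicted responses
```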
NeurIPS 2025 Spotlight | The University of Hong Kong proposes a ViT dense-representation enhancement method that requires no data labeling
机器之心· 2025-11-19 04:07
Core Insights
- The article discusses the introduction of PH-Reg, a novel method for enhancing Vision Transformers (ViTs) by removing artifacts from dense features without requiring data labeling, thus improving model performance in fine-grained tasks [2][6][19].

Group 1: Methodology
- PH-Reg employs a test-time augmentation denoising strategy to eliminate artifacts from the dense features of teacher models, resulting in a student model that outputs artifact-free dense features [2][11].
- The self-distillation framework of PH-Reg allows for enhancement of the student model architecture with minimal intrusion, focusing updates on specific components while preserving the core information of the pre-trained ViT model [11][20].
- The method is designed to be plug-and-play, requiring no retraining and enabling efficient artifact removal from existing pre-trained models like CLIP and DINOv2 [19][22].

Group 2: Experimental Results
- In semantic segmentation tasks across eight benchmark datasets, PH-Reg outperformed mainstream methods such as MaskCLIP and SCLIP in seven datasets, demonstrating its robustness and effectiveness [13][21].
- Specifically, the method achieved a significant improvement of 5.04% in mean Intersection over Union (mIoU) on the VOC21 dataset and 3.64% on the ADE20K dataset for the CLIP model [21].
- The training time for PH-Reg is reduced by over 58.9% compared to traditional methods, with a total training time of 9000 minutes, significantly less than the 21908 minutes required for DVT [17][22].

Group 3: Advantages
- PH-Reg's core advantage lies in its independence from gradient-based neural field learning, allowing for a single-stage distillation process that minimizes storage requirements and computational resources [22].
- The method can compute all distillation targets in real-time without the need for additional storage space, contrasting with DVT's requirement of 1.4 TB for neural field feature data [22].
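The following is a simplified, hypothetical sketch of the general test-time-augmentation denoising idea referenced above, not PH-Reg's actual implementation: the teacher's dense features are averaged over augmented views (only horizontal flips here, since they are trivial to invert) to form a cleaner distillation target; `tta_denoised_features` and the toy teacher are invented for illustration.

```python
# Simplified sketch of test-time-augmentation denoising (not PH-Reg's implementation):
# average the teacher's dense feature maps over augmented views -- only horizontal flips
# here, since they are trivial to invert -- to obtain a cleaner distillation target.
import torch
import torch.nn.functional as F

@torch.no_grad()
def tta_denoised_features(teacher, image):
    """image: (1, 3, H, W); teacher returns a dense feature map of shape (1, C, h, w)."""
    views = [image, torch.flip(image, dims=[-1])]              # original + horizontal flip
    undo  = [lambda f: f, lambda f: torch.flip(f, dims=[-1])]  # map features back to the original grid
    feats = [u(teacher(v)) for v, u in zip(views, undo)]
    return torch.stack(feats).mean(dim=0)                      # averaged, lower-artifact target

# Toy usage: a stand-in "teacher" that just pools 16x16 patches into a feature map.
toy_teacher = lambda x: F.avg_pool2d(x, kernel_size=16)
target = tta_denoised_features(toy_teacher, torch.randn(1, 3, 224, 224))
print(target.shape)                                            # torch.Size([1, 3, 14, 14])
# The student would then be trained to match such targets, e.g. with an L2 or cosine loss.
```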
360 open-sources FG-CLIP2: topping 29 global benchmark tests
Yang Zi Wan Bao Wang· 2025-11-03 12:17
Core Insights
- The recent launch of 360 Group's open-source visual language alignment model FG-CLIP2 has generated significant attention in the global tech community, marking a breakthrough for China in the AI foundational model sector [1][7]
- FG-CLIP2 outperformed major competitors like Google's SigLIP 2 and Meta's MetaCLIP2 across 29 authoritative benchmark tests, showcasing its advanced capabilities in AI [1][6]

Performance and Innovations
- FG-CLIP2 represents a qualitative leap in fine-grained recognition, addressing long-standing challenges faced by traditional CLIP models in distinguishing subtle object attributes and complex spatial relationships [3][6]
- The model features three fundamental innovations: a hierarchical alignment architecture for macro and micro scene understanding, a dynamic attention mechanism for efficient detail capture, and a bilingual optimization strategy for balanced understanding of Chinese and English [6][7]

Industry Applications
- FG-CLIP2's capabilities extend to various industries, enhancing e-commerce by enabling precise searches based on complex descriptions, thereby improving product recommendation and reducing return rates [7]
- In the field of embodied intelligence, FG-CLIP2 acts as a "smart eye" for robots, allowing them to execute complex tasks in dynamic environments [7]
- The model also supports AIGC content generation, content review, and security monitoring, ensuring accuracy and efficiency across multiple critical scenarios [7]

Strategic Importance
- The open-sourcing of FG-CLIP2 is a strategic move by 360 Group, reinforcing its commitment to building a self-sufficient AI technology ecosystem in China [7]
Oxford VGG, HKU, and SJTU release ELIP: enhanced vision-language pre-training for multimodal image retrieval that surpasses CLIP and others
机器之心· 2025-10-29 11:02
Core Insights
- The article discusses the significance of multimodal image retrieval in computer vision and multimodal machine learning, highlighting the use of large-scale pre-trained models like CLIP and SigLIP for enhanced zero-shot capabilities [2]
- A new method called ELIP (Enhance Language-Image Pre-training) is proposed to improve the performance of visual-language models for text-image retrieval, which has been accepted as a best paper nominee at the IEEE International Conference on Content-Based Multimedia Indexing [2]

Method Overview
- The ELIP method involves an initial ranking of images using traditional CLIP/SigLIP, followed by a re-ranking of the top-k candidates using a simple MLP mapping network that incorporates text features into the image encoder [5]
- ELIP can be applied to various large models, including CLIP, SigLIP, and BLIP-2, referred to as ELIP-C, ELIP-S, ELIP-S-2, and ELIP-B respectively [5]

Challenges in Academic Research
- The article notes that pre-training visual-language models is typically an industrial endeavor, but the proposed method allows for training with limited resources, such as two GPUs [8]

Innovations in Model Architecture
- The architecture innovation involves fixing the weights of large image and text encoders while training only the MLP mapping network, which consists of three layers of linear transformations and GeLU activations [9]
- The training process involves mapping text features to the visual feature space to guide image encoding, using InfoNCE loss for CLIP and Sigmoid loss for SigLIP [9]

Innovations in Training Data
- ELIP addresses the challenge of limited GPU resources by creating hard sample training batches from CLIP feature similarities, enhancing the model's discriminative ability [13]
- The article provides examples of how similar features are grouped to form hard samples for training [15]

New Evaluation Datasets
- In addition to standard datasets like COCO and Flickr, two new out-of-distribution (OOD) datasets, Occluded COCO and ImageNet-R, are introduced to evaluate the model's performance under different conditions [18]

Experimental Results
- The results indicate significant improvements in image retrieval performance for models using ELIP, with ELIP-S achieving a recall@1 of 61.03 on COCO, compared to 54.21 for SigLIP [21]
- ELIP-B applied to BLIP-2 also shows enhanced performance, surpassing the latest Q-Pert method [20]

Attention Mechanism Observations
- The authors observed that ELIP improves the attention of the CLS token towards relevant areas in images when the text query is related, enhancing information extraction [23]
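To make the two-stage pipeline above concrete, here is a hedged sketch of the generic retrieve-then-re-rank pattern ELIP plugs into; the `rescore_fn` placeholder stands in for ELIP's text-conditioned re-scorer (which, per the summary, injects MLP-mapped text features into the image encoder), and all names and toy data are assumptions.

```python
# Hedged sketch of the generic retrieve-then-re-rank pattern ELIP plugs into; the
# `rescore_fn` placeholder stands in for ELIP's text-conditioned re-scorer.
import numpy as np

def retrieve_then_rerank(text_feat, image_feats, rescore_fn, k=100):
    """text_feat: (d,), image_feats: (N, d), both L2-normalized.
    rescore_fn(text_feat, candidate_indices) -> refined scores for the top-k candidates."""
    coarse = image_feats @ text_feat                  # stage 1: CLIP/SigLIP cosine similarity
    topk = np.argsort(-coarse)[:k]                    # keep the k most similar images
    refined = rescore_fn(text_feat, topk)             # stage 2: expensive text-conditioned scoring
    return topk[np.argsort(-refined)]                 # final ranking of the k candidates

# Toy usage with random features and a dummy re-scorer.
rng = np.random.default_rng(2)
imgs = rng.normal(size=(1000, 512)); imgs /= np.linalg.norm(imgs, axis=1, keepdims=True)
txt = rng.normal(size=512); txt /= np.linalg.norm(txt)
dummy_rescore = lambda t, idx: imgs[idx] @ t          # placeholder for the learned re-scorer
print(retrieve_then_rerank(txt, imgs, dummy_rescore, k=10))
```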
NeurIPS 2025 | VFMTok: the era of tokenizers driven by Visual Foundation Models has arrived
机器之心· 2025-10-28 09:37
Core Insights
- The article discusses the potential of using frozen Visual Foundation Models (VFMs) as effective visual tokenizers for autoregressive image generation, highlighting their ability to enhance image reconstruction and generation tasks [3][11][31].

Group 1: Traditional Visual Tokenizers
- Traditional visual tokenizers like VQGAN require training from scratch, leading to a latent space that lacks high-level semantic information and has high redundancy [4][7].
- The organization of the latent space in traditional models is chaotic, resulting in longer training times and the need for additional techniques like Classifier-Free Guidance (CFG) for high-fidelity image generation [7][12].

Group 2: Visual Foundation Models (VFMs)
- Pre-trained VFMs such as CLIP, DINOv2, and SigLIP2 excel in extracting rich semantic and generalizable visual features, primarily used for image content understanding tasks [4][11].
- The hypothesis proposed by the research team is that the latent features from these VFMs can also be utilized for image reconstruction and generation tasks [4][10].

Group 3: VFMTok Architecture
- VFMTok utilizes frozen VFMs to construct high-quality visual tokenizers, employing multi-level feature extraction to capture both low-level details and high-level semantics [14][17].
- The architecture includes a region-adaptive quantization mechanism that improves token efficiency by focusing on consistent patterns within the image [18][19].

Group 4: Experimental Findings
- VFMTok demonstrates superior performance in image reconstruction and autoregressive generation compared to traditional tokenizers, achieving better reconstruction quality with fewer tokens (256) [23][28].
- The convergence speed of autoregressive models during training is significantly improved with VFMTok, outperforming classic models like VQGAN [24][26].

Group 5: CFG-Free Performance
- VFMTok shows consistent performance with or without CFG, indicating strong semantic consistency in its latent space, which allows for high-fidelity class-to-image generation without additional guidance [33].
- The reduction in token count leads to approximately four times faster inference speed during the generation process [33].

Group 6: Future Outlook
- The findings suggest that leveraging the prior knowledge from VFMs is crucial for constructing high-quality latent spaces and developing the next generation of tokenizers [32].
- The potential for a unified tokenizer that is semantically rich and efficient across various generative models is highlighted as a future research direction [32].
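As a minimal, hypothetical sketch of the core step above (discretizing frozen foundation-model features into tokens), the snippet below performs plain nearest-neighbor vector quantization against a codebook; VFMTok's multi-level feature extraction and region-adaptive quantization are omitted, and all names and toy shapes are assumptions.

```python
# Minimal, hypothetical sketch of the core step: discretize frozen foundation-model
# features against a learned codebook via nearest-neighbor vector quantization.
# VFMTok's multi-level features and region-adaptive quantization are omitted here.
import torch

def quantize(features, codebook):
    """features: (num_patches, d) frozen VFM patch features; codebook: (K, d) code vectors.
    Returns (token_ids, quantized_features)."""
    dists = torch.cdist(features, codebook)           # (num_patches, K) pairwise distances
    ids = dists.argmin(dim=1)                         # nearest code index per patch
    return ids, codebook[ids]

feats = torch.randn(256, 768)                         # e.g. 256 patch features from a frozen VFM
codebook = torch.randn(4096, 768)                     # learned codebook (random here)
ids, quant = quantize(feats, codebook)
print(ids.shape, quant.shape)                         # 256 discrete tokens and their embeddings
```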
X @AscendEX
AscendEX· 2025-09-15 10:00
Listing Announcement
- AscendEX lists CLIP under the trading pair CLIP/USDT [1]
- Deposits for CLIP started on September 15, 8:00 AM UTC [1]
- Trading for CLIP started on September 15, 10:00 AM UTC [1]
- Withdrawals for CLIP will start on September 16, 10:00 AM UTC [1]

Trading Information
- CLIP/USDT trading is now live on AscendEX [1]
X @AscendEX
AscendEX· 2025-09-15 07:02
🚀 #AscendEX will list the #CLIP under the trading pair #CLIP/USDT. Details are as follows: ✅ Deposit: September 15, 8:00 AM UTC ✅ Trading: September 15, 10:00 AM UTC ✅ Withdrawal: September 16, 10:00 AM UTC 👀 More Details 👉 https://t.co/VLUup1Fosg 🔗 Trade Now 👉 https://t.co/c6hC9ODBYI 👥 Join our official group 👉 https://t.co/17FuV2k15z #AscendEX #Crypto #CLIP ...
Fei-Fei Li's answer: after large models, where do Agents go next?
虎嗅APP· 2025-09-07 02:51
Core Viewpoint
- The article discusses the emergence of Agent AI, highlighting its potential to revolutionize various fields through a new cognitive architecture that integrates perception, cognition, action, learning, and memory [4][9][10].

Summary by Sections

Introduction to Agent AI
- 2025 is anticipated to be the year of Agent AI, with increasing interest in concepts like AI Agents and Agentic AI [4].
- A significant paper led by Fei-Fei Li titled "Agent AI: Surveying the Horizons of Multimodal Interaction" has sparked widespread discussion in the industry [4][6].

Framework of Agent AI
- The paper establishes a clear framework for Agent AI, integrating various technologies into a unified perspective [6][7].
- It outlines five core modules: Environment and Perception, Cognition, Action, Learning, and Memory, which together form a dynamic cognitive loop [10][12][14][16][17].

Core Modules Explained
- **Environment and Perception**: Agents actively perceive information from their surroundings, incorporating task planning and skill observation [12].
- **Cognition**: Acts as the processing center, utilizing large language models (LLMs) and visual language models (VLMs) for reasoning and strategy formulation [14].
- **Action**: Converts cognitive decisions into executable commands, affecting the environment [15].
- **Learning**: Emphasizes continuous learning through various mechanisms, allowing agents to adapt based on feedback [16].
- **Memory**: Features a structured system for long-term knowledge retention, enabling agents to leverage past experiences [17].

Role of Large Models
- The development of Agent AI is driven by the maturity of foundation models, particularly LLMs and VLMs, which provide agents with extensive knowledge and planning capabilities [20].
- The paper addresses the challenge of "hallucination" in models, emphasizing the importance of environmental interaction to mitigate this issue [21][22].

Application Potential
- The paper explores Agent AI's applications in three key areas:
  - **Gaming**: Agent AI can create dynamic NPCs that interact meaningfully with players, enhancing immersion [24][25].
  - **Robotics**: Robots can execute complex tasks based on natural language commands, improving user interaction [27].
  - **Healthcare**: Agent AI can assist in preliminary diagnostics and patient monitoring, increasing efficiency in healthcare delivery [29][31].

Conclusion
- The paper recognizes that Agent AI is still in its early stages, facing challenges in integrating multiple modalities and creating general agents for diverse applications [32].
- It proposes new evaluation benchmarks to guide the development and measure progress in the field [32].
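Purely as an illustration of the five-module loop described above (perception, cognition, action, learning, memory), here is a hedged, invented skeleton of how such a cognitive cycle might be wired together; none of these interfaces come from the paper.

```python
# Invented skeleton, for illustration only, of the five-module cognitive loop the survey
# describes (perception, cognition, action, learning, memory); no interface here comes
# from the paper itself.
from dataclasses import dataclass, field

@dataclass
class AgentMemory:
    episodes: list = field(default_factory=list)      # long-term store of past interactions

    def recall(self, k=3):
        return self.episodes[-k:]                     # retrieve the most recent experiences

class Agent:
    def __init__(self, perceive, think, act):
        self.perceive, self.think, self.act = perceive, think, act
        self.memory = AgentMemory()

    def step(self, environment):
        observation = self.perceive(environment)                        # perception
        decision = self.think(observation, self.memory.recall())        # cognition (LLM/VLM)
        feedback = self.act(decision, environment)                      # action on the environment
        self.memory.episodes.append((observation, decision, feedback))  # learning via memory
        return feedback
```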