CLIP
X @AscendEX
AscendEX· 2025-09-15 10:00
🚀 #CLIP/USDT is LIVE! 🎉🟢 Trading for #CLIP has officially started. 💼 Dive into the action now! Trade here 👉 https://t.co/ae4IimpXb6 #AscendEX #CLIP

AscendEX (@AscendEX_): 🚀 #AscendEX will list the #CLIP under the trading pair #CLIP/USDT. Details are as follows: ✅ Deposit: September 15, 8:00 AM UTC ✅ Trading: September 15, 10:00 AM UTC ✅ Withdrawal: September 16, 10:00 AM UTC 👀 More Details 👉 https://t.co/Xy3VWTdoSg 🔗 Trade Now 👉 https://t.co/BHJVwkQRmx ...
X @AscendEX
AscendEX· 2025-09-15 07:02
🚀 #AscendEX will list the #CLIP under the trading pair #CLIP/USDT. Details are as follows: ✅ Deposit: September 15, 8:00 AM UTC ✅ Trading: September 15, 10:00 AM UTC ✅ Withdrawal: September 16, 10:00 AM UTC 👀 More Details 👉 https://t.co/VLUup1Fosg 🔗 Trade Now 👉 https://t.co/c6hC9ODBYI 👥 Join our official group 👉 https://t.co/17FuV2k15z #AscendEX #Crypto #CLIP ...
Fei-Fei Li's Answer: After Large Models, Where Do Agents Go?
虎嗅APP· 2025-09-07 02:51
Core Viewpoint
- The article discusses the emergence of Agent AI, highlighting its potential to revolutionize various fields through a new cognitive architecture that integrates perception, cognition, action, learning, and memory [4][9][10].

Summary by Sections

Introduction to Agent AI
- 2025 is anticipated to be the year of Agent AI, with increasing interest in concepts like AI Agents and Agentic AI [4].
- A significant paper led by Fei-Fei Li, titled "Agent AI: Surveying the Horizons of Multimodal Interaction," has sparked widespread discussion in the industry [4][6].

Framework of Agent AI
- The paper establishes a clear framework for Agent AI, integrating various technologies into a unified perspective [6][7].
- It outlines five core modules: Environment and Perception, Cognition, Action, Learning, and Memory, which together form a dynamic cognitive loop [10][12][14][16][17].

Core Modules Explained
- **Environment and Perception**: Agents actively perceive information from their surroundings, incorporating task planning and skill observation [12].
- **Cognition**: Acts as the processing center, utilizing large language models (LLMs) and visual language models (VLMs) for reasoning and strategy formulation [14].
- **Action**: Converts cognitive decisions into executable commands, affecting the environment [15].
- **Learning**: Emphasizes continuous learning through various mechanisms, allowing agents to adapt based on feedback [16].
- **Memory**: Features a structured system for long-term knowledge retention, enabling agents to leverage past experiences [17].

Role of Large Models
- The development of Agent AI is driven by the maturity of foundation models, particularly LLMs and VLMs, which provide agents with extensive knowledge and planning capabilities [20].
- The paper addresses the challenge of "hallucination" in models, emphasizing the importance of environmental interaction in mitigating this issue [21][22].

Application Potential
- The paper explores Agent AI's applications in three key areas:
  - **Gaming**: Agent AI can create dynamic NPCs that interact meaningfully with players, enhancing immersion [24][25].
  - **Robotics**: Robots can execute complex tasks based on natural language commands, improving user interaction [27].
  - **Healthcare**: Agent AI can assist in preliminary diagnostics and patient monitoring, increasing efficiency in healthcare delivery [29][31].

Conclusion
- The paper recognizes that Agent AI is still in its early stages, facing challenges in integrating multiple modalities and creating general agents for diverse applications [32].
- It proposes new evaluation benchmarks to guide the development and measure progress in the field [32].
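To make the five-module loop concrete, the sketch below wires perception, cognition, action, learning, and memory into a single step function. All class and method names (Agent, Memory, observe, execute) are assumptions introduced for illustration; the paper describes conceptual modules, not a code API.

```python
# A minimal, illustrative sketch of the perception-cognition-action-learning-
# memory loop described in the paper. Names are assumptions made for this
# sketch; the paper defines conceptual modules, not code.

from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class Memory:
    """Structured, persistent store of past episodes the agent can draw on."""
    episodes: list = field(default_factory=list)

    def recall(self, k: int = 3) -> list:
        return self.episodes[-k:]  # naive recency-based retrieval

    def store(self, observation: Any, plan: str, feedback: Any) -> None:
        self.episodes.append({"obs": observation, "plan": plan, "feedback": feedback})


class Agent:
    def __init__(self, cognition: Callable[[str], str]):
        # `cognition` stands in for an LLM/VLM call mapping context to a plan.
        self.cognition = cognition
        self.memory = Memory()

    def step(self, environment) -> Any:
        observation = environment.observe()             # Environment & Perception
        context = f"obs={observation}; recent={self.memory.recall()}"
        plan = self.cognition(context)                  # Cognition
        feedback = environment.execute(plan)            # Action
        self.memory.store(observation, plan, feedback)  # Learning & Memory (placeholder)
        return feedback


# Toy usage with a stub environment and a rule-based stand-in for the LLM.
class EchoEnvironment:
    def observe(self) -> str:
        return "door is closed"

    def execute(self, plan: str) -> str:
        return f"executed: {plan}"


agent = Agent(cognition=lambda ctx: "open the door" if "closed" in ctx else "wait")
print(agent.step(EchoEnvironment()))  # -> executed: open the door
```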
Fei-Fei Li's Answer: After Large Models, Where Do Agents Go?
创业邦· 2025-09-05 11:12
Core Insights
- The article discusses a significant paper led by Fei-Fei Li that establishes a clear framework for the emerging field of Agent AI, outlining its capabilities and potential applications [5][6][9].
- The paper presents a comprehensive cognitive architecture for Agent AI, consisting of five core modules: Environment and Perception, Cognition, Action, Learning, and Memory, which together form a dynamic and iterative closed-loop system [11][12][18].

Summary by Sections

Agent AI Framework
- The new Agent AI paradigm is not merely a combination of existing technologies but represents a forward-thinking approach to the development of Artificial General Intelligence (AGI) [12].
- The framework integrates various technological strands, including dialogue models, visual-language models, and reinforcement learning, into a unified perspective on multimodal agents [9][12].

Core Modules of Agent AI
- **Environment and Perception**: This module allows agents to actively perceive information from the physical or virtual world, incorporating task planning and skill observation [13].
- **Cognition**: Defined as the processing center of the agent, this module utilizes large language models (LLMs) and visual-language models (VLMs) to interpret sensory information and develop strategies [14].
- **Action**: This module generates specific operational commands based on cognitive decisions, enabling interaction with both physical and virtual environments [15].
- **Learning**: Emphasizes the agent's ability to continuously learn and evolve through various mechanisms, including reinforcement learning and imitation learning [16].
- **Memory**: Unlike traditional models, this module provides a structured and persistent memory system that allows agents to leverage past experiences for future tasks [17][18].

Role of Large Models
- Large foundational models, particularly LLMs and VLMs, serve as the cognitive backbone of Agent AI, enabling agents to perform complex tasks with minimal predefined rules [20].
- The paper highlights the challenge of "hallucination," where models generate inaccurate content, and proposes environmental interaction as a way to mitigate this issue [21].

Ethical and Regulatory Considerations
- The article stresses the importance of inclusivity and ethical considerations in the design of Agent AI, advocating for diverse training data and bias detection mechanisms [22].
- It also addresses the need for clear regulations and frameworks to ensure data privacy and security, especially in sensitive applications [22].

Application Potential
- **Gaming**: Agent AI can revolutionize non-player character (NPC) behavior, allowing for dynamic interactions and personalized experiences in gaming environments [25][26].
- **Robotics**: Agents can autonomously plan and execute complex physical tasks based on natural language commands, enhancing user interaction with robots [28].
- **Healthcare**: Agent AI can assist in preliminary medical consultations and patient monitoring, significantly improving healthcare delivery, especially in resource-limited settings [30][32].

Future Directions
- The article acknowledges that Agent AI is still in its early stages and faces challenges in achieving deep integration across various modalities and domains [33].
- It emphasizes the need for standardized evaluation metrics to assess agent intelligence and guide future research [33].
Fei-Fei Li's Answer: After Large Models, Where Do Agents Go?
Hu Xiu· 2025-09-05 00:34
Core Insights
- The article discusses the rising prominence of Agent AI, with 2025 being viewed as a pivotal year for this technology [1][2].
- A significant paper led by Fei-Fei Li, titled "Agent AI: Surveying the Horizons of Multimodal Interaction," has sparked extensive discussion in the industry [3][6].

Summary by Sections

Overview of the Paper
- The paper, consisting of 80 pages, provides a clear framework for the somewhat chaotic field of Agent AI, integrating various technological strands into a new multimodal perspective [5][6].
- It emphasizes the evolution from large models to agents, reflecting the current strategies of major players like Google, OpenAI, and Microsoft [6].

New Paradigm of Agent AI
- The paper introduces a novel cognitive architecture for Agent AI, which is not merely a compilation of existing technologies but a forward-thinking approach to the development of Artificial General Intelligence (AGI) [9].
- It defines five core modules: Environment and Perception, Cognition, Action, Learning, and Memory, which together form an interactive cognitive loop [10][26].

Core Modules Explained
- **Environment and Perception**: Agents actively perceive information from their surroundings in a multimodal manner, incorporating various data types [12][13].
- **Cognition**: Acts as the processing center for agents, enabling complex activities such as reasoning and empathy [15][16].
- **Action**: Converts cognitive decisions into specific operational commands, affecting both physical and virtual environments [18][19].
- **Learning**: Highlights the continuous learning and self-evolution capabilities of agents through various mechanisms [20][21].
- **Memory**: Offers a structured system for long-term knowledge retention, allowing agents to leverage past experiences for new tasks [23][24].

Role of Large Models
- The framework's feasibility is attributed to the maturity of large foundational models, particularly LLMs and VLMs, which provide essential cognitive capabilities for agents [28][29].
- These models enable agents to decompose vague instructions into actionable tasks, significantly reducing the complexity of task programming [30][31].

Challenges and Ethical Considerations
- The paper identifies the issue of "hallucination" in models, where they may generate inaccurate content, posing risks in real-world interactions [32][33].
- It emphasizes the need for inclusivity in designing Agent AI, addressing biases in training data and ensuring ethical interactions [36][39].
- The importance of establishing regulatory frameworks for data privacy and security in Agent AI applications is also highlighted [38][39].

Application Potential
- The paper explores the vast application potential of Agent AI in gaming, robotics, and healthcare [40].
- In gaming, Agent AI can create dynamic NPCs that interact meaningfully with players, enhancing immersion [42][43].
- In robotics, agents can autonomously execute complex tasks based on simple verbal commands, streamlining user interaction [48][49].
- In healthcare, Agent AI can assist in preliminary diagnostics and patient monitoring, improving efficiency in resource-limited settings [54][57].

Future Directions
- The paper acknowledges that Agent AI is still in its early stages, facing challenges in integrating multiple modalities and creating general-purpose agents [58][60].
- It proposes new evaluation benchmarks to measure agent intelligence and guide future research [61].
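One way to read the paper's point about environmental interaction as a hedge against hallucination is that the agent treats the environment's feedback, rather than the model's own claim, as ground truth for whether a step worked. The sketch below illustrates that pattern only; the Environment class, action names, and fallback behavior are assumptions made for this example.

```python
# Illustrative sketch of environment-grounded execution: each planned step is
# checked against feedback returned by the environment, and the agent falls
# back when the world rejects a step. All names here are assumptions for this
# sketch, not an API from the paper.

class Environment:
    def __init__(self, available_actions):
        self.available_actions = set(available_actions)

    def execute(self, action: str) -> dict:
        # Ground truth comes from the world, not from the model's own claim.
        return {"action": action, "succeeded": action in self.available_actions}


def run_grounded(plan, env, fallback="ask_user"):
    """Execute a plan step by step, replacing steps the environment rejects."""
    results = []
    for action in plan:
        outcome = env.execute(action)
        if not outcome["succeeded"]:
            # A real agent would re-prompt its LLM with the failure as context;
            # here we simply substitute a known-safe fallback action.
            outcome = env.execute(fallback)
        results.append(outcome)
    return results


env = Environment(available_actions={"move_to_cup", "grasp_cup", "ask_user"})
print(run_grounded(["move_to_cup", "pour_coffee"], env))
# -> the second step is rejected by the environment and replaced with "ask_user"
```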
CLIP, first proposed by OpenAI, has been extended by Meta, together with Saining Xie and Zhuang Liu, to 300+ languages worldwide
机器之心· 2025-07-31 05:11
Core Viewpoint
- The article discusses the introduction of MetaCLIP 2, a novel method for training the CLIP model on a global scale without relying on external resources, addressing the challenges of multilingual data processing and enhancing model performance across languages [2][4].

Group 1: MetaCLIP 2 Overview
- MetaCLIP 2 is the first method to train CLIP from scratch on native global image-text pairs, overcoming the limitations of previous models that primarily focused on English data [2][5].
- The method includes three core innovations: metadata expansion to over 300 languages, a data filtering algorithm that balances concept distribution across languages, and a global training framework that proportionally increases the number of seen image-text pairs as non-English data is introduced [5][20].

Group 2: Performance Improvements
- MetaCLIP 2 demonstrates that non-English data can enhance the capabilities of English models and vice versa, effectively breaking the "curse of multilinguality" [10][31].
- The model achieved state-of-the-art (SOTA) results on various multilingual benchmarks, including improvements of 3.8% on Babel-ImageNet and 1.1% on XM3600, among others [32][34].

Group 3: Training Methodology
- The training framework of MetaCLIP 2 maintains consistency with OpenAI's CLIP architecture while introducing key components such as a multilingual text tokenizer and scaling of seen training pairs [26][30].
- The model's training data was expanded from 13 billion pairs to 29 billion pairs, resulting in significant performance gains on both English and multilingual tasks [38][39].

Group 4: Cultural and Linguistic Diversity
- MetaCLIP 2 retains a comprehensive distribution of global images, enhancing geographical localization and regional recognition capabilities [13][15].
- The model learns directly from image descriptions written by native speakers, avoiding reliance on machine translation, which improves the authenticity and accuracy of the training data [12][16].
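The balancing idea behind the data-filtering innovation can be illustrated with a small curation sketch: match alt-texts against per-language metadata concepts, then cap how many pairs any single over-represented concept may contribute. This is a simplified, assumption-laden illustration (naive substring matching, an arbitrary cap), not the paper's exact algorithm or thresholds.

```python
# Simplified illustration of language-aware concept balancing in the spirit of
# MetaCLIP-style curation. Matching rule and cap value are assumptions.

import random
from collections import defaultdict


def balance_pairs(pairs, metadata_by_lang, cap_per_concept=20_000, seed=0):
    """
    pairs: list of dicts like {"text": ..., "lang": ..., "url": ...}
    metadata_by_lang: {lang: set of concept strings for that language}
    Returns a subsample in which no single (language, concept) bucket
    contributes more than `cap_per_concept` pairs.
    """
    rng = random.Random(seed)
    buckets = defaultdict(list)  # (lang, concept) -> matching pairs

    for pair in pairs:
        for concept in metadata_by_lang.get(pair["lang"], ()):
            if concept in pair["text"]:        # naive substring match
                buckets[(pair["lang"], concept)].append(pair)

    kept, seen = [], set()
    for _bucket, matched in buckets.items():
        if len(matched) > cap_per_concept:     # head concept: subsample
            matched = rng.sample(matched, cap_per_concept)
        for pair in matched:                   # tail concepts are kept in full
            if pair["url"] not in seen:        # avoid double-counting pairs
                seen.add(pair["url"])          # that match several concepts
                kept.append(pair)
    return kept


# Toy usage with two languages and a tiny cap to make the effect visible.
toy_pairs = [{"text": "a photo of a cat", "lang": "en", "url": f"u{i}"} for i in range(5)]
toy_pairs += [{"text": "una foto de un perro", "lang": "es", "url": "u5"}]
meta = {"en": {"cat"}, "es": {"perro"}}
print(len(balance_pairs(toy_pairs, meta, cap_per_concept=3)))  # -> 4 (3 cat + 1 perro)
```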
Key Advances, Applications, Datasets, and Methods in Multimodal Large Language Models (LLMs) and Video-Language Pre-training
36Kr· 2025-07-23 02:45
Core Insights
- The article reviews recent advances in large-scale video-language pre-training, focusing on representation learning from weakly labeled subtitles and videos [1][2].

Group 1: Introduction
- Video-language pre-training uses weak subtitles and videos for representation learning, following the standard pre-train-then-fine-tune paradigm [2].
- Pre-training typically involves self-supervised learning on large datasets, while fine-tuning is conducted on smaller, task-specific datasets, reducing the need to train new models for each task [2].

Group 2: Recent Developments and Applications
- Dataset size is critical for representation learning; researchers increasingly use large, weakly labeled cross-modal data from the internet, which has led to a surge in cross-modal research [3].
- Significant progress in visual-language pre-training is exemplified by the Contrastive Language-Image Pre-training (CLIP) model, which learns multimodal representations from weakly supervised data [3].
- Large video datasets such as HowTo100M, containing 136 million narrated video clips, have been introduced, driving advances in video-language pre-training and opening new avenues for video understanding tasks [3].

Group 3: Open Video-Language Pre-training Datasets
- The scale and quality of pre-training datasets are crucial for learning robust visual representations, especially for Transformer-based models [6].
- Key datasets include:
  - Kinetics: a large-scale action recognition dataset with up to 650,000 video clips across a wide range of human action categories [7].
  - ActivityNet Captions: 20,000 videos with 100,000 unique descriptions [8].
  - HowTo100M: a large narrated-video dataset with over 136 million video clips [8].
  - WebVid: over 2 million weakly labeled videos [8].
  - HD-VILA: the first high-resolution dataset, with 100 million video clips [8].

Group 4: Video-Language Pre-training Methods
- Recent methods mainly use Transformers as feature extractors for learning from large-scale multimodal data, and can be categorized into single-stream and two-stream approaches [10].
- Single-stream methods, such as VideoBERT, HERO, and VATT, encode multimodal inputs jointly [10][11].
- Two-stream methods, such as CBT and UniVL, offer greater flexibility by extracting features from each modality separately [11].
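For the two-stream family, the core training signal is typically a CLIP-style symmetric contrastive loss between separately encoded video and text features. The minimal PyTorch sketch below illustrates that objective, with linear projections standing in for the real Transformer towers; the feature dimensions and initial temperature are illustrative assumptions.

```python
# Minimal sketch of a two-stream contrastive objective in the spirit of
# CLIP-style video-language pre-training: video and text are encoded by
# separate towers and aligned with a symmetric InfoNCE loss.

import torch
import torch.nn as nn
import torch.nn.functional as F


class TwoStreamModel(nn.Module):
    def __init__(self, video_dim=1024, text_dim=768, embed_dim=256):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, embed_dim)  # stands in for a video tower
        self.text_proj = nn.Linear(text_dim, embed_dim)    # stands in for a text tower
        self.logit_scale = nn.Parameter(torch.tensor(2.659))  # ~log(1/0.07)

    def forward(self, video_feats, text_feats):
        v = F.normalize(self.video_proj(video_feats), dim=-1)
        t = F.normalize(self.text_proj(text_feats), dim=-1)
        logits = self.logit_scale.exp() * v @ t.T           # batch x batch similarities
        targets = torch.arange(v.size(0), device=v.device)  # matched pairs on the diagonal
        loss_v2t = F.cross_entropy(logits, targets)
        loss_t2v = F.cross_entropy(logits.T, targets)
        return (loss_v2t + loss_t2v) / 2


model = TwoStreamModel()
video = torch.randn(8, 1024)  # e.g. pooled clip features
text = torch.randn(8, 768)    # e.g. pooled subtitle features
print(model(video, text).item())
```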
ICCV 2025 | Training too complex? Requirements on image semantics and layout too strict? Image morphing finally done in one step
机器之心· 2025-07-18 00:38
Core Viewpoint
- The article introduces FreeMorph, a novel training-free image morphing method that enables high-quality, smooth transitions between two input images without pre-training or additional annotations [5][32].

Group 1: Background and Challenges
- Image morphing is a creative task that produces smooth transitions between two distinct images, commonly seen in animation and photo editing [3].
- Traditional methods relied on complex algorithms and suffered from high training costs, data dependency, and instability in real-world applications [4].
- Recent deep-learning approaches such as GANs and VAEs have improved image morphing but still struggle with training cost and adaptability [4][5].

Group 2: FreeMorph Methodology
- FreeMorph sidesteps these challenges by eliminating the need for training, achieving effective morphing from just two input images [5].
- The method introduces two key innovations: spherical feature aggregation and a prior-driven self-attention mechanism, which help preserve identity features and ensure smooth transitions [11][32].
- A step-oriented motion flow controls the transition direction, yielding a coherent and gradual morphing process [21][32].

Group 3: Experimental Results
- FreeMorph was evaluated against existing methods and demonstrates superior performance, generating high-fidelity results across diverse scenarios, including images with differing semantics and layouts [27][30].
- The method captures subtle changes, such as color variations in objects or nuanced facial expressions, showcasing its versatility [27][30].

Group 4: Limitations
- FreeMorph still struggles with images that differ greatly in semantics or layout, which can result in less smooth transitions [34].
- The method inherits biases from the underlying Stable Diffusion model, affecting accuracy in specific contexts such as human limb structure [34].
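FreeMorph's full pipeline modifies the self-attention of a pretrained Stable Diffusion model, which is not reproduced here; the snippet below only illustrates spherical linear interpolation (slerp) between two latent tensors, the basic blending primitive behind spherical-feature-style aggregation.

```python
# Spherical linear interpolation (slerp) between two latent tensors, a common
# building block for blending diffusion latents or features on the unit sphere.
# This is only the interpolation primitive, not FreeMorph's full method.

import torch


def slerp(z0: torch.Tensor, z1: torch.Tensor, alpha: float, eps: float = 1e-7) -> torch.Tensor:
    """Interpolate between z0 (alpha=0) and z1 (alpha=1) along a great circle."""
    a, b = z0.flatten(), z1.flatten()
    cos_theta = torch.dot(a, b) / (a.norm() * b.norm() + eps)
    theta = torch.acos(cos_theta.clamp(-1 + eps, 1 - eps))
    if theta.abs() < eps:                    # nearly parallel: fall back to lerp
        return (1 - alpha) * z0 + alpha * z1
    sin_theta = torch.sin(theta)
    w0 = torch.sin((1 - alpha) * theta) / sin_theta
    w1 = torch.sin(alpha * theta) / sin_theta
    return w0 * z0 + w1 * z1


# Example: a 9-frame morphing sequence between two noise latents.
z_start = torch.randn(4, 64, 64)
z_end = torch.randn(4, 64, 64)
sequence = [slerp(z_start, z_end, a) for a in torch.linspace(0, 1, 9).tolist()]
print(len(sequence), sequence[0].shape)
```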
Driven to despair by the big AI companies, these Europeans launched a "scientific renaissance movement"
AI科技大本营· 2025-06-24 07:45
Core Viewpoint
- The article discusses the emergence of LAION as a response to increasing centralization and opacity in artificial intelligence, emphasizing the need for open datasets and reproducibility in research [7][25].

Group 1: Emergence of LAION
- LAION was founded to counter the trend of AI research being locked into "black boxes" controlled by a few tech giants, which undermines scientific reproducibility [2][7].
- The initiative began with Christoph Schuhmann's idea to create a dataset from Common Crawl, leading to the formation of a collaborative network of scientists and enthusiasts [3][4].
- The organization defines itself as 100% non-profit and free, aiming to "liberate machine learning research" [3][4].

Group 2: Collaboration and Resources
- Collaboration between LAION and top-tier computing resources allowed the reproduction, and in some cases surpassing, of models locked in proprietary systems [4][5].
- Key figures from academia and industry joined LAION, contributing to its mission and strengthening its research capabilities [5][10].
- The organization has released large-scale open datasets such as LAION-400M and LAION-5B, which have been widely adopted by the community [16][17].

Group 3: Challenges and Achievements
- Building reproducible datasets is complex and labor-intensive, requiring large-scale data collection and quality assurance [28][31].
- Despite initial expectations of mediocrity, models trained on LAION's open datasets performed comparably to, or better than, proprietary models, demonstrating the potential of open research [17][29].
- The transparency of open datasets allows issues to be identified and fixed, improving the overall quality of research outputs [30][31].

Group 4: The Future of AI Research
- The article highlights the importance of open data and reproducibility in advancing AI research, suggesting that a collaborative approach can lead to significant breakthroughs [25][26].
- Ongoing work on reasoning models points to a shift toward improving the robustness and reliability of AI systems, with a focus on expanding training datasets [41][43].
- The future of AI research may depend on building a more organized framework within the open-source community to harness collective talent and resources [45].
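A common quality-assurance step when turning raw web crawls into open image-text datasets of this kind is to keep a pair only when an off-the-shelf CLIP model rates the image and its alt-text as sufficiently similar. The sketch below illustrates that filtering step; the encoder stand-ins and the threshold value are assumptions for this example rather than LAION's exact pipeline.

```python
# Sketch of CLIP-similarity filtering for web-crawled image-text pairs.
# `embed_image` and `embed_text` are assumed stand-ins for real CLIP encoders;
# the threshold is illustrative, not an exact value from the LAION pipeline.

import numpy as np


def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))


def filter_pairs(pairs, embed_image, embed_text, threshold=0.28):
    """pairs: iterable of (image, alt_text); keeps only pairs the encoders agree on."""
    kept = []
    for image, alt_text in pairs:
        score = cosine(embed_image(image), embed_text(alt_text))
        if score >= threshold:
            kept.append((image, alt_text, score))
    return kept


# Toy usage with random stand-in encoders (real pipelines use a CLIP model).
rng = np.random.default_rng(0)
fake_embed = lambda _: rng.normal(size=512)
sample = [("img_bytes_1", "a cat on a sofa"), ("img_bytes_2", "buy cheap watches")]
print(len(filter_pairs(sample, fake_embed, fake_embed, threshold=0.0)))
```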
Large models can spontaneously form a "map of human thought"! A landmark study in a Nature sub-journal reveals the brain-like mechanisms of multimodal large models
机器人圈· 2025-06-11 11:43
Core Viewpoint
- The research, published in Nature Machine Intelligence, demonstrates that multimodal large language models (MLLMs) can develop human-like object concept representations, challenging the notion that these models merely mimic human language without true understanding [2][4].

Group 1: Research Findings
- The study analyzed 4.7 million behavioral judgments to construct a "concept map" of AI models, confirming that MLLMs can form object concept representations similar to those of humans [3][6].
- Using a sparse positive similarity embedding method, the researchers identified 66 core cognitive dimensions, revealing that both ChatGPT-3.5 and the multimodal Gemini model exhibit stable low-dimensional representation structures [9].
- MLLMs spontaneously formed 18 high-level object categories with a classification accuracy of 78.3%, approaching the human accuracy of 87.1% [13].

Group 2: Methodology
- The research employed a novel "behavioral cognitive probe" approach, integrating computational modeling, behavioral experiments, and neuroscience to analyze AI cognition [8].
- A triplet odd-one-out task was designed to assess the similarity of object representations between AI and humans, allowing a comparative analysis of decision-making processes [5][31].

Group 3: Cognitive Dimensions
- The study assigned semantic labels to the models' cognitive dimensions, categorizing them into dimensions related to semantic categories, perceptual features, and physical components [17][19][20].
- The findings indicated a significant correlation between MLLM representations and human brain activity patterns, particularly in areas responsible for processing faces, scenes, and bodies [23][24].

Group 4: Implications and Future Directions
- The research has broad applications, including the development of neuro-aligned AI systems, exploration of the neural mechanisms behind concept combination and reasoning, and enhancement of brain-computer interface systems [35].
- Future work will focus on next-generation multimodal models and on establishing a cognitive benchmark testing platform to objectively assess AI's semantic understanding [35][36].
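The triplet odd-one-out task can be expressed very compactly: given representations of three objects, the pair with the highest similarity "belongs together" and the remaining item is the odd one out. The sketch below shows that decision rule on placeholder embeddings; in the study, the judgments come from model behavior and human participants, not random vectors.

```python
# Sketch of the triplet odd-one-out probe: the odd item is the one outside the
# most similar pair. Embeddings here are random placeholders for illustration.

import numpy as np


def odd_one_out(embeddings: np.ndarray) -> int:
    """embeddings: (3, d) array; returns the index of the least-fitting item."""
    sims = embeddings @ embeddings.T               # pairwise similarities
    pairs = [(0, 1), (0, 2), (1, 2)]
    best_pair = max(pairs, key=lambda p: sims[p])  # most similar pair stays together
    return ({0, 1, 2} - set(best_pair)).pop()


rng = np.random.default_rng(0)
objects = ["apple", "banana", "bicycle"]
emb = rng.normal(size=(3, 8))
print(objects[odd_one_out(emb)])
```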