Fei-Fei Li's Answer: After Large Models, Where Do Agents Go Next?
虎嗅APP· 2025-09-07 02:51
The following article is sourced from the WeChat public account 划重点KeyPoints (author: Lin Yi; editor: Zhongdian Jun), which tracks global AI technology and documents China's hard-tech rise. Original title: "Fei-Fei Li's Answer: After Large Models, Where Do Agents Go Next?"; header image: Visual China.

2025 is widely regarded as the inaugural year of the Agent, and related concepts, including intelligent agents, AI Agents, and Agentic AI, have kept gaining attention since the start of the year.

Recently, a heavyweight Agent paper led by Fei-Fei Li has drawn wide discussion across the industry, and the buzz has not died down. Readers commented: "I practically read it on my knees" and "So clear it gripped me for three hours straight."

The 80-page survey, titled "Agent AI: Surveying the Horizons of Multimodal Interaction," was jointly written by 14 experts from Stanford University and Microsoft, led by Fei-Fei Li.

Moreover, although the paper was first published at the end of last year, looking back from today at this year's Agent developments, the core playbooks of mainstream players such as Google, OpenAI, and Microsoft have advanced almost exactly along the capability stack laid out in the paper, which in turn confirms the paper's foresight regarding the evolutionary path "from large models to Agents" ...
Fei-Fei Li's Answer: After Large Models, Where Do Agents Go Next?
创业邦· 2025-09-05 11:12
The following article is sourced from the WeChat public account 划重点KeyPoints (ID: huazhongdian123; author: Lin Yi; editor: Zhongdian Jun), which tracks global AI technology and documents China's hard-tech rise. Image source: TED.

Key points:

1. Fei-Fei Li's latest paper draws the boundaries and establishes the paradigm for today's red-hot Agent field. The latest moves by giants such as Google, OpenAI, and Microsoft almost all follow the capability stack the paper lays out.

2. The paper proposes a complete closed cognitive-loop architecture: from perception, cognition, and action to learning and memory, forming a dynamically iterating agent system. This is not merely an integration of technologies but a systematic vision of the path toward AGI.

3. Large models are the core engine driving Agents, but environment interaction is the key anchor for tackling hallucination and bias. The paper stresses that LLMs/VLMs supply cognitive capability, yet they must be calibrated against reality through feedback from real or simulated environments, reducing hallucination and introducing ethics and safety mechanisms.

4. The application potential spans three frontier areas: gaming, robotics, and healthcare. Immersive NPCs in games, autonomous planning and physical manipulation in robotics, and intelligent consultation and health management in medicine show a clear path for Agents from theory to practice.

2025 is widely regarded as the inaugural year of the Agent, and related concepts ...
Fei-Fei Li's Answer: After Large Models, Where Do Agents Go Next?
Hu Xiu· 2025-09-05 00:34
Core Insights
- The article discusses the rising prominence of Agent AI, with 2025 being viewed as a pivotal year for this technology [1][2]
- A significant paper led by Fei-Fei Li titled "Agent AI: Surveying the Horizons of Multimodal Interaction" has sparked extensive discussion in the industry [3][6]

Summary by Sections

Overview of the Paper
- The paper, consisting of 80 pages, provides a clear framework for the somewhat chaotic field of Agent AI, integrating various technological strands into a new multimodal perspective [5][6]
- It emphasizes the evolution from large models to agents, reflecting the current strategies of major players like Google, OpenAI, and Microsoft [6]

New Paradigm of Agent AI
- The paper introduces a novel cognitive architecture for Agent AI, which is not merely a compilation of existing technologies but a forward-thinking approach to the development of Artificial General Intelligence (AGI) [9]
- It defines five core modules: Environment and Perception, Cognition, Action, Learning, and Memory, which together form an interactive cognitive loop [10][26]

Core Modules Explained
- **Environment and Perception**: Agents actively perceive information from their surroundings in a multimodal manner, incorporating various data types [12][13]
- **Cognition**: Acts as the processing center for agents, enabling complex activities such as reasoning and empathy [15][16]
- **Action**: Converts cognitive decisions into specific operational commands, affecting both physical and virtual environments [18][19]
- **Learning**: Highlights the continuous learning and self-evolution capabilities of agents through various mechanisms [20][21]
- **Memory**: Offers a structured system for long-term knowledge retention, allowing agents to leverage past experiences for new tasks [23][24]

Role of Large Models
- The framework's feasibility is attributed to the maturity of large foundational models, particularly LLMs and VLMs, which provide essential cognitive capabilities for agents [28][29]
- These models enable agents to decompose vague instructions into actionable tasks, significantly reducing the complexity of task programming [30][31]

Challenges and Ethical Considerations
- The paper identifies the issue of "hallucination" in models, where they may generate inaccurate content, posing risks in real-world interactions [32][33]
- It emphasizes the need for inclusivity in designing Agent AI, addressing biases in training data and ensuring ethical interactions [36][39]
- The importance of establishing regulatory frameworks for data privacy and security in Agent AI applications is also highlighted [38][39]

Application Potential
- The paper explores the vast application potential of Agent AI in gaming, robotics, and healthcare [40]
- In gaming, Agent AI can create dynamic NPCs that interact meaningfully with players, enhancing immersion [42][43]
- In robotics, agents can autonomously execute complex tasks based on simple verbal commands, streamlining user interaction [48][49]
- In healthcare, Agent AI can assist in preliminary diagnostics and patient monitoring, improving efficiency in resource-limited settings [54][57]

Future Directions
- The paper acknowledges that Agent AI is still in its early stages, facing challenges in integrating multiple modalities and creating general-purpose agents [58][60]
- It proposes new evaluation benchmarks to measure agent intelligence and guide future research [61]
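Read as pseudocode, the five modules form a perceive-cognize-act-learn-remember cycle. The Python below is a minimal, hypothetical sketch of such a loop; the class and method names are illustrative assumptions, not an API defined by the survey.

```python
# Minimal sketch of the perception-cognition-action-learning-memory loop
# described in the Agent AI survey. All names here are illustrative
# assumptions, not definitions taken from the paper.

from dataclasses import dataclass, field
from typing import List


@dataclass
class Memory:
    """Long-term store the agent can query when planning new tasks."""
    episodes: List[dict] = field(default_factory=list)

    def recall(self, k: int = 3) -> List[dict]:
        return self.episodes[-k:]          # naive recency-based retrieval

    def store(self, episode: dict) -> None:
        self.episodes.append(episode)


class Agent:
    def __init__(self) -> None:
        self.memory = Memory()

    def perceive(self, environment: dict) -> dict:
        # Environment and Perception: read multimodal observations (stubbed).
        return {"observation": environment.get("state")}

    def decide(self, percept: dict) -> str:
        # Cognition: a real agent would call an LLM/VLM here.
        past = self.memory.recall()
        return f"act_on:{percept['observation']} (context={len(past)} memories)"

    def act(self, decision: str, environment: dict) -> dict:
        # Action: apply the decision and read back environment feedback.
        environment["state"] = decision
        return {"feedback": f"executed {decision}"}

    def learn(self, percept: dict, decision: str, feedback: dict) -> None:
        # Learning: store the experience so future decisions can use it.
        self.memory.store({"percept": percept, "decision": decision, **feedback})


if __name__ == "__main__":
    env = {"state": "door_closed"}
    agent = Agent()
    for _ in range(3):                      # the dynamic, iterative loop
        p = agent.perceive(env)
        d = agent.decide(p)
        f = agent.act(d, env)
        agent.learn(p, d, f)
    print(agent.memory.recall())
```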
CLIP, Proposed by OpenAI, Extended to 300+ Languages Worldwide by Meta Together with Saining Xie and Zhuang Liu
机器之心· 2025-07-31 05:11
Core Viewpoint
- The article discusses the introduction of MetaCLIP 2, a novel method for training the CLIP model on a global scale without relying on external resources, addressing the challenges of multilingual data processing and enhancing model performance across languages [2][4].

Group 1: MetaCLIP 2 Overview
- MetaCLIP 2 is the first method to train CLIP from scratch on native global image-text pairs, overcoming the limitations of previous models that primarily focused on English data [2][5].
- The method includes three core innovations: metadata expansion to over 300 languages, a data filtering algorithm to balance concept distribution across languages, and a global training framework that proportionally increases the use of image-text pairs as non-English data is introduced [5][20].

Group 2: Performance Improvements
- MetaCLIP 2 demonstrates that non-English data can enhance the capabilities of English models and vice versa, effectively breaking the "multilingual curse" [10][31].
- The model achieved state-of-the-art (SOTA) results in various multilingual benchmarks, including improvements of 3.8% on Babel-ImageNet and 1.1% on XM3600, among others [32][34].

Group 3: Training Methodology
- The training framework of MetaCLIP 2 maintains consistency with OpenAI's CLIP architecture while introducing key components such as a multilingual text tokenizer and scaling of seen training pairs [26][30].
- The model's training data was expanded from 13 billion pairs to 29 billion pairs, resulting in significant performance enhancements across both English and multilingual tasks [38][39].

Group 4: Cultural and Linguistic Diversity
- MetaCLIP 2 retains a comprehensive distribution of global images, enhancing geographical localization and regional recognition capabilities [13][15].
- The model directly learns from image descriptions written by native speakers, avoiding reliance on machine translation, which improves the authenticity and accuracy of the training data [12][16].
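As a rough illustration of the balancing idea in the second innovation, the sketch below caps how many image-text pairs any single (language, concept) metadata entry may contribute, so head concepts no longer drown out the tail. The field names and cap value are assumptions for illustration; the actual MetaCLIP 2 curation algorithm is more elaborate.

```python
# Hedged sketch of per-language concept balancing for image-text curation.
# Field names and the cap value are illustrative assumptions only.

import random
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Pair = Dict[str, str]  # e.g. {"lang": "de", "concept": "hund", "url": "..."}


def balance_pairs(pairs: Iterable[Pair], cap_per_entry: int = 20_000,
                  seed: int = 0) -> List[Pair]:
    """Keep at most `cap_per_entry` pairs per (language, concept) bucket,
    sampling uniformly so frequent concepts stop dominating rare ones."""
    rng = random.Random(seed)
    buckets: Dict[Tuple[str, str], List[Pair]] = defaultdict(list)
    for p in pairs:
        buckets[(p["lang"], p["concept"])].append(p)

    kept: List[Pair] = []
    for bucket in buckets.values():
        if len(bucket) > cap_per_entry:
            bucket = rng.sample(bucket, cap_per_entry)
        kept.extend(bucket)
    return kept


if __name__ == "__main__":
    demo = ([{"lang": "en", "concept": "dog", "url": f"u{i}"} for i in range(50)]
            + [{"lang": "de", "concept": "hund", "url": f"v{i}"} for i in range(5)])
    print(len(balance_pairs(demo, cap_per_entry=10)))  # 10 en + 5 de = 15
```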
Key Advances, Applications, Datasets, and Methods in Multimodal Large Language Models (LLMs) and Video-Language Pre-training
36Kr· 2025-07-23 02:45
Core Insights
- The article discusses the recent advancements in large-scale video language pre-training tasks, focusing on representation learning using weakly labeled subtitles and videos [1][2].

Group 1: Introduction
- The task of video language pre-training employs weak subtitles and videos for representation learning, utilizing a standard learning paradigm of pre-training and fine-tuning [2].
- Pre-training typically involves self-supervised learning on large datasets, while fine-tuning is conducted on smaller datasets for specific tasks, reducing the need to train new models for different tasks [2].

Group 2: Recent Developments and Applications
- The importance of dataset size for representation learning is emphasized, with researchers utilizing large, weakly labeled cross-modal data from the internet, leading to a surge in cross-modal task research [3].
- Significant progress in visual language pre-training is highlighted by the Contrastive Language-Image Pre-training (CLIP) model, which learns multimodal representations from weakly supervised data [3].
- Large video datasets like Howto100M, containing 136 million clips drawn from narrated instructional videos, have been introduced, facilitating advancements in video language pre-training and opening new avenues for video understanding tasks [3].

Group 3: Open Video Language Pre-training Datasets
- The scale and quality of pre-training datasets are crucial for learning robust visual representations, especially for Transformer-based models [6].
- Key datasets include:
  - Kinetics: a large-scale action recognition dataset with up to 650,000 video clips across various human action categories [7].
  - ActivityNet Captions: contains 20,000 videos with 100,000 unique descriptions [8].
  - Howto100M: a large narrated video dataset with over 136 million video clips [8].
  - WebVid: contains over 2 million weakly labeled videos [8].
  - HD-VILA: the first high-resolution dataset, with 100 million video clips [8].

Group 4: Video Language Pre-training Methods
- Recent methods primarily use Transformer as a feature extractor for learning from large-scale multimodal data, categorized into single-stream and two-stream approaches [10].
- Single-stream methods include VideoBERT, HERO, and VATT, focusing on encoding multimodal inputs [10][11].
- Two-stream methods like CBT and UniVL provide greater flexibility by separately extracting features from different modalities [11].
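The CLIP-style alignment mentioned above rests on a symmetric contrastive (InfoNCE) objective between the two modalities. Below is a minimal sketch of that loss for paired video and text embeddings; the encoders that would produce these embeddings are placeholders, and this is a generic illustration rather than any specific paper's implementation.

```python
# Hedged sketch of a symmetric CLIP/InfoNCE-style contrastive objective for
# video-text pre-training, assuming encoders that output fixed-size embeddings.

import torch
import torch.nn.functional as F


def clip_contrastive_loss(video_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """video_emb, text_emb: (batch, dim) embeddings of paired clips/captions."""
    video_emb = F.normalize(video_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = video_emb @ text_emb.t() / temperature   # (batch, batch) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    # Matched pairs sit on the diagonal; train both retrieval directions.
    loss_v2t = F.cross_entropy(logits, targets)
    loss_t2v = F.cross_entropy(logits.t(), targets)
    return (loss_v2t + loss_t2v) / 2


if __name__ == "__main__":
    v = torch.randn(8, 512)   # stand-ins for video encoder outputs
    t = torch.randn(8, 512)   # stand-ins for text encoder outputs
    print(clip_contrastive_loss(v, t).item())
```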
ICCV 2025 | Training too complex? Demands on image semantics and layout too high? Image morphing finally works in a single step
机器之心· 2025-07-18 00:38
Core Viewpoint
- The article introduces FreeMorph, a novel training-free image morphing method that enables high-quality and smooth transitions between two input images without the need for pre-training or additional annotations [5][32].

Group 1: Background and Challenges
- Image morphing is a creative task that allows for smooth transitions between two distinct images, commonly seen in animations and photo editing [3].
- Traditional methods relied on complex algorithms and faced challenges with high training costs, data dependency, and instability in real-world applications [4].
- Recent advancements in deep learning methods like GANs and VAEs have improved image morphing but still struggle with training costs and adaptability [4][5].

Group 2: FreeMorph Methodology
- FreeMorph addresses the challenges of image morphing by eliminating the need for training, achieving effective morphing with just two images [5].
- The method incorporates two key innovations: spherical feature aggregation and prior-driven self-attention mechanisms, enhancing the model's ability to maintain identity features and ensure smooth transitions [11][32].
- A step-oriented motion flow is introduced to control the transition direction, allowing for a coherent and gradual morphing process [21][32].

Group 3: Experimental Results
- FreeMorph has been evaluated against existing methods, demonstrating superior performance in generating high-fidelity results across diverse scenarios, including images with varying semantics and layouts [27][30].
- The method effectively captures subtle changes, such as color variations in objects or nuanced facial expressions, showcasing its versatility [27][30].

Group 4: Limitations
- Despite its advancements, FreeMorph has limitations, particularly when handling images with significant semantic or layout differences, which may result in less smooth transitions [34].
- The method inherits biases from the underlying Stable Diffusion model, affecting accuracy in specific contexts, such as human limb structures [34].
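The summary only names "spherical feature aggregation" at a high level. As a point of reference, the sketch below shows plain spherical linear interpolation (slerp) between two diffusion-style latents, the generic operation that "spherical" blending of features builds on. It is an illustrative assumption, not FreeMorph's actual module.

```python
# Hedged sketch: spherical linear interpolation (slerp) between two latent
# tensors. Illustrates the generic idea behind spherical blending only; it is
# not the exact spherical feature aggregation described in the paper.

import torch


def slerp(z0: torch.Tensor, z1: torch.Tensor, alpha: float,
          eps: float = 1e-7) -> torch.Tensor:
    """Interpolate between flattened latents along the great circle."""
    a, b = z0.flatten(), z1.flatten()
    a_n, b_n = a / a.norm(), b / b.norm()
    omega = torch.acos((a_n * b_n).sum().clamp(-1 + eps, 1 - eps))
    so = torch.sin(omega)
    out = (torch.sin((1 - alpha) * omega) / so) * a + (torch.sin(alpha * omega) / so) * b
    return out.view_as(z0)


if __name__ == "__main__":
    z_a, z_b = torch.randn(4, 64, 64), torch.randn(4, 64, 64)
    # A morphing sequence would decode one interpolated latent per step.
    frames = [slerp(z_a, z_b, t / 4) for t in range(5)]
    print(len(frames), frames[0].shape)
```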
Pushed to Despair by Big AI Companies, These Europeans Launched a "Scientific Renaissance Movement"
AI科技大本营· 2025-06-24 07:45
Core Viewpoint
- The article discusses the emergence of LAION as a response to the increasing centralization and opacity in the field of artificial intelligence, emphasizing the need for open datasets and reproducibility in research [7][25].

Group 1: Emergence of LAION
- LAION was founded to combat the trend of AI research being locked in "black boxes" controlled by a few tech giants, which hinders scientific reproducibility [2][7].
- The initiative began with Christoph Schuhmann's idea to create a dataset from Common Crawl, leading to the formation of a collaborative network of scientists and enthusiasts [3][4].
- The organization is defined by its commitment to being 100% non-profit and free, aiming to "liberate machine learning research" [3][4].

Group 2: Collaboration and Resources
- The collaboration between LAION and top-tier computing resources allowed for the reproduction, and even surpassing, of models locked in proprietary systems [4][5].
- Key figures from various backgrounds, including academia and industry, joined LAION, contributing to its mission and enhancing its research capabilities [5][10].
- The organization has successfully released large-scale open datasets like LAION-400M and LAION-5B, which have been widely adopted in the community [16][17].

Group 3: Challenges and Achievements
- The process of building reproducible datasets is complex and requires significant effort, including data collection and quality assurance [28][31].
- Despite initial expectations of mediocrity, models trained on LAION's open datasets performed comparably to or better than proprietary models, demonstrating the potential of open research [17][29].
- The transparency of open datasets allows for the identification and rectification of issues, enhancing the overall quality of research outputs [30][31].

Group 4: The Future of AI Research
- The article highlights the importance of open data and reproducibility in advancing AI research, suggesting that a collaborative approach can lead to significant breakthroughs [25][26].
- The ongoing exploration of reasoning models indicates a shift towards improving the robustness and reliability of AI systems, with a focus on expanding the dataset for training [41][43].
- The future of AI research may depend on the ability to create a more organized framework within the open-source community to harness collective talent and resources [45].
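One concrete ingredient of that dataset-building and quality-assurance work is CLIP-score filtering of crawled image-text pairs, the approach reported for the LAION-400M and LAION-5B releases. Below is a minimal sketch using the Hugging Face transformers CLIP API; the 0.28 threshold is an illustrative value in the range reported for those datasets, not an official setting.

```python
# Hedged sketch of CLIP-similarity filtering for web-crawled image-text pairs.
# The threshold and caption are illustrative; real pipelines batch this step.

import requests
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


def keep_pair(image: Image.Image, caption: str, threshold: float = 0.28) -> bool:
    """Keep a pair only if the CLIP cosine similarity exceeds the threshold."""
    inputs = processor(text=[caption], images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return float((img * txt).sum()) >= threshold


if __name__ == "__main__":
    url = "http://images.cocodataset.org/val2017/000000039769.jpg"  # two cats
    img = Image.open(requests.get(url, stream=True).raw)
    print(keep_pair(img, "two cats lying on a couch"))
```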
Large Models Can Spontaneously Form a "Map of Human Thought": A Landmark Study in a Nature Sub-journal Reveals the Brain-like Mechanisms of Multimodal Large Models
机器人圈· 2025-06-11 11:43
Core Viewpoint
- The research published in "Nature Machine Intelligence" demonstrates that multimodal large language models (MLLMs) can develop human-like object concept representations, challenging the notion that these models merely mimic human language without true understanding [2][4].

Group 1: Research Findings
- The study analyzed 4.7 million behavioral judgment data points to construct a "concept map" of AI models, confirming that MLLMs can form object concept representations similar to humans [3][6].
- The research identified 66 core dimensions of cognition through a sparse positive definite similarity embedding method, revealing that both ChatGPT-3.5 and the multimodal Gemini model exhibit stable low-dimensional representation structures [9].
- MLLMs spontaneously formed 18 high-level object concept categories with a classification accuracy of 78.3%, approaching the human accuracy of 87.1% [13].

Group 2: Methodology
- The research employed a novel "behavioral cognitive probe" method, integrating computational modeling, behavioral experiments, and neuroscience to analyze AI cognition [8].
- A triplet odd-one-out task was designed to assess the similarity of object representations between AI and humans, allowing for a comparative analysis of decision-making processes [5][31].

Group 3: Cognitive Dimensions
- The study provided semantic labels for the cognitive dimensions of AI models, categorizing them into dimensions related to semantic categories, perceptual features, and physical components [17][19][20].
- The findings indicated a significant correlation between MLLM representations and human brain activity patterns, particularly in areas responsible for processing faces, scenes, and bodies [23][24].

Group 4: Implications and Future Directions
- The research has broad applications, including the development of neuro-aligned AI systems, exploration of neural mechanisms for concept combination and reasoning, and enhancement of brain-computer interface systems [35].
- Future work will focus on expanding to next-generation multimodal models and establishing a cognitive benchmark testing platform to objectively assess AI's semantic understanding [35][36].
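The triplet odd-one-out task named above can be scored directly from object embeddings: treat the most similar pair as "alike" and the remaining item as the odd one out. The sketch below is an illustrative assumption of how such judgments are typically derived, not the study's exact pipeline.

```python
# Hedged sketch of a triplet odd-one-out judgment from object embeddings.
# Embeddings and the decision rule are illustrative, not the paper's code.

import numpy as np


def odd_one_out(emb: np.ndarray) -> int:
    """emb: (3, dim) embeddings of three objects; returns index of the odd item."""
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = emb @ emb.T                              # pairwise cosine similarities
    pairs = {(0, 1): 2, (0, 2): 1, (1, 2): 0}      # most-similar pair -> leftover
    best_pair = max(pairs, key=lambda ij: sim[ij])
    return pairs[best_pair]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    cat, dog = rng.normal(size=64), rng.normal(size=64)
    kitten = cat + 0.1 * rng.normal(size=64)       # deliberately close to "cat"
    print(odd_one_out(np.stack([cat, kitten, dog])))  # -> 2 (dog is the odd one)
```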
Mary Meeker: What Is the Current State of AI Adoption?
Sou Hu Cai Jing· 2025-06-11 02:17
Core Insights
- Mary Meeker's latest report highlights the rapid growth of ChatGPT's search volume, surpassing traditional Google search in just three years, marking a significant shift in internet usage [2][3]
- The report emphasizes the unprecedented speed of technological change, particularly in AI, and its global impact, contrasting it with the slower adoption rates of previous technological revolutions [4][6]

AI Growth Metrics
- Since 2010, the annual growth rate of AI training model data has reached 260%, while the required computational resources have grown at 360% [2]
- ChatGPT's user base, subscription numbers, and revenue growth indicate its widespread adoption among internet users [3]

Developer Engagement
- The number of developers in the Google ecosystem has increased from 1.4 million to 7 million, a fivefold increase since last year [5]
- Companies are leveraging AI developments to enhance user interactions, with a shift towards AI management roles in customer support [5]

Adoption Speed Comparison
- AI adoption has occurred in approximately three years, significantly faster than personal computers (20 years), desktop internet (12 years), and mobile internet (6 years) [6]

Business Investment Trends
- A Morgan Stanley survey indicates that 75% of global CMOs are experimenting with AI, with significant capital expenditures in AI projects, including a 21% increase in related capital spending and a 28% rise in data spending [6][7]

Cost Dynamics
- The report notes a "cost deflation" phenomenon, with the purchasing power for AI inference increasing tenfold annually [7]

Future AI Landscape
- New users will engage with AI in a native environment, free from traditional internet constraints, suggesting a transformative impact on daily life [8]

Global Usage Statistics
- ChatGPT usage rates are reported at 13.5% in India, 9% in the U.S., and 5% in Indonesia and Brazil [9]

U.S.-China AI Competition
- The report highlights China's leading position in large language model performance, with implications for national strategy and technological innovation [10]

Next-Generation AI Interfaces
- The transition from text to voice interfaces, and eventually to humanoid robots, is anticipated as a significant development in AI interaction [10]
The State of Models in China's Multimodal Large Model Industry, 2025: Image, Video, Audio, and 3D Models Will Eventually Be Connected and Fused [Charts]
Qian Zhan Wang· 2025-06-01 05:09
Core Insights
- The exploration of multimodal large models is making gradual progress, with a focus on breakthroughs in visual modalities, aiming for an "Any-to-Any" model that requires successful pathways across the various modalities [1]
- The industry is currently concentrating on enhancing perception and generation models in image, video, and 3D modalities, with the goal of achieving cross-modal integration and sharing [1]

Multimodal Large Models in Image
- Prior to the rise of LLMs in 2023, the industry had already established a solid foundation in image understanding and generation, resulting in models like CLIP, Stable Diffusion, and GANs, which led to applications such as Midjourney and DALL·E [2]
- The industry is actively exploring the integration of Transformer models into image-related tasks, with significant outcomes including GLIP, SAM, and GPT-4V [2]

Multimodal Large Models in Video
- Video generation is being approached by transferring image generation models to video, utilizing image data for training and aligning temporal dimensions to achieve text-to-video results [5]
- Recent advancements include models like VideoLDM and Sora, which demonstrate significant breakthroughs in video generation using the Diffusion Transformer architecture [5]

Multimodal Large Models in 3D
- The generation of 3D models is being explored by extending 2D image generation methods, with key models such as 3D GAN, MeshDiffusion, and Instant3D emerging in the industry [8][9]
- 3D data representation includes various formats such as meshes, point clouds, and NeRF, with NeRF being a critical technology for 3D data representation [9]

Multimodal Large Models in Audio
- AI technologies related to audio have matured, with recent applications of Transformer models enhancing audio understanding and generation, exemplified by projects like Whisper large-v3 and VALL-E [11]
- The evolution of speech technology can be categorized into three stages, with a focus on enhancing generalization capabilities across multiple languages and tasks [11]
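The 3D section above singles out NeRF as a key representation. Its core step is the standard volume-rendering quadrature that composites color along a camera ray from density and color samples; the sketch below shows that computation on synthetic sample values, as a generic illustration rather than a full NeRF implementation.

```python
# Hedged sketch of NeRF-style volume rendering along one camera ray: density
# and color samples are composited with transmittance weights. Sample values
# are synthetic; this shows only the standard quadrature, not a full NeRF.

import numpy as np


def render_ray(sigmas: np.ndarray, colors: np.ndarray, deltas: np.ndarray) -> np.ndarray:
    """sigmas: (N,) densities; colors: (N, 3) RGB; deltas: (N,) segment lengths."""
    alphas = 1.0 - np.exp(-sigmas * deltas)                          # segment opacity
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))   # transmittance T_i
    weights = trans * alphas
    return (weights[:, None] * colors).sum(axis=0)                   # composited RGB


if __name__ == "__main__":
    n = 64
    sigmas = np.linspace(0.0, 2.0, n)            # density rising along the ray
    colors = np.tile([0.8, 0.3, 0.1], (n, 1))    # constant reddish color samples
    deltas = np.full(n, 1.0 / n)
    print(render_ray(sigmas, colors, deltas))
```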