From LLaVA to Qwen3-VL: Deconstructing the Evolution of Multimodal Large Models
自动驾驶之心· 2025-12-08 00:02
Author | 我要吃鸡腿  Editor | 大模型之心Tech  Original link: https://zhuanlan.zhihu.com/p/1963658684765833212  (Shared for academic purposes only; reposted with permission.)

Introduction: When AI opens its eyes, what kind of future do we see?

Not long ago, our impression of artificial intelligence was still that of a clever but somewhat "blind" digital brain: it could write poetry, write code, and answer profound philosophical questions, yet all of this was confined to the cold world of text. In the past two years, however, a profound transformation has been quietly taking place.

You may have marveled at how GPT-5 holds fluent, real-time conversations about images, "seeing" the layout of your room and offering suggestions for tidying it; or been amazed by Qwen3-VL's ability to "look" directly at a phone screen, click buttons precisely, and operate applications. AI is no longer merely a language model that "only reads books"; it is evolving into an agent that can listen, see, and interact, truly opening its eyes to perceive and understand the colorful physical world we live in.

What technical secrets lie behind this leap from "symbols" to "perception" ...
From LLaVA to Qwen3-VL: The Evolution of Mainstream Multimodal Large Model Architectures
自动驾驶之心· 2025-12-03 00:04
Core Insights
- The article discusses the evolution of artificial intelligence (AI) from text-based models to multimodal large models (MLLMs) capable of perceiving and interacting with the physical world through vision and language [3][4].
- It highlights two successful technical evolution paths for MLLMs: the LLaVA series, which emphasizes simplicity, and Qwen3-VL, which focuses on deep integration [3][4].

Group 1: MLLM Architecture
- MLLMs follow a "trinity" architecture consisting of a visual encoder (Vision Transformer), a language model (LLM), and a connector that facilitates communication between the two [6][10].
- The visual encoder transforms images into mathematical representations, while the LLM processes these representations to generate coherent text responses [10][22].
- The connector acts as a bridge, translating visual features into a format the LLM can understand and ensuring seamless integration of visual and textual information (see the sketch after this summary) [36][37].

Group 2: Vision Transformer (ViT)
- ViT treats images as sequences of patches, allowing the model to leverage the Transformer architecture for visual understanding [11][13].
- The process involves segmenting images into patches, flattening them into vectors, and adding positional information to maintain spatial context [13][16].
- ViT's multi-head attention mechanism captures relationships between distant elements in an image, enhancing its ability to understand complex visual scenes [21][22].

Group 3: Language Model (LLM)
- The LLM serves as the cognitive core of the MLLM, integrating visual and textual information to generate contextually relevant responses [22][23].
- The input to the LLM is a combined sequence of visual and language tokens, allowing for a comprehensive understanding of the context [24][25].
- The LLM employs autoregressive generation to predict the next token based on the entire context, producing coherent and contextually appropriate outputs [26][30].

Group 4: Connector Design
- The connector's design is crucial for bridging the gap between the visual and textual modalities, with two main approaches: the minimalist design of LLaVA and the more complex Q-Former used in BLIP-2 [38][40].
- LLaVA's connector is a simple linear transformation that relies on the LLM's strength to learn the mapping between modalities [40][41].
- Q-Former, by contrast, actively extracts and refines key information from visual features before passing it to the LLM, improving efficiency and reducing computational load [42][53].

Group 5: Challenges and Solutions
- The article addresses the challenge of processing high-resolution images without overwhelming the model's computational capacity, which motivates different design philosophies [64].
- LLaVA's AnyRes solution allows the model to handle images of arbitrary resolution by focusing on preprocessing techniques rather than restructuring the model [65].
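To make the "trinity" layout and LLaVA's minimalist connector concrete, here is a minimal PyTorch sketch of the data flow: patch features from a vision encoder are projected by a linear connector into the LLM's embedding space and concatenated with the text token embeddings before autoregressive decoding. The class names and dimensions are illustrative assumptions for exposition, not the actual LLaVA source code.

```python
import torch
import torch.nn as nn


class LlavaStyleConnector(nn.Module):
    """Minimal sketch of a LLaVA-style projector: vision features -> LLM embedding space."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        # Early LLaVA used a single linear layer; later versions use a small MLP.
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, vision_feats: torch.Tensor) -> torch.Tensor:
        # vision_feats: [batch, num_patches, vision_dim] from a ViT encoder
        return self.proj(vision_feats)  # [batch, num_patches, llm_dim]


def build_multimodal_input(vision_feats, text_embeds, connector):
    """Concatenate projected visual tokens with text token embeddings.

    The combined sequence is what the LLM consumes for autoregressive generation.
    """
    visual_tokens = connector(vision_feats)             # [B, N_img, llm_dim]
    return torch.cat([visual_tokens, text_embeds], 1)   # [B, N_img + N_txt, llm_dim]


if __name__ == "__main__":
    connector = LlavaStyleConnector()
    vision_feats = torch.randn(1, 576, 1024)   # e.g. 24x24 patches from a ViT
    text_embeds = torch.randn(1, 32, 4096)     # embedded prompt tokens
    inputs = build_multimodal_input(vision_feats, text_embeds, connector)
    print(inputs.shape)                        # torch.Size([1, 608, 4096])
```

A Q-Former-style connector would instead insert a set of learnable query tokens that cross-attend to the visual features, so the LLM receives a fixed, compressed number of visual tokens rather than one per patch.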
Fei-Fei Li's Answer: After Large Models, Where Do Agents Go?
虎嗅APP· 2025-09-07 02:51
Core Viewpoint
- The article discusses the emergence of Agent AI, highlighting its potential to revolutionize various fields through a new cognitive architecture that integrates perception, cognition, action, learning, and memory [4][9][10].

Summary by Sections

Introduction to Agent AI
- 2025 is anticipated to be the year of Agent AI, with increasing interest in concepts like AI Agents and Agentic AI [4].
- A significant paper led by Fei-Fei Li titled "Agent AI: Surveying the Horizons of Multimodal Interaction" has sparked widespread discussion in the industry [4][6].

Framework of Agent AI
- The paper establishes a clear framework for Agent AI, integrating various technologies into a unified perspective [6][7].
- It outlines five core modules: Environment and Perception, Cognition, Action, Learning, and Memory, which together form a dynamic cognitive loop (a code sketch of this loop follows the summary) [10][12][14][16][17].

Core Modules Explained
- **Environment and Perception**: Agents actively perceive information from their surroundings, incorporating task planning and skill observation [12].
- **Cognition**: Acts as the processing center, utilizing large language models (LLMs) and visual language models (VLMs) for reasoning and strategy formulation [14].
- **Action**: Converts cognitive decisions into executable commands that affect the environment [15].
- **Learning**: Emphasizes continuous learning through various mechanisms, allowing agents to adapt based on feedback [16].
- **Memory**: Features a structured system for long-term knowledge retention, enabling agents to leverage past experiences [17].

Role of Large Models
- The development of Agent AI is driven by the maturity of foundation models, particularly LLMs and VLMs, which provide agents with extensive knowledge and planning capabilities [20].
- The paper addresses the challenge of "hallucination" in models, emphasizing the importance of environmental interaction in mitigating the issue [21][22].

Application Potential
- The paper explores Agent AI's applications in three key areas:
  - **Gaming**: Agent AI can create dynamic NPCs that interact meaningfully with players, enhancing immersion [24][25].
  - **Robotics**: Robots can execute complex tasks based on natural language commands, improving user interaction [27].
  - **Healthcare**: Agent AI can assist in preliminary diagnostics and patient monitoring, increasing efficiency in healthcare delivery [29][31].

Conclusion
- The paper recognizes that Agent AI is still in its early stages, facing challenges in integrating multiple modalities and creating general agents for diverse applications [32].
- It proposes new evaluation benchmarks to guide development and measure progress in the field [32].
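To illustrate how the five modules close into a loop, here is a minimal, framework-agnostic Python sketch of a perceive-cognize-act-learn cycle backed by a persistent memory store. The class and method names are illustrative assumptions for exposition, not an API defined in the surveyed paper.

```python
from dataclasses import dataclass, field
from typing import Any, List


@dataclass
class Memory:
    """Structured long-term store the agent can query for past experience."""
    episodes: List[dict] = field(default_factory=list)

    def recall(self, k: int = 3) -> List[dict]:
        return self.episodes[-k:]           # naive recency-based retrieval

    def store(self, episode: dict) -> None:
        self.episodes.append(episode)


class AgentAI:
    """Minimal perceive -> cognize -> act -> learn loop with memory."""

    def __init__(self, policy):
        self.policy = policy                 # stand-in for an LLM/VLM-backed planner
        self.memory = Memory()

    def perceive(self, environment) -> dict:
        return environment.observe()         # multimodal observation (text, image, state)

    def cognize(self, observation: dict) -> Any:
        context = {"obs": observation, "past": self.memory.recall()}
        return self.policy(context)          # reasoning / planning step

    def act(self, environment, decision: Any) -> Any:
        return environment.step(decision)    # environment feedback (reward, new state)

    def learn(self, observation, decision, feedback) -> None:
        self.memory.store({"obs": observation, "action": decision, "feedback": feedback})

    def run(self, environment, steps: int = 10) -> None:
        for _ in range(steps):
            obs = self.perceive(observation := None) if False else self.perceive(environment)
            decision = self.cognize(obs)
            feedback = self.act(environment, decision)
            self.learn(obs, decision, feedback)
```

The `policy` and `environment` objects are placeholders; in practice the cognition step would call an LLM/VLM and the action step would drive a game engine, robot controller, or GUI.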
Fei-Fei Li's Answer: After Large Models, Where Do Agents Go?
创业邦· 2025-09-05 11:12
Core Insights
- The article discusses a significant paper led by Fei-Fei Li that establishes a clear framework for the emerging field of Agent AI, outlining its capabilities and potential applications [5][6][9].
- The paper presents a comprehensive cognitive architecture for Agent AI, consisting of five core modules: Environment and Perception, Cognition, Action, Learning, and Memory, which together form a dynamic and iterative closed-loop system [11][12][18].

Summary by Sections

Agent AI Framework
- The new Agent AI paradigm is not merely a combination of existing technologies but a forward-looking approach to the development of Artificial General Intelligence (AGI) [12].
- The framework integrates various technological strands, including dialogue models, visual-language models, and reinforcement learning, into a unified perspective on multimodal agents [9][12].

Core Modules of Agent AI
- **Environment and Perception**: Allows agents to actively perceive information from the physical or virtual world, incorporating task planning and skill observation [13].
- **Cognition**: The processing center of the agent, using large language models (LLMs) and visual-language models (VLMs) to interpret sensory information and develop strategies [14].
- **Action**: Generates specific operational commands based on cognitive decisions, enabling interaction with both physical and virtual environments [15].
- **Learning**: Emphasizes the agent's ability to continuously learn and evolve through mechanisms such as reinforcement learning and imitation learning [16].
- **Memory**: Unlike traditional models, provides a structured and persistent memory system that allows agents to leverage past experiences for future tasks [17][18].

Role of Large Models
- Large foundation models, particularly LLMs and VLMs, serve as the cognitive backbone of Agent AI, enabling agents to perform complex tasks with minimal predefined rules [20].
- The paper highlights the challenge of "hallucination," where models generate inaccurate content, and proposes environmental interaction as a way to mitigate the issue [21].

Ethical and Regulatory Considerations
- The article stresses the importance of inclusivity and ethical considerations in the design of Agent AI, advocating for diverse training data and bias detection mechanisms [22].
- It also addresses the need for clear regulations and frameworks to ensure data privacy and security, especially in sensitive applications [22].

Application Potential
- **Gaming**: Agent AI can revolutionize non-player character (NPC) behavior, allowing for dynamic interactions and personalized experiences in gaming environments [25][26].
- **Robotics**: Agents can autonomously plan and execute complex physical tasks based on natural language commands, enhancing user interaction with robots [28].
- **Healthcare**: Agent AI can assist in preliminary medical consultations and patient monitoring, significantly improving healthcare delivery, especially in resource-limited settings [30][32].

Future Directions
- The article acknowledges that Agent AI is still in its early stages and faces challenges in achieving deep integration across modalities and domains [33].
- It emphasizes the need for standardized evaluation metrics to assess agent intelligence and guide future research [33].
Fei-Fei Li's Answer: After Large Models, Where Do Agents Go?
Hu Xiu· 2025-09-05 00:34
Core Insights
- The article discusses the rising prominence of Agent AI, with 2025 viewed as a pivotal year for the technology [1][2].
- A significant paper led by Fei-Fei Li titled "Agent AI: Surveying the Horizons of Multimodal Interaction" has sparked extensive discussion in the industry [3][6].

Summary by Sections

Overview of the Paper
- The 80-page paper provides a clear framework for the somewhat chaotic field of Agent AI, integrating various technological strands into a new multimodal perspective [5][6].
- It emphasizes the evolution from large models to agents, reflecting the current strategies of major players such as Google, OpenAI, and Microsoft [6].

New Paradigm of Agent AI
- The paper introduces a novel cognitive architecture for Agent AI, which is not merely a compilation of existing technologies but a forward-looking approach to the development of Artificial General Intelligence (AGI) [9].
- It defines five core modules: Environment and Perception, Cognition, Action, Learning, and Memory, which together form an interactive cognitive loop [10][26].

Core Modules Explained
- **Environment and Perception**: Agents actively perceive information from their surroundings in a multimodal manner, incorporating various data types [12][13].
- **Cognition**: Acts as the processing center for agents, enabling complex activities such as reasoning and empathy [15][16].
- **Action**: Converts cognitive decisions into specific operational commands that affect both physical and virtual environments [18][19].
- **Learning**: Highlights the continuous learning and self-evolution capabilities of agents through various mechanisms [20][21].
- **Memory**: Offers a structured system for long-term knowledge retention, allowing agents to leverage past experiences for new tasks [23][24].

Role of Large Models
- The framework's feasibility is attributed to the maturity of large foundation models, particularly LLMs and VLMs, which provide essential cognitive capabilities for agents [28][29].
- These models enable agents to decompose vague instructions into actionable tasks, significantly reducing the complexity of task programming [30][31].

Challenges and Ethical Considerations
- The paper identifies the issue of "hallucination," where models may generate inaccurate content, posing risks in real-world interactions [32][33].
- It emphasizes the need for inclusivity in designing Agent AI, addressing biases in training data and ensuring ethical interactions [36][39].
- The importance of establishing regulatory frameworks for data privacy and security in Agent AI applications is also highlighted [38][39].

Application Potential
- The paper explores the broad application potential of Agent AI in gaming, robotics, and healthcare [40].
- In gaming, Agent AI can create dynamic NPCs that interact meaningfully with players, enhancing immersion [42][43].
- In robotics, agents can autonomously execute complex tasks based on simple verbal commands, streamlining user interaction [48][49].
- In healthcare, Agent AI can assist in preliminary diagnostics and patient monitoring, improving efficiency in resource-limited settings [54][57].

Future Directions
- The paper acknowledges that Agent AI is still in its early stages, facing challenges in integrating multiple modalities and creating general-purpose agents [58][60].
- It proposes new evaluation benchmarks to measure agent intelligence and guide future research [61].
Multimodal Large Models Have an "Internal Early Warning": Jailbreak Attacks Can Be Detected Without Any Training
机器之心· 2025-07-21 08:43
Core Viewpoint
- The rise of multimodal large models (LVLMs) has led to significant advances in tasks such as image-text question answering and visual reasoning, but these models are more susceptible to "jailbreaking" attacks than pure text models [2][5].

Group 1: Multimodal Model Security Challenges
- LVLMs such as GPT-4V and LLaVA integrate images and text, which enhances their capabilities but also exposes them to security vulnerabilities [2].
- Existing methods for improving model safety, including cross-modal safety fine-tuning and external discriminator modules, face challenges such as high training costs and poor generalization [3].

Group 2: HiddenDetect Methodology
- Researchers from CUHK MMLab and Taotian Group introduced HiddenDetect, a novel jailbreak detection method that requires no training [5].
- The core finding is that LVLMs retain rejection signals in their hidden states even when they go on to generate inappropriate content, particularly in the intermediate layers [5][9].

Group 3: Analysis of Rejection Signals
- The study constructs a "refusal semantic vector" (RV) from tokens that frequently occur in refusals, allowing the strength of the rejection signal to be measured across model layers (see the sketch after this summary) [9].
- Experimental results show significant differences in rejection signal strength between safe and unsafe inputs, with the intermediate layers being the most sensitive to safety concerns [9][10].

Group 4: Input Type Sensitivity
- The analysis reveals that different input modalities activate distinct safety pathways, with text inputs triggering rejection signals earlier than image-text inputs [17][19].
- The presence of the visual modality can delay the model's rejection response, weakening its safety mechanisms [19].

Group 5: Experimental Results and Effectiveness
- HiddenDetect was evaluated across multiple mainstream LVLMs, demonstrating robust performance against various attack types while maintaining good generalization [23].
- The proposed approach achieved high detection effectiveness, outperforming existing methods in robustness and generalization [24].

Group 6: Future Directions
- The research emphasizes the importance of safety when deploying large models in real-world applications and aims to extend the detection method while exploring the relationship between modality information and model safety [28].
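A minimal sketch of the core idea, under stated assumptions: given per-layer hidden states for the last input position, project each layer's state through the output (unembedding) head and sum the probability mass assigned to a small set of refusal-related tokens; the resulting per-layer curve is the "rejection signal", and its aggregate over intermediate layers can be thresholded as a jailbreak score. The token choices, layer range, and normalization here are illustrative, not the published HiddenDetect implementation.

```python
import torch
import torch.nn.functional as F


def refusal_signal_per_layer(hidden_states, lm_head_weight, refusal_token_ids, final_norm=None):
    """Project each layer's last-position hidden state into vocabulary space and
    measure the probability mass on refusal-related tokens.

    hidden_states:     list/tuple of [batch, seq_len, d_model] tensors, one per layer
    lm_head_weight:    [vocab_size, d_model] unembedding matrix
    refusal_token_ids: ids of tokens such as "sorry", "cannot" (illustrative choice)
    final_norm:        optional final LayerNorm/RMSNorm applied before projection
    """
    scores = []
    for h in hidden_states:
        last = h[:, -1, :]                         # last input position
        if final_norm is not None:
            last = final_norm(last)
        logits = last @ lm_head_weight.T           # logit-lens style projection
        probs = F.softmax(logits.float(), dim=-1)
        scores.append(probs[:, refusal_token_ids].sum(dim=-1))   # [batch]
    return torch.stack(scores, dim=1)              # [batch, num_layers]


def jailbreak_score(layer_scores, layer_range=(8, 24)):
    """Aggregate the refusal signal over intermediate layers; unusually low values
    suggest the input has suppressed the model's internal refusal response."""
    lo, hi = layer_range
    return layer_scores[:, lo:hi].mean(dim=1)
```

With a HuggingFace-style model, the per-layer states can be obtained by running the forward pass with hidden-state outputs enabled and passing them to `refusal_signal_per_layer` together with the model's output head weights.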
Analysis of China's Multimodal Large Model Industry in 2025: Market Size, Value Chain, and Competitive Landscape, with Applications Expected to Become More Diverse, Deeper, and Broader [Chart]
Chan Ye Xin Xi Wang· 2025-05-29 01:47
Core Insights
- The multimodal large model market in China is projected to reach 15.63 billion yuan in 2024, an increase of 6.54 billion yuan over 2023, and is expected to grow to 23.48 billion yuan in 2025, indicating strong market demand and government support [1][6][19].

Multimodal Large Model Industry Definition and Classification
- Multimodal large models are AI systems capable of processing and understanding various data forms, including text, images, audio, and video, using deep learning technologies such as the Transformer architecture [2][4].

Industry Development History
- The industry has evolved through several stages: a task-oriented phase, a visual-language pre-training phase, and the current multimodal large model phase, which focuses on enhancing cross-modal understanding and generation capabilities [4].

Current Industry Status
- The industry has gained significant attention due to its data processing capabilities and diverse applications, with the market size projected to grow substantially in the coming years [6][19].

Application Scenarios
- The largest application share of multimodal large models is in the digital human sector at 24%, followed by gaming and advertising at 13% each, and smart marketing and social media at 10% each [8].

Industry Value Chain
- The value chain consists of upstream components such as AI chips and hardware, midstream multimodal large models, and downstream applications across sectors including education, gaming, and public services [10][12].

Competitive Landscape
- Major players include institutions and companies such as the Chinese Academy of Sciences, Huawei, Baidu, Tencent, and Alibaba, with various models being developed to optimize training costs and enhance capabilities [16][17].

Future Development Trends
- The industry is expected to become more intelligent and humanized, providing richer and more personalized user experiences, with applications expanding across fields such as finance, education, and content creation [19].
Surpassing YOLOv3 with a Multimodal LLM! Reinforcement Learning Pushes the Limits of Multimodal Perception | Open Source
量子位· 2025-05-03 04:05
Core Insights
- The article introduces Perception-R1 (PR1), a multimodal large language model (MLLM) framework that surpasses earlier detectors such as YOLOv3 and Faster-RCNN, achieving over 30 AP on the COCO2017 validation set [1][16].

Group 1: Introduction of Perception-R1
- Perception-R1 is developed by research teams from Huazhong University of Science and Technology, Beijing University of Posts and Telecommunications, and others, focusing on enhancing visual reasoning through rule-based reinforcement learning (RL) [1][5].
- The model aims to improve capabilities in pure visual tasks such as counting and general object detection, as well as visual-language tasks like grounding and OCR [1][4].

Group 2: Importance of Visual Perception in AI
- The article emphasizes the need for a revolution in AI visual perception, highlighting rapid advances in AI's ability to understand visual information, which is crucial for applications ranging from autonomous driving to medical diagnostics [3][4].
- It points out the subtle difference between recognizing objects and understanding their interactions in detail, noting that current MLLMs often struggle with complex visual reasoning tasks [4].

Group 3: Role of Reinforcement Learning
- The rise of reinforcement learning, particularly techniques such as RLHF (Reinforcement Learning from Human Feedback) and rule-based RL, is noted as a transformative factor for language models and motivated the development of Perception-R1 [5][6].
- The article raises the question of whether RL can similarly enhance MLLMs' visual perception capabilities, noting that early attempts have shown promise but not universal success [6].

Group 4: Perception-R1 Framework
- Perception-R1 is not a new MLLM built from scratch but a post-training framework designed to significantly enhance the visual perception abilities of existing capable MLLMs [7].
- It employs Group Relative Policy Optimization (GRPO) to optimize the perception policy, which is crucial for improving visual task performance [9].

Group 5: Reward Engineering
- The article discusses the importance of reward modeling in reinforcement learning, where the reward function guides learning by quantifying the model's performance on visual tasks (a rule-based reward sketch follows this summary) [11].
- Perception-R1's reward structure covers extracting relevant visual details, executing logical operations based on visual understanding, and generating outputs in the correct format [11][17].

Group 6: Experimental Results
- Perception-R1 is evaluated against strong baselines and specialized models, demonstrating significant improvements in visual counting and object detection tasks [16][19].
- For instance, in visual counting, Perception-R1 achieved 78.1 on Pixmo-Count, outperforming other models [19].

Group 7: Scalability and Future Implications
- The article concludes that Perception-R1 lays a critical foundation for future advances in intelligent AI visual perception, suggesting that its principles could play a key role in next-generation perceptual AI systems [24][25].
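To make the rule-based reward idea concrete, here is a minimal Python sketch of a detection-style reward: parse the model's textual answer into boxes, greedily match them to ground-truth boxes by IoU, and combine a perception reward with a small format reward. The answer format, thresholds, and weights are illustrative assumptions, not the exact Perception-R1 reward functions.

```python
import re


def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)


def parse_boxes(text):
    """Extract boxes written as [x1, y1, x2, y2] from the model's answer."""
    pattern = r"\[\s*([\d.]+)\s*,\s*([\d.]+)\s*,\s*([\d.]+)\s*,\s*([\d.]+)\s*\]"
    return [tuple(float(v) for v in m) for m in re.findall(pattern, text)]


def detection_reward(answer_text, gt_boxes, iou_thresh=0.5, format_weight=0.2):
    """Rule-based reward: greedy IoU matching plus a format bonus."""
    pred_boxes = parse_boxes(answer_text)
    format_reward = 1.0 if pred_boxes else 0.0   # answer contains well-formed boxes

    matched, used = 0, set()
    for p in pred_boxes:
        best_j, best_iou = None, 0.0
        for j, g in enumerate(gt_boxes):
            if j in used:
                continue
            v = iou(p, g)
            if v > best_iou:
                best_j, best_iou = j, v
        if best_j is not None and best_iou >= iou_thresh:
            used.add(best_j)
            matched += 1

    denom = max(len(gt_boxes), len(pred_boxes), 1)
    perception_reward = matched / denom          # penalizes misses and hallucinated boxes
    return (1 - format_weight) * perception_reward + format_weight * format_reward


if __name__ == "__main__":
    gt = [(10, 10, 50, 50), (60, 60, 100, 100)]
    print(detection_reward("Boxes: [12, 11, 49, 52] and [61, 59, 99, 101]", gt))
```

In a GRPO-style setup, this scalar reward would be computed for each sampled response in a group and the group-relative advantages would drive the policy update.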
10x Throughput Gain with No Loss in Performance: A Plug-and-Play KV Cache Quantization Strategy for Multimodal Models, No Changes to the Original Model Required
量子位· 2025-04-03 02:12
Contributed by the CalibQuant team to 量子位 | 公众号 QbitAI

A 10x throughput improvement is achieved on InternVL-2.5 with almost no loss in model performance. CalibQuant, the latest 1-bit KV cache quantization scheme for multimodal large models, is here. By combining post-scaling and calibration, it significantly reduces memory and compute cost and can be used directly without modifying the original model.

Plug-and-play, seamless integration

Multimodal large language models have shown excellent performance across a wide range of applications, but their computational overhead during deployment remains a key bottleneck. Although the KV cache trades memory for compute and improves inference efficiency to some extent, its memory footprint keeps growing as the cache grows, severely limiting throughput.

To address this challenge, the authors propose CalibQuant, a simple yet efficient visual KV cache quantization strategy that substantially reduces memory and compute overhead. Specifically, CalibQuant introduces an extreme 1-bit quantization scheme, with post-scaling and calibration techniques designed around the intrinsic patterns of the visual KV cache, preserving model performance while remaining efficient (see the sketch after this entry).

Using Triton for runtime optimization, the authors achieve a 10x throughput improvement on the InternVL-2.5 model. The method is plug-and-play and can be seamlessly integrated into various existing mult ...
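As a rough illustration of 1-bit KV cache quantization with calibration and post-scaling, the sketch below quantizes each channel of the cached keys/values to a single bit against calibrated per-channel clipping ranges, then dequantizes (post-scales) when the cache is consumed. This is a generic sketch under stated assumptions, not the CalibQuant algorithm or its Triton kernels.

```python
import torch


def calibrate(x, pct=0.01):
    """Per-channel clipping range estimated from a calibration batch.

    x: [tokens, channels]; returns (lo, hi) per channel. Percentile clipping
    guards against outliers dominating the range (an illustrative choice).
    """
    lo = torch.quantile(x, pct, dim=0)
    hi = torch.quantile(x, 1.0 - pct, dim=0)
    return lo, hi


def quantize_1bit(x, lo, hi):
    """Map each value to {0, 1} relative to its channel midpoint."""
    mid = (lo + hi) / 2
    return (x > mid).to(torch.uint8)          # 1 bit of information per element


def dequantize_1bit(bits, lo, hi):
    """Post-scaling: reconstruct values at the channel's low/high levels."""
    return torch.where(bits.bool(), hi, lo)


if __name__ == "__main__":
    kv = torch.randn(1024, 128)                # cached keys or values: [tokens, channels]
    lo, hi = calibrate(kv)
    bits = quantize_1bit(kv, lo, hi)           # ~32x smaller payload than fp32 in principle
    kv_hat = dequantize_1bit(bits, lo, hi)
    print((kv - kv_hat).abs().mean())          # reconstruction error of the 1-bit cache
```

In a real serving stack, the packed bits plus per-channel scales would replace the full-precision visual KV cache, and dequantization would be fused into the attention kernel.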