From LLaVA to Qwen3-VL: Deconstructing the Evolution of Multimodal Large Models
自动驾驶之心· 2025-12-09 00:03
Core Viewpoint
- The article traces the evolution of AI from text-only models to multimodal large models (MLLMs) capable of perceiving and interacting with the physical world through vision and language [1][2].

Group 1: MLLM Architecture
- The MLLM architecture is described as a "trinity" system consisting of three core components: the visual encoder (a Vision Transformer), the language model (LLM), and the connector [3][5].
- The visual encoder transforms images into mathematical representations, enabling the AI to "see" and understand visual information [5][17].
- The LLM serves as the cognitive center, integrating visual features with textual instructions to generate coherent responses [17][20].
- The connector acts as a bridge, translating visual features into a format the LLM can understand and enabling seamless communication between the two modalities [32][33].

Group 2: Vision Transformer (ViT)
- ViT treats images as sequences of patches, allowing the model to apply the transformer architecture to visual understanding [7][9].
- The pipeline involves several steps: image patching, flattening and linear projection, adding positional information, and processing through a transformer encoder [9][10][15].
- Encoding spatial relationships with rotary position embedding strengthens ViT's grasp of image context [13][14].

Group 3: Language Model (LLM)
- The LLM processes a combined sequence of visual and textual tokens, giving it richer context for generating responses [20][31].
- It employs multi-head attention to capture relationships between visual and textual tokens, improving its handling of complex queries [19][24].
- LLM architectures are evolving toward mixture-of-experts (MoE) designs, which activate only a subset of parameters during inference for more efficient processing [28][31].

Group 4: Connector Mechanism
- The connector aligns the visual and textual modalities so the LLM can effectively interpret the visual features produced by the ViT [32][34].
- Two main design philosophies exist: the minimalist approach exemplified by LLaVA, which relies on a simple linear transformation, and the more sophisticated Q-Former, which actively extracts key information from visual features [36][38].
- Q-Former uses learnable queries and cross-attention to distill essential information from the visual input, reducing the cognitive load on the LLM [42][45].

Group 5: Challenges and Future Directions
- A central challenge is processing high-resolution images without overwhelming the LLM's computational capacity, which has driven exploration of different architectural solutions [54].
- Efficiently handling complex visual data while maintaining performance remains a key focus for future MLLM development [54].
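The ViT pipeline described above (patchify, flatten, linearly project, add positional information) can be sketched in a few lines of plain Python. All sizes, weights, and the learned positional table here are toy stand-ins for illustration, not any real model's values.

```python
import random

def patchify(image, patch):
    """Split an H x W single-channel image (list of lists) into
    non-overlapping patch x patch squares, each flattened to a vector."""
    H, W = len(image), len(image[0])
    patches = []
    for r in range(0, H, patch):
        for c in range(0, W, patch):
            patches.append([image[r + i][c + j]
                            for i in range(patch) for j in range(patch)])
    return patches

def linear_project(vec, weights):
    """Toy linear projection: weights is d_model rows of len(vec) each."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]

# Toy example: an 8x8 image, 4x4 patches -> 4 patch tokens.
random.seed(0)
img = [[random.random() for _ in range(8)] for _ in range(8)]
patches = patchify(img, 4)  # 4 patches, each a flat vector of 16 pixels
d_model = 6
W = [[random.gauss(0, 0.1) for _ in range(16)] for _ in range(d_model)]
tokens = [linear_project(p, W) for p in patches]
# Add a (toy) learned positional embedding so spatial order survives.
pos = [[random.gauss(0, 0.01) for _ in range(d_model)] for _ in range(4)]
tokens = [[t + p for t, p in zip(tok, pe)] for tok, pe in zip(tokens, pos)]

print(len(patches), len(patches[0]))  # 4 16
print(len(tokens), len(tokens[0]))    # 4 6
```

The resulting token sequence is what a transformer encoder would then process, exactly as it would a sentence of word embeddings.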
From LLaVA to Qwen3-VL: Deconstructing the Evolution of Multimodal Large Models
自动驾驶之心· 2025-12-08 00:02
Core Insights
- The article traces the evolution of AI from text-only models to multimodal large models (MLLMs) capable of perceiving and interacting with the physical world through vision and language [1][2].

Group 1: MLLM Architecture
- The MLLM architecture is described as a "trinity" system of three core components: the visual encoder (Vision Transformer), the language model (LLM), and the connector [3][5].
- The visual encoder transforms pixel data from images or videos into mathematical representations the model can understand [5][6].
- The LLM serves as the cognitive center, integrating visual features with textual instructions to generate coherent responses [17][20].
- The connector acts as a bridge, translating visual information into a format the LLM can process and ensuring seamless communication between the two modalities [32][33].

Group 2: Vision Transformer (ViT)
- ViT treats images as sequences of patches, allowing the model to process visual data much as it processes text [7][9].
- The pipeline involves several steps: image patching, flattening and linear projection, adding positional information, and processing through a transformer encoder [10][15].
- ViT's understanding of spatial relationships is enhanced by techniques like Rotary Position Embedding (RoPE), which encodes positional information dynamically [13][14].

Group 3: Language Model (LLM)
- The LLM generates responses from the integrated multimodal input, using mechanisms such as multi-head attention and feed-forward networks [19][24].
- Its input is a combined sequence of visual and textual tokens, allowing a richer understanding of context [20][22].
- The LLM uses autoregressive generation to predict the next token from the entire context, iteratively building up its response [24][25].

Group 4: Connector Mechanism
- The connector aligns the visual features from the ViT with the LLM's representation space, bridging the modality gap [32][34].
- Two main design philosophies exist: LLaVA's minimalist approach, a simple linear transformation, and the more complex Q-Former, which actively extracts key information from visual data [36][38].
- Q-Former uses learnable queries and cross-attention to distill essential information from the visual input before it reaches the LLM, improving efficiency and reducing computational load [42][45].

Group 5: Challenges and Future Directions
- A central challenge is processing high-resolution visual information without overwhelming the LLM's computational capacity, which has driven exploration of different architectural approaches [54].
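The RoPE technique mentioned above can be sketched minimally: each pair of dimensions is rotated by a position-dependent angle, so the attention dot product ends up depending only on the relative offset between tokens. The vectors and dimensions below are purely illustrative.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotary Position Embedding sketch: rotate each consecutive pair
    (x_2i, x_2i+1) by the angle pos * base^(-2i/d)."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * (base ** (-i / d))
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [1.0, 0.0, 0.5, 0.5]
k = [0.5, 1.0, 0.0, 1.0]
# The score depends only on the relative offset between positions:
s1 = dot(rope(q, 7), rope(k, 3))   # positions 7 and 3, offset 4
s2 = dot(rope(q, 12), rope(k, 8))  # positions 12 and 8, offset 4 again
print(abs(s1 - s2) < 1e-9)
```

This relative-position property is what makes RoPE attractive for variable-length visual token sequences: the encoding adapts to position without a fixed-size learned table.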
From LLaVA to Qwen3-VL: The Evolution of Mainstream Multimodal Large Model Architectures
自动驾驶之心· 2025-12-03 00:04
Core Insights
- The article traces the evolution of AI from text-only models to multimodal large models (MLLMs) capable of perceiving and interacting with the physical world through vision and language [3][4].
- It highlights two successful technical evolution paths: the LLaVA series, which emphasizes simplicity, and Qwen3-VL, which focuses on deep integration [3][4].

Group 1: MLLM Architecture
- MLLMs follow a "trinity" architecture consisting of a visual encoder (Vision Transformer), a language model (LLM), and a connector that mediates between the two [6][10].
- The visual encoder transforms images into mathematical representations, while the LLM processes these representations to generate coherent text responses [10][22].
- The connector acts as a bridge, translating visual features into a format the LLM can understand and ensuring seamless integration of visual and textual information [36][37].

Group 2: Vision Transformer (ViT)
- ViT treats images as sequences of patches, allowing the model to apply the transformer architecture to visual understanding [11][13].
- The pipeline segments images into patches, flattens them into vectors, and adds positional information to preserve spatial context [13][16].
- ViT's multi-head attention lets the model capture relationships between distant elements of an image, improving its understanding of complex visual scenes [21][22].

Group 3: Language Model (LLM)
- The LLM is the cognitive core of the MLLM, integrating visual and textual information to generate contextually relevant responses [22][23].
- Its input is a combined sequence of visual and language tokens, enabling a comprehensive understanding of context [24][25].
- The LLM uses autoregressive generation to predict the next token from the entire context, producing coherent, contextually appropriate outputs [26][30].

Group 4: Connector Design
- Connector design is crucial for bridging the visual and textual modalities, with two main approaches: LLaVA's minimalist connector and the more complex Q-Former used in BLIP-2 [38][40].
- LLaVA's connector is a simple linear transformation that relies on the LLM's capacity to learn the mapping between modalities [40][41].
- Q-Former, by contrast, actively extracts and refines key information from visual features before passing them to the LLM, improving efficiency and reducing computational load [42][53].

Group 5: Challenges and Solutions
- The article addresses the challenge of processing high-resolution images without overwhelming the model's computational capacity, which has led to different design philosophies [64].
- LLaVA's AnyRes solution handles images of arbitrary resolution through preprocessing techniques rather than restructuring the model [65].
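The Q-Former contrast above can be illustrated with a single-head cross-attention sketch: a handful of learnable queries attend over many visual features and return a small, fixed number of tokens for the LLM. This is a toy reduction; the real Q-Former in BLIP-2 also stacks self-attention and feed-forward layers, which are omitted here.

```python
import math, random

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def cross_attention(queries, features):
    """Each learnable query attends over all visual features and returns
    a weighted mix, distilling N features down to len(queries) tokens."""
    out = []
    for q in queries:
        scores = softmax([sum(a * b for a, b in zip(q, f)) / math.sqrt(len(q))
                          for f in features])
        out.append([sum(w * f[i] for w, f in zip(scores, features))
                    for i in range(len(features[0]))])
    return out

random.seed(0)
d = 8
visual_features = [[random.gauss(0, 1) for _ in range(d)] for _ in range(64)]
learnable_queries = [[random.gauss(0, 1) for _ in range(d)] for _ in range(4)]

distilled = cross_attention(learnable_queries, visual_features)
print(len(visual_features), "->", len(distilled))  # 64 -> 4 tokens for the LLM
```

The design trade-off is visible even in this sketch: LLaVA's linear connector keeps all 64 tokens and lets the LLM sort them out, while the Q-Former route hands the LLM only 4 distilled tokens at the cost of extra connector parameters.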
Fei-Fei Li's Answer: After Large Models, Where Do Agents Go Next?
虎嗅APP· 2025-09-07 02:51
Core Viewpoint
- The article discusses the emergence of Agent AI, highlighting its potential to revolutionize various fields through a new cognitive architecture that integrates perception, cognition, action, learning, and memory [4][9][10].

Summary by Sections

Introduction to Agent AI
- 2025 is anticipated to be the year of Agent AI, with increasing interest in concepts like AI Agents and Agentic AI [4].
- A significant paper led by Fei-Fei Li titled "Agent AI: Surveying the Horizons of Multimodal Interaction" has sparked widespread discussion in the industry [4][6].

Framework of Agent AI
- The paper establishes a clear framework for Agent AI, integrating various technologies into a unified perspective [6][7].
- It outlines five core modules: Environment and Perception, Cognition, Action, Learning, and Memory, which together form a dynamic cognitive loop [10][12][14][16][17].

Core Modules Explained
- **Environment and Perception**: Agents actively perceive information from their surroundings, incorporating task planning and skill observation [12].
- **Cognition**: Acts as the processing center, utilizing large language models (LLMs) and visual language models (VLMs) for reasoning and strategy formulation [14].
- **Action**: Converts cognitive decisions into executable commands, affecting the environment [15].
- **Learning**: Emphasizes continuous learning through various mechanisms, allowing agents to adapt based on feedback [16].
- **Memory**: Features a structured system for long-term knowledge retention, enabling agents to leverage past experiences [17].

Role of Large Models
- The development of Agent AI is driven by the maturity of foundation models, particularly LLMs and VLMs, which provide agents with extensive knowledge and planning capabilities [20].
- The paper addresses the challenge of "hallucination" in models, emphasizing the importance of environmental interaction to mitigate this issue [21][22].

Application Potential
- The paper explores Agent AI's applications in three key areas:
- **Gaming**: Agent AI can create dynamic NPCs that interact meaningfully with players, enhancing immersion [24][25].
- **Robotics**: Robots can execute complex tasks based on natural language commands, improving user interaction [27].
- **Healthcare**: Agent AI can assist in preliminary diagnostics and patient monitoring, increasing efficiency in healthcare delivery [29][31].

Conclusion
- The paper recognizes that Agent AI is still in its early stages, facing challenges in integrating multiple modalities and creating general agents for diverse applications [32].
- It proposes new evaluation benchmarks to guide the development and measure progress in the field [32].
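The five-module loop described above can be sketched as a toy class whose cycle runs perceive, cognize, act, learn, with memory feeding back into the next iteration. The method bodies are placeholders invented for illustration, not the paper's algorithms.

```python
class Agent:
    """Toy sketch of the five-module cognitive loop:
    perceive -> cognize -> act, with learning writing to memory."""

    def __init__(self):
        self.memory = []  # structured long-term store (here: a simple log)

    def perceive(self, environment):
        return {"observation": environment["state"]}

    def cognize(self, percept):
        # Stand-in for LLM/VLM reasoning: derive an action from the
        # observation (a real agent would also consult self.memory).
        return {"action": "handle:" + percept["observation"]}

    def act(self, decision, environment):
        environment["state"] = "after-" + decision["action"]
        return {"feedback": "ok"}

    def learn(self, decision, result):
        self.memory.append((decision["action"], result["feedback"]))

    def run(self, environment, cycles=3):
        for _ in range(cycles):
            percept = self.perceive(environment)
            decision = self.cognize(percept)
            result = self.act(decision, environment)
            self.learn(decision, result)
        return environment["state"], len(self.memory)

env = {"state": "start"}
state, experiences = Agent().run(env)
print(state, experiences)
```

Even in this skeleton, the closed-loop character is visible: each action changes the environment that the next perception reads, and memory accumulates across cycles.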
Fei-Fei Li's Answer: After Large Models, Where Do Agents Go Next?
创业邦· 2025-09-05 11:12
Core Insights
- The article discusses a significant paper led by Fei-Fei Li that establishes a clear framework for the emerging field of Agent AI, outlining its capabilities and potential applications [5][6][9].
- The paper presents a comprehensive cognitive architecture for Agent AI, consisting of five core modules (Environment and Perception, Cognition, Action, Learning, and Memory) that together form a dynamic, iterative closed-loop system [11][12][18].

Summary by Sections

Agent AI Framework
- The new Agent AI paradigm is not merely a combination of existing technologies but represents a forward-thinking approach to the development of Artificial General Intelligence (AGI) [12].
- The framework integrates various technological strands, including dialogue models, visual-language models, and reinforcement learning, into a unified perspective on multimodal agents [9][12].

Core Modules of Agent AI
- **Environment and Perception**: Allows agents to actively perceive information from the physical or virtual world, incorporating task planning and skill observation [13].
- **Cognition**: The processing center of the agent, utilizing large language models (LLMs) and visual-language models (VLMs) to interpret sensory information and develop strategies [14].
- **Action**: Generates specific operational commands based on cognitive decisions, enabling interaction with both physical and virtual environments [15].
- **Learning**: Emphasizes the agent's ability to continuously learn and evolve through various mechanisms, including reinforcement learning and imitation learning [16].
- **Memory**: Unlike traditional models, provides a structured and persistent memory system that allows agents to leverage past experiences for future tasks [17][18].

Role of Large Models
- Large foundational models, particularly LLMs and VLMs, serve as the cognitive backbone of Agent AI, enabling agents to perform complex tasks with minimal predefined rules [20].
- The paper highlights the challenge of "hallucination," where models generate inaccurate content, and proposes environmental interaction as a solution to mitigate this issue [21].

Ethical and Regulatory Considerations
- The article stresses the importance of inclusivity and ethical considerations in the design of Agent AI, advocating for diverse training data and bias detection mechanisms [22].
- It also addresses the need for clear regulations and frameworks to ensure data privacy and security, especially in sensitive applications [22].

Application Potential
- **Gaming**: Agent AI can revolutionize non-player character (NPC) behavior, allowing for dynamic interactions and personalized experiences in gaming environments [25][26].
- **Robotics**: Agents can autonomously plan and execute complex physical tasks based on natural language commands, enhancing user interaction with robots [28].
- **Healthcare**: Agent AI can assist in preliminary medical consultations and patient monitoring, significantly improving healthcare delivery, especially in resource-limited settings [30][32].

Future Directions
- The article acknowledges that Agent AI is still in its early stages and faces challenges in achieving deep integration across various modalities and domains [33].
- It emphasizes the need for standardized evaluation metrics to assess agent intelligence and guide future research [33].
Fei-Fei Li's Answer: After Large Models, Where Do Agents Go Next?
Hu Xiu· 2025-09-05 00:34
Core Insights
- The article discusses the rising prominence of Agent AI, with 2025 being viewed as a pivotal year for this technology [1][2].
- A significant paper led by Fei-Fei Li titled "Agent AI: Surveying the Horizons of Multimodal Interaction" has sparked extensive discussion in the industry [3][6].

Summary by Sections

Overview of the Paper
- The 80-page paper provides a clear framework for the somewhat chaotic field of Agent AI, integrating various technological strands into a new multimodal perspective [5][6].
- It emphasizes the evolution from large models to agents, reflecting the current strategies of major players like Google, OpenAI, and Microsoft [6].

New Paradigm of Agent AI
- The paper introduces a novel cognitive architecture for Agent AI, which is not merely a compilation of existing technologies but a forward-thinking approach to the development of Artificial General Intelligence (AGI) [9].
- It defines five core modules: Environment and Perception, Cognition, Action, Learning, and Memory, which together form an interactive cognitive loop [10][26].

Core Modules Explained
- **Environment and Perception**: Agents actively perceive information from their surroundings in a multimodal manner, incorporating various data types [12][13].
- **Cognition**: Acts as the processing center for agents, enabling complex activities such as reasoning and empathy [15][16].
- **Action**: Converts cognitive decisions into specific operational commands, affecting both physical and virtual environments [18][19].
- **Learning**: Highlights the continuous learning and self-evolution capabilities of agents through various mechanisms [20][21].
- **Memory**: Offers a structured system for long-term knowledge retention, allowing agents to leverage past experiences for new tasks [23][24].

Role of Large Models
- The framework's feasibility is attributed to the maturity of large foundational models, particularly LLMs and VLMs, which provide essential cognitive capabilities for agents [28][29].
- These models enable agents to decompose vague instructions into actionable tasks, significantly reducing the complexity of task programming [30][31].

Challenges and Ethical Considerations
- The paper identifies the issue of "hallucination" in models, where they may generate inaccurate content, posing risks in real-world interactions [32][33].
- It emphasizes the need for inclusivity in designing Agent AI, addressing biases in training data and ensuring ethical interactions [36][39].
- The importance of establishing regulatory frameworks for data privacy and security in Agent AI applications is also highlighted [38][39].

Application Potential
- The paper explores the vast application potential of Agent AI in gaming, robotics, and healthcare [40].
- In gaming, Agent AI can create dynamic NPCs that interact meaningfully with players, enhancing immersion [42][43].
- In robotics, agents can autonomously execute complex tasks based on simple verbal commands, streamlining user interaction [48][49].
- In healthcare, Agent AI can assist in preliminary diagnostics and patient monitoring, improving efficiency in resource-limited settings [54][57].

Future Directions
- The paper acknowledges that Agent AI is still in its early stages, facing challenges in integrating multiple modalities and creating general-purpose agents [58][60].
- It proposes new evaluation benchmarks to measure agent intelligence and guide future research [61].
Multimodal Large Models Have an "Internal Early Warning": Jailbreak Attacks Can Be Detected Without Training
机器之心· 2025-07-21 08:43
Core Viewpoint
- The rise of large vision-language models (LVLMs) has brought significant advances in tasks such as image-text question answering and visual reasoning, but these models are more susceptible to "jailbreak" attacks than pure text models [2][5].

Group 1: Multimodal Model Security Challenges
- LVLMs such as GPT-4V and LLaVA integrate images and text, which enhances their capabilities but also exposes security vulnerabilities [2].
- Existing defenses, including cross-modal safety fine-tuning and external discriminator modules, face challenges such as high training costs and poor generalization [3].

Group 2: HiddenDetect Methodology
- Researchers from CUHK MMLab and Taotian Group introduced HiddenDetect, a novel jailbreak detection method that requires no training [5].
- The core finding is that LVLMs retain rejection signals in their hidden states, particularly in intermediate layers, even when they go on to generate inappropriate content [5][9].

Group 3: Analysis of Rejection Signals
- The study constructs a "rejection semantic vector" (RV) from tokens that frequently signal refusal, allowing the strength of the rejection signal to be measured across model layers [9].
- Experimental results show significant differences in rejection signal strength between safe and unsafe inputs, with intermediate layers being the most sensitive to safety concerns [9][10].

Group 4: Input Type Sensitivity
- Different input modalities activate distinct safety pathways: text-only inputs trigger rejection signals more quickly than image-text inputs [17][19].
- The presence of a visual modality can delay the model's rejection response, weakening its safety mechanisms [19].

Group 5: Experimental Results and Effectiveness
- HiddenDetect was evaluated across multiple mainstream LVLMs, demonstrating robust performance against various attack types while maintaining good generalization [23].
- The method achieved high detection effectiveness, outperforming existing methods in robustness and generalization [24].

Group 6: Future Directions
- The research emphasizes the importance of safety when deploying large models in real-world applications, and aims to extend the detection method while exploring the relationship between modality information and model safety [28].
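The rejection-signal idea above can be sketched as follows: project each layer's hidden state onto a refusal direction and flag inputs whose intermediate layers score high. All vectors and the threshold below are invented for illustration; the actual RV is derived from the model's own refusal-related vocabulary, and the real method's scoring details may differ.

```python
import math

def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = (math.sqrt(sum(x * x for x in a)) *
           math.sqrt(sum(y * y for y in b)))
    return num / den

def refusal_strength(layer_states, refusal_vector):
    """Project each layer's hidden state onto the refusal direction."""
    return [cosine(h, refusal_vector) for h in layer_states]

def is_jailbreak(layer_states, rv, threshold=0.5):
    # Intermediate layers are the most safety-sensitive, so score those.
    mid = layer_states[1:]
    return max(refusal_strength(mid, rv)) > threshold

# Hypothetical refusal direction (e.g. a mean over embeddings of tokens
# like "sorry" / "cannot") and toy per-layer hidden states:
rv = [1.0, 0.0, 0.0]
safe_layers = [[0.1, 0.9, 0.2], [0.0, 1.0, 0.1], [0.2, 0.8, 0.3]]
unsafe_layers = [[0.4, 0.5, 0.1], [0.9, 0.2, 0.0], [0.8, 0.1, 0.2]]

print(is_jailbreak(safe_layers, rv), is_jailbreak(unsafe_layers, rv))
```

The key property the sketch captures is that detection reads the model's internal states rather than its output text, which is why no additional training is needed.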
2025 Analysis of China's Multimodal Large Model Industry: Market Size, Value Chain, Competitive Landscape, and Development Trends (Applications Expected to Grow More Diverse, Deeper, and Broader)
Chan Ye Xin Xi Wang· 2025-05-29 01:47
Core Insights
- The multi-modal large model market in China is projected to reach 15.63 billion yuan in 2024, an increase of 6.54 billion yuan from 2023, and is expected to grow to 23.48 billion yuan in 2025, indicating strong market demand and government support [1][6][19].

Multi-Modal Large Model Industry Definition and Classification
- Multi-modal large models are AI systems capable of processing and understanding various data forms, including text, images, audio, and video, using deep learning technologies like the Transformer architecture [2][4].

Industry Development History
- The industry has evolved through several stages: a task-oriented phase, a visual-language pre-training phase, and the current multi-modal large model phase, focusing on enhancing cross-modal understanding and generation capabilities [4].

Current Industry Status
- The industry has gained significant attention due to its data processing capabilities and diverse applications, with the market size projected to grow substantially in the coming years [6][19].

Application Scenarios
- The largest application share is the digital human sector at 24%, followed by gaming and advertising at 13% each, and smart marketing and social media at 10% each [8].

Industry Value Chain
- The value chain consists of upstream components like AI chips and hardware, midstream multi-modal large models, and downstream applications across various sectors including education, gaming, and public services [10][12].

Competitive Landscape
- Major players include institutions and companies like the Chinese Academy of Sciences, Huawei, Baidu, Tencent, and Alibaba, with various models being developed to optimize training costs and enhance capabilities [16][17].

Future Development Trends
- The industry is expected to become more intelligent and humanized, providing richer and more personalized user experiences, with applications expanding across fields such as finance, education, and content creation [19].
Surpassing YOLOv3 with a Multimodal LLM! Reinforcement Learning Pushes the Limits of Multimodal Perception | Open Source
量子位· 2025-05-03 04:05
Core Insights
- The article introduces Perception-R1 (PR1), a post-training framework for multimodal large language models (MLLMs) that surpasses earlier detectors such as YOLOv3 and Faster-RCNN, achieving over 30 AP on the COCO2017 validation set [1][16].

Group 1: Introduction of Perception-R1
- Perception-R1 was developed by research teams from Huazhong University of Science and Technology, Beijing University of Posts and Telecommunications, and others, focusing on enhancing visual reasoning through rule-based reinforcement learning (RL) [1][5].
- The model targets pure visual tasks such as counting and general object detection, as well as visual-language tasks like grounding and OCR [1][4].

Group 2: Importance of Visual Perception in AI
- The article argues for a revolution in AI visual perception, highlighting rapid advances in AI's ability to understand visual information, which is crucial for applications ranging from autonomous driving to medical diagnostics [3][4].
- It distinguishes between merely recognizing objects and understanding their interactions in detail, noting that current MLLMs often struggle with complex visual reasoning tasks [4].

Group 3: Role of Reinforcement Learning
- The rise of reinforcement learning, particularly RLHF (Reinforcement Learning from Human Feedback) and rule-based RL, has transformed language models and motivated the development of Perception-R1 [5][6].
- The article asks whether RL can similarly enhance MLLMs' visual perception; early attempts have shown promise but not universal success [6].

Group 4: Perception-R1 Framework
- Perception-R1 is not a new MLLM built from scratch but a post-training framework designed to significantly enhance the visual perception abilities of existing capable MLLMs [7].
- It employs Group Relative Policy Optimization (GRPO) to optimize the perception policy, which is crucial for improving visual task performance [9].

Group 5: Reward Engineering
- Reward modeling is central to the approach: the reward function guides learning by quantifying the model's performance on visual tasks [11].
- Perception-R1's reward structure covers extracting relevant visual details, executing logical operations grounded in visual understanding, and generating outputs in the correct format [11][17].

Group 6: Experimental Results
- Perception-R1 was evaluated against strong benchmarks and specialized models, demonstrating significant improvements in visual counting and object detection [16][19].
- For instance, in visual counting it achieved 78.1 on Pixmo-Count, outperforming the other models tested [19].

Group 7: Scalability and Future Implications
- Perception-R1 lays a critical foundation for future advances in intelligent AI visual perception, and its principles could play a key role in next-generation perceptual AI systems [24][25].
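A rule-based reward of the kind described in Group 5 can be sketched for detection as mean best-match IoU plus a format gate: correctly formatted predictions earn reward proportional to localization quality, malformed ones earn nothing. This is an illustrative reconstruction under those assumptions, not PR1's exact reward function.

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def detection_reward(predicted, ground_truth, well_formatted):
    """Toy rule-based reward: mean best-match IoU, gated on output format.
    (Hypothetical design; the paper's reward may differ in detail.)"""
    if not well_formatted:
        return 0.0
    matched = [max(iou(p, g) for g in ground_truth) for p in predicted]
    return sum(matched) / len(matched)

gt = [(0, 0, 10, 10), (20, 20, 30, 30)]
good = [(1, 1, 10, 10), (20, 20, 29, 30)]   # close to ground truth
bad = [(50, 50, 60, 60)]                    # misses everything
print(detection_reward(good, gt, True))   # high: boxes nearly match
print(detection_reward(bad, gt, True))    # 0.0: no overlap
print(detection_reward(good, gt, False))  # 0.0: malformed output
```

Because such a reward is computed by fixed rules rather than a learned reward model, it gives the RL loop a cheap, unambiguous training signal, which is the appeal of rule-based RL for perception tasks.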
10x Throughput Gain with No Performance Loss: A Plug-and-Play KV Cache Quantization Strategy for Multimodal Models, No Changes to the Original Model Required
量子位· 2025-04-03 02:12
Core Insights
- The article introduces CalibQuant, a 1-bit KV cache quantization method for multimodal large language models (LLMs) that significantly enhances throughput while maintaining model performance [1][5][18].

Group 1: Motivation and Challenges
- Current multimodal LLMs struggle with large, high-resolution image or video inputs: the KV cache grows in proportion to input length, inflating memory usage and limiting throughput [6].
- Existing quantization methods for LLM KV caches do not adequately exploit the visual redundancy unique to multimodal contexts, making them ineffective under extreme compression [6][7].

Group 2: Methodology
- CalibQuant employs a novel 1-bit quantization strategy that integrates post-scaling and calibration techniques to reduce memory and computational costs without altering the original model [3][5].
- Channel-wise quantization refines the statistical range used for quantization, preserving model performance better than global statistics [9][10].
- A post-scaling strategy optimizes the computation order during dequantization, enhancing efficiency and reducing storage needs [11][12].
- A calibration method adjusts attention scores before the softmax, mitigating the impact of extreme values produced by 1-bit quantization [13][14].

Group 3: Experimental Results
- The method was tested on LLaVA and InternVL models across various tasks, showing superior performance compared to existing methods like KIVI and VLCache, particularly on the captioning task [15][18].
- For instance, it achieved a CIDEr score of 1.109 at 1-bit quantization on the llava-1.5-7b model, surpassing VLCache's 1.053 [15].

Group 4: Runtime Analysis
- The 1-bit quantization method consistently outperformed the 16-bit baseline in throughput across different memory budgets, achieving up to 459.016 tokens per second compared to the baseline's 40.816 [17].
- This corresponds to a throughput improvement of approximately 9.88x to 11.24x, demonstrating the method's effectiveness under constrained memory [17].

Group 5: Conclusion
- CalibQuant effectively addresses the challenges of KV cache compression in multimodal LLMs, enhancing both computational efficiency and model performance [18].
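The channel-wise 1-bit idea described above can be sketched like this: per channel, keep only min/max statistics over the sequence plus one bit per value, reconstructing each entry as that channel's min or max. The calibration and post-scaling steps are omitted; this toy only shows the channel-wise statistics and the storage/accuracy trade, and is not CalibQuant's actual implementation.

```python
def quantize_1bit(kv, channels):
    """Channel-wise 1-bit quantization: per channel keep (min, max)
    statistics over the sequence and a single bit per value."""
    lo = [min(row[c] for row in kv) for c in range(channels)]
    hi = [max(row[c] for row in kv) for c in range(channels)]
    bits = [[1 if row[c] >= (lo[c] + hi[c]) / 2 else 0
             for c in range(channels)] for row in kv]
    return bits, lo, hi

def dequantize(bits, lo, hi):
    # With 1 bit, each entry comes back as the channel's min or max.
    return [[hi[c] if b else lo[c] for c, b in enumerate(row)]
            for row in bits]

# Toy cache: 4 cached token rows, 2 channels with very different ranges,
# which is exactly where per-channel statistics beat a single global range.
cache = [[0.10, -2.0], [0.90, -1.0], [0.15, -1.9], [0.85, -1.1]]
bits, lo, hi = quantize_1bit(cache, 2)
restored = dequantize(bits, lo, hi)
# Storage drops from one float to one bit per value (plus two floats per
# channel); the reconstruction error is bounded by half the channel range.
err = max(abs(a - b) for ra, rb in zip(cache, restored)
          for a, b in zip(ra, rb))
print(bits, round(err, 3))
```

With 1 bit the rounded reconstruction is crude, which is why the full method needs the calibration step on attention scores: without it, the snapped-to-extremes values distort the softmax.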