Multimodal Large Language Models
Fei-Fei Li's Long Essay Takes Silicon Valley by Storm
投资界· 2025-11-14 08:01
Core Insights
- The article emphasizes that spatial intelligence is the next frontier for AI, which can revolutionize creativity, robotics, scientific discovery, and more [6][10][14]
- It outlines the three core capabilities that a world model must possess: generative, multimodal, and interactive [4][18][19]
Group 1: Importance of Spatial Intelligence
- Spatial intelligence is foundational to human cognition and influences how individuals interact with the physical world [11][14]
- Historical examples illustrate how spatial intelligence has driven significant advancements in civilization, such as Eratosthenes' calculation of the Earth's circumference and Watson and Crick's discovery of DNA structure [12][13]
Group 2: Current Limitations of AI
- Current AI models, particularly large language models (LLMs), lack the spatial reasoning capabilities that humans possess, limiting their effectiveness in understanding and interacting with the physical world [15][16]
- Despite advancements, AI struggles with tasks like estimating distances and navigating environments, indicating a fundamental gap in spatial understanding [15][16]
Group 3: Future Directions for AI Development
- The development of world models is essential for creating AI that can understand and interact with the world in a human-like manner [18][24]
- World models should be capable of generating consistent virtual worlds, processing multimodal inputs, and predicting future states based on actions [18][19][20]
Group 4: Applications of Spatial Intelligence
- The potential applications of spatial intelligence span various fields, including creativity, robotics, science, medicine, and education [34][35]
- In creative industries, tools like World Labs' Marble platform enable creators to build immersive experiences without traditional design constraints [28][29]
- In robotics, spatial intelligence can enhance machine learning and human-robot collaboration, making robots more effective in various environments [30][31]
Group 5: Vision for the Future
- The article envisions a future where AI enhances human capabilities rather than replacing them, emphasizing the importance of aligning AI development with human needs [26][36]
- The ultimate goal is to create machines that can understand and interact with the physical world, thereby improving human welfare and addressing significant challenges [38]
Cracking the "Choice Paralysis" of Multimodal Large Models! Their Internal Decision Mechanism Revealed for the First Time: Wildly "Oscillating" Between Conflicting Information
量子位· 2025-11-14 05:38
Core Argument
- The article argues that modality following in multi-modal large language models (MLLMs) is a dynamic process influenced by relative reasoning uncertainty and inherent modality preference, rather than a static attribute [1][4][37].
Group 1: Research Contributions
- A new toy dataset was constructed to systematically and independently vary the reasoning difficulty of visual and textual inputs, enabling different difficulty combinations for multi-modal inputs [4].
- The study decomposes the explicit behavior of modality following into two core components: case-specific relative reasoning uncertainty and the model's stable inherent modality preference [4][5].
- An empirical finding indicates that the probability of a model following a certain modality decreases monotonically as the relative reasoning uncertainty of that modality increases [5].
Group 2: Framework Design
- A controlled dataset was created to validate hypotheses, allowing independent control of visual and textual reasoning complexity [9][10].
- Uncertainty was measured using output entropy, which reflects the model's perceived uncertainty, with lower entropy indicating confident predictions and higher entropy indicating consideration of alternative options [11].
- Relative uncertainty was quantified to measure the confidence gap between the text and visual modalities, providing a core metric for the subsequent analysis [12].
Group 3: Limitations of Traditional Metrics
- Traditional macro metrics like Text Following Rate (TFR) and Visual Following Rate (VFR) were tested on the constructed dataset, revealing confusing patterns that highlight their limitations [14].
- The study identifies a common trend where models perceive text as easier on average, yet exhibit opposite macro preferences, raising questions about the underlying reasons for these discrepancies [15][16].
Group 4: Experimental Paradigm
- A new experimental paradigm was designed to decouple model capability from preference, allowing for a clearer understanding of the model's decision-making process [18].
- The researchers grouped data points based on relative uncertainty to create a complete preference curve, reflecting how model preferences change dynamically with relative difficulty [18].
Group 5: Key Experimental Findings
- All tested models exhibited a consistent trend where the probability of following text decreases smoothly as text becomes relatively more difficult [19][21].
- The "balance point" was defined as the point where the curve crosses the 50% probability line, serving as a quantifiable measure of inherent modality preference [22].
- The framework successfully explained previous puzzles regarding model behavior by revealing differences in inherent preferences that were not visible in macro metrics [23][24].
Group 6: Internal Mechanisms
- The study explored the internal decision-making mechanisms of models, particularly their oscillation behavior when faced with conflicting information near the balance point [29][30].
- The findings indicate that models exhibit higher oscillation counts in ambiguous regions, providing a mechanistic explanation for the indecision observed in external behavior [34][36].
Conclusion
- The research presents a new framework for understanding modality following in MLLMs, emphasizing the importance of separating model capability from inherent preference, and revealing a robust rule that the likelihood of following a modality decreases with increasing relative uncertainty [37].
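To make the core quantities concrete, here is a minimal Python sketch of how output entropy, relative uncertainty, the preference curve, and the balance point could be computed. The function names, the binning scheme, and the interpolation step are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def output_entropy(probs):
    """Shannon entropy of the model's answer distribution (nats).
    Low entropy = a confident prediction; high entropy = the model is
    weighing several alternative options."""
    p = np.clip(np.asarray(probs, dtype=float), 1e-12, None)
    p = p / p.sum()
    return float(-(p * np.log(p)).sum())

def relative_uncertainty(text_only_probs, visual_only_probs):
    """Confidence gap between modalities: positive values mean the text
    branch is *more* uncertain than the visual branch for this case."""
    return output_entropy(text_only_probs) - output_entropy(visual_only_probs)

def preference_curve(records, n_bins=10):
    """Bin conflicting cases by relative uncertainty and measure, per bin,
    how often the model's final answer followed the text modality.
    `records` is a list of (relative_uncertainty, followed_text) pairs."""
    rel_u = np.array([r[0] for r in records], dtype=float)
    follow = np.array([float(r[1]) for r in records])
    edges = np.quantile(rel_u, np.linspace(0.0, 1.0, n_bins + 1))
    centers, rates = [], []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (rel_u >= lo) & (rel_u <= hi)
        if mask.any():
            centers.append(rel_u[mask].mean())
            rates.append(follow[mask].mean())
    return np.array(centers), np.array(rates)

def balance_point(centers, rates):
    """Relative uncertainty at which the text-following rate crosses 50%,
    read off the (assumed monotonically decreasing) preference curve."""
    return float(np.interp(0.5, rates[::-1], centers[::-1]))
```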
Cracking the "Choice Paralysis" of Multimodal Large Models! Their Internal Decision Mechanism Revealed for the First Time: Wildly "Oscillating" Between Conflicting Information
量子位· 2025-11-14 02:04
Core Argument
- The article argues that modality following in multi-modal large language models (MLLMs) is a dynamic process influenced by relative reasoning uncertainty and inherent modality preference, rather than a static attribute [1][4][37].
Group 1: Contributions and Findings
- A new controlled toy dataset was constructed to systematically manipulate the reasoning difficulty of visual and textual inputs [4].
- The study decomposes modality following into two core components: case-specific relative reasoning uncertainty and the model's stable inherent modality preference [4][5].
- A fundamental finding indicates that the probability of a model following a certain modality decreases monotonically as the relative reasoning uncertainty of that modality increases [5].
- The framework provides a more reasonable method for quantifying inherent preference, defining it as the balance point where the model treats both modalities equally [5][22].
- The research explores the internal decision-making mechanisms of models, revealing oscillations in predictions when uncertainty is near the balance point [5][29].
Group 2: Experimental Design
- The researchers established a controlled experimental environment using a novel toy dataset that independently controls visual and textual reasoning complexity [9][10].
- A model-centered uncertainty metric, output entropy, was employed to reflect the model's perceived uncertainty [11].
- Relative single-modal uncertainty was introduced to quantify the confidence gap in each conflicting case, serving as the core metric for the subsequent analysis [12].
Group 3: Limitations of Traditional Metrics
- Traditional macro metrics like Text Following Rate (TFR) and Visual Following Rate (VFR) were tested on the constructed dataset, revealing confusing patterns that highlight their limitations [14].
- The study identifies two puzzles regarding the models' preferences and difficulty perceptions, suggesting that traditional metrics obscure the true motivations behind model decisions [16][23].
Group 4: New Experimental Paradigm
- A new experimental paradigm was designed to decouple model capability from preference, allowing for a clearer understanding of the models' decision-making processes [18].
- The researchers grouped data points based on relative uncertainty to create a complete preference curve reflecting how model preferences change with relative difficulty [18].
Group 5: Key Experimental Discoveries
- All tested models exhibited a consistent trend: as text becomes relatively more difficult, the probability of following text decreases smoothly [19][21].
- The balance point quantifies inherent preference, indicating whether a model has a visual or textual bias based on its position on the relative uncertainty axis [22].
- The framework successfully explains the previously mentioned puzzles by revealing differences in inherent preferences among models [23][24].
Group 6: Internal Mechanisms
- The study investigates why models exhibit oscillations in decision-making when approaching their balance point, providing a mechanism for the observed indecision [29][33].
- A distinction is drawn between clear and ambiguous regions of input uncertainty, with oscillation frequency being significantly higher in ambiguous regions [30][34].
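The summary does not spell out how oscillation is counted, so the following is a hypothetical sketch of one way to probe it: project each intermediate layer's hidden state at the answer position through the output head (a logit-lens-style readout, assumed here) and count how often the leading option flips between the text-consistent and visual-consistent answers.

```python
import torch

@torch.no_grad()
def count_oscillations(hidden_states, lm_head, text_token_id, visual_token_id):
    """Count how many times the intermediate 'leading answer' flips between the
    text-consistent and visual-consistent options for one conflicting sample.

    hidden_states: list of [hidden_dim] tensors, one per transformer layer,
                   taken at the answer position (e.g. via output_hidden_states=True).
    lm_head:       the model's output projection, reused here as a logit-lens probe.
    """
    prefs = []
    for h in hidden_states:
        logits = lm_head(h)  # project the layer state into vocabulary space
        prefs.append("text" if logits[text_token_id] > logits[visual_token_id] else "visual")
    # An oscillation is a switch of the leading option between adjacent layers.
    return sum(a != b for a, b in zip(prefs, prefs[1:]))
```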
Fei-Fei Li's 10,000-Word Essay Goes Viral! Defining the Next Decade of AI
创业邦· 2025-11-12 03:08
Core Insights
- The article emphasizes that "spatial intelligence" is the next frontier for AI, enabling machines to transform perception into action and imagination into creation [2][7]
- The concept of a "world model" is identified as essential for unlocking spatial intelligence, requiring AI to generate consistent worlds that obey physical laws and to process multimodal inputs [3][5]
Group 1: Definition and Importance of Spatial Intelligence
- Spatial intelligence is described as a foundational capability for human cognition, influencing how individuals interact with the physical world [15][19]
- The evolution of spatial intelligence is linked to significant historical advancements, showcasing its role in shaping civilization [21][22]
Group 2: Current Limitations of AI
- Despite advancements in AI, current models lack the spatial reasoning capabilities that humans possess, particularly in tasks involving distance estimation and physical interaction [22][25]
- These limitations hinder existing models' ability to engage effectively with the physical world, limiting their application across various fields [25][26]
Group 3: Building a World Model
- Constructing a world model requires three core capabilities: generative, multimodal, and interactive, allowing AI to create and manipulate virtual or real environments [27][29][30]
- The development of a world model is seen as a defining challenge for the next decade, necessitating innovative approaches and methodologies [31][32]
Group 4: Applications of Spatial Intelligence
- The potential applications of spatial intelligence span various domains, including creative industries, robotics, and scientific research, promising to enhance human capabilities [38][48]
- Specific use cases include revolutionizing storytelling, improving robotic interaction, and transforming education through immersive learning [40][44][49]
Group 5: Future Vision
- The article envisions a future where AI equipped with spatial intelligence can serve as a partner in addressing complex challenges, enhancing human creativity, and improving quality of life [51]
- A collaborative effort across the entire AI ecosystem is deemed essential for realizing this vision, highlighting the need for collective innovation and development [39][50]
Annual Service Minutes Surpass One Trillion for the First Time as Agora Rides the Conversational AI Wave
Sohu Caijing· 2025-11-03 13:17
Core Insights
- Agora, Inc. (声网) has achieved significant milestones, including surpassing 1 trillion annual service minutes and launching multiple new products, indicating a positive trajectory for the company [1]
- The rise of multimodal AI models has led to increased enterprise investment in voice AI, with 67% of companies placing voice AI at their strategic core and 84% planning to increase investment in the coming year [1]
- Agora recently partnered with OpenAI to launch the first Realtime API for low-latency voice interaction, marking a strategic shift toward conversational AI [3]
Company Developments
- Agora CEO Zhao Bin announced that the company's annual service minutes exceeded 1 trillion, highlighting growth and product innovation [1]
- The company has introduced several products focused on conversational AI, including a new AI engine that enhances dialogue capabilities and supports various ASR and TTS providers [4]
- Agora's revenue for Q2 2025 was reported at $34.3 million, a slight year-over-year increase of 0.5%, with a net profit of $1.5 million, marking a return to profitability [5]
Industry Trends
- The conversational AI market is projected to grow significantly, with ARK Invest estimating that the AI companionship sector could grow from $30 million to between $70 billion and $150 billion [5]
- Despite these advances, only 21% of users are satisfied with current AI dialogue experiences, indicating room for improvement in areas such as low-latency response and emotional understanding [5]
- The integration of conversational AI into business strategy is becoming increasingly important, with companies recognizing its potential as a key component of next-generation AI infrastructure [5]
Surpassing NVIDIA's Describe Anything! The Chinese Academy of Sciences and ByteDance Jointly Propose "GAR", Building on DeepSeek-OCR
量子位· 2025-10-28 05:12
Core Insights
- The article discusses the innovative approach "Vision as Context Compression" proposed by DeepSeek-OCR, focusing on using OCR capabilities to compress documents through images [1]
- The collaboration between the Chinese Academy of Sciences and ByteDance introduces "Grasp Any Region" (GAR), which explores the potential of natural images as a means of text compression [2]
- GAR's precise region captioning capability is highlighted as a potential pathway for constructing dense captions for natural images [4]
Summary by Sections
GAR Capabilities
- GAR possesses three main abilities: accurately describing user-specified regions, modeling relationships between multiple regions, and performing complex combinatorial reasoning [5][7]
- The model allows users to provide various visual prompts and instructions for precise understanding of specific regions [9][10]
Importance of Region MLLMs
- Region MLLMs differ from traditional MLLMs by enabling fine-grained, interactive understanding of image/video content [8]
- The article emphasizes the challenge of evaluating full-image captions, while region captions can be objectively assessed based on color, texture, shape, and material [12]
Trade-off Between Local and Global Information
- The article discusses the dilemma faced by Region MLLMs in balancing local details and global context [15]
- Examples are provided to illustrate how GAR outperforms other models like DAM in accurately identifying and describing specified regions [18][19]
Model Design and Mechanism
- GAR's design follows the principle of achieving fine-grained understanding while retaining global context [39]
- The introduction of a lightweight prompt encoding mechanism and RoI-Aligned Feature Replay allows for high-fidelity feature extraction from specified regions [46][49]
Data Pipeline and Training
- The training process involves multiple stages to enhance recognition capabilities and support multi-region associative reasoning [57][59][61]
- The creation of GAR-Bench aims to systematically evaluate the region-level understanding capabilities of multimodal large language models (MLLMs) [64]
Performance Evaluation
- GAR models demonstrate superior performance in various benchmark tests, achieving high scores in both single-region and multi-region understanding tasks [71][74]
- The results indicate GAR's effectiveness in generating rich, accurate, and detailed local descriptions, establishing it as a state-of-the-art solution [77]
Zero-shot Transfer to Video Tasks
- GAR's capabilities extend to video tasks, showing strong performance in zero-shot settings, even surpassing models specifically trained for video [79]
- The article concludes with the potential applications of GAR in training multimodal understanding models and enhancing adherence to complex text instructions [80][81]
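As an illustration of what an RoI-aligned "feature replay" step might look like, the sketch below re-extracts region features from a vision encoder's feature map with torchvision's roi_align and flattens them into extra tokens. The shapes, scaling, and token layout are assumptions for illustration, not GAR's released code.

```python
import torch
from torchvision.ops import roi_align

def roi_feature_replay(feature_map, boxes, image_size, out_size=4):
    """Re-extract ('replay') features for user-specified regions from the vision
    encoder's patch feature map, so region tokens keep fine detail while the
    global image tokens keep full-image context.

    feature_map: [B, C, H, W] patch features (assumes a square input image).
    boxes:       [K, 5] RoIs as (batch_index, x1, y1, x2, y2) in image pixels.
    image_size:  input image side length, used to map pixels onto the feature grid.
    """
    spatial_scale = feature_map.shape[-1] / image_size   # feature cells per pixel
    region_feats = roi_align(
        feature_map, boxes,
        output_size=(out_size, out_size),
        spatial_scale=spatial_scale,
        aligned=True,
    )                                                    # [K, C, out_size, out_size]
    # Flatten each region into a short token sequence; in a Region MLLM these
    # tokens would be appended to the global image tokens fed to the LM.
    return region_feats.flatten(2).transpose(1, 2)       # [K, out_size*out_size, C]
```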
The First Earth-Science Agent, Earth-Agent, Is Here: Unlocking a New Paradigm for Earth Observation Data Analysis
机器之心· 2025-10-27 08:44
Core Insights
- The article discusses the development of Earth-Agent, a multimodal large language model (MLLM)-based agent designed to advance Earth science research by automating complex analytical tasks and mimicking expert capabilities [3][10].
Group 1: Earth-Agent Overview
- Earth-Agent aims to function as an "AI scientist" capable of understanding research intentions and autonomously planning analysis workflows [3].
- The agent can process raw spectral data, remote sensing images, and Earth product data, handling tasks from data preprocessing to spatiotemporal analysis [3][10].
Group 2: Framework and Methodology
- The Earth-Agent framework consists of two key components: the encapsulation of domain knowledge into standardized, executable functions, and the use of an LLM for intelligent planning and scheduling [10].
- A total of 104 specialized tools have been integrated into the tool library, allowing the agent to dynamically select the most appropriate tools for a given task [10].
Group 3: Benchmarking and Evaluation
- Earth-Bench, the dataset used to evaluate Earth-Agent, includes 248 expert-annotated tasks across 13,729 images, emphasizing the agent's ability to execute complete Earth science analysis workflows [12][13].
- The evaluation process includes both step-by-step reasoning and end-to-end assessments, scoring the reasoning process as well as the final results [17].
Group 4: Performance Comparison
- Earth-Agent outperforms traditional agent architectures and MLLM methods across a range of tasks, demonstrating superior capabilities on Earth observation tasks [22].
- In comparative experiments, Earth-Agent achieved an average accuracy of 55.83% across different modalities, significantly higher than other models [22].
Group 5: Future Directions
- The article suggests that Earth-Agent represents a new learning paradigm, externalizing capabilities into a structured tool library rather than encoding all knowledge within the model [26].
- Future developments may include expanding the tool library, addressing issues like "tool hallucination", and integrating visual capabilities to enhance tool perception [26].
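The two components described above — standardized executable tools plus an LLM planner that schedules them — can be sketched as a generic plan-act loop. The interfaces below (ToolRegistry, llm_plan_step) are assumed for illustration and are not Earth-Agent's actual code or its 104-tool library.

```python
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class Tool:
    name: str
    description: str          # shown to the LLM planner when it chooses tools
    fn: Callable[..., dict]

class ToolRegistry:
    """Domain expertise wrapped as standardized, executable functions."""
    def __init__(self):
        self._tools: Dict[str, Tool] = {}

    def register(self, tool: Tool):
        self._tools[tool.name] = tool

    def catalog(self) -> str:
        return "\n".join(f"{t.name}: {t.description}" for t in self._tools.values())

    def run(self, name: str, **kwargs) -> dict:
        if name not in self._tools:   # guard against 'tool hallucination'
            return {"error": f"unknown tool '{name}'"}
        return self._tools[name].fn(**kwargs)

def run_agent(llm_plan_step, registry: ToolRegistry, task: str, max_steps: int = 8):
    """Minimal plan-act loop: the planner sees the task, the tool catalog and prior
    observations, then either emits the next tool call or a final answer."""
    history = []
    for _ in range(max_steps):
        step = llm_plan_step(task=task, tools=registry.catalog(), history=history)
        if step.get("final_answer") is not None:
            return step["final_answer"]
        observation = registry.run(step["tool"], **step.get("args", {}))
        history.append({"call": step, "observation": observation})
    return None  # ran out of steps without a final answer
```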
RAG and Search Agents Losing Their Shine? Apple's DeepMMSearch-R1 Enters the New Battlefield of Multimodal Search
36Ke· 2025-10-17 02:44
Core Insights
- Apple has introduced a new model called DeepMMSearch-R1, which enhances multimodal large language models (MLLMs) for web search by enabling dynamic querying and self-correction during multi-round interactions [1][6].
Model Development
- The DeepMMSearch-R1 model addresses limitations in existing methods like retrieval-augmented generation (RAG) and search agents, which often suffer from inefficiencies and poor results due to rigid processes and excessive search calls [1][3].
- The model employs a two-stage training process: supervised fine-tuning (SFT) followed by online reinforcement learning (RL) using the Group-Relative Policy Optimization (GRPO) algorithm [3][5][10].
Dataset Creation
- Apple has created a new dataset named DeepMMSearchVQA, which includes diverse visual question-answering samples presented in multi-turn dialogue format, ensuring a balanced distribution across different knowledge categories [3][7].
- The dataset consists of approximately 47,000 refined dialogue samples, derived from a random selection of 200,000 samples from the InfoSeek training set, with quality ensured by retaining only those dialogues that align with the predictions of the Gemini-2.5-Pro model [7].
Search Process Integration
- The model integrates three tools: a text search tool for targeted queries, a Grounding DINO-based image localization tool for identifying relevant areas in images, and an image search tool for retrieving web content based on input images [4][5].
- This targeted search approach significantly improves retrieval quality and overall performance [3][4].
Performance Metrics
- The DeepMMSearch-R1 model has shown significant performance improvements over RAG workflows and prompt-based search agents, achieving gains of +21.13% and +8.89%, respectively [13].
- The model's performance is comparable to OpenAI's o3, indicating its competitive edge in the market [13].
Training Efficiency
- The SFT phase focuses on enhancing the language model's reasoning capabilities for web retrieval, while the RL phase optimizes tool selection behavior by reducing unnecessary calls [16][17].
- The model maintains its general visual question-answering capabilities while learning to interact with web search tools effectively [19][20].
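For reference, the group-relative advantage at the heart of GRPO can be sketched in a few lines: each candidate answer sampled for the same prompt is scored, and its advantage is the reward standardized against the group's mean and standard deviation. The reward shaping below (correctness minus a small penalty per unnecessary tool call) is an assumption for illustration; the paper's exact rewards are not given in this summary.

```python
import numpy as np

def grpo_advantages(rewards):
    """Group-relative advantages: for a group of candidate responses sampled from
    the same prompt, each response's advantage is its reward standardized against
    the group mean and standard deviation, so no separate value network is needed."""
    r = np.asarray(rewards, dtype=float)
    std = r.std()
    if std < 1e-8:               # all candidates scored equally: no learning signal
        return np.zeros_like(r)
    return (r - r.mean()) / std

# Example: 4 sampled answers for one image-question pair, rewarded for correctness
# (1.0) minus an assumed 0.1 penalty per unnecessary tool call.
print(grpo_advantages([1.0 - 0.2, 0.0, 1.0 - 0.1, 0.0]))
```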
RAG and Search Agents Losing Their Shine? Apple's DeepMMSearch-R1 Enters the New Battlefield of Multimodal Search
机器之心· 2025-10-17 02:11
Core Insights
- Apple has introduced a new solution for empowering multimodal large language models (MLLMs) in multimodal web search, addressing inefficiencies in existing methods like retrieval-augmented generation (RAG) and search agents [1][5].
Group 1: Model Development
- The DeepMMSearch-R1 model allows for on-demand multi-round web searches and dynamically generates queries for text and image search tools, improving efficiency and results [1][3].
- A two-stage training process is employed, starting with supervised fine-tuning (SFT) followed by online reinforcement learning (RL) using the GRPO algorithm, aimed at optimizing search initiation and tool usage [3][4].
Group 2: Dataset Creation
- Apple has created a new dataset called DeepMMSearchVQA, which includes diverse multi-hop visual question-answering samples presented in multi-round dialogue format, balancing different knowledge categories [4][7].
- The dataset construction involved selecting 200,000 samples from the InfoSeek training set, resulting in approximately 47,000 refined dialogue samples for training [7].
Group 3: Training Process
- In the SFT phase, the Qwen2.5-VL-7B model is fine-tuned to enhance its reasoning capabilities for web search information while keeping the visual encoder frozen [9].
- The RL phase utilizes GRPO to improve training stability by comparing candidate responses generated under the same prompt, optimizing the model's tool selection behavior [10][12].
Group 4: Performance Results
- The DeepMMSearch-R1 model significantly outperforms RAG workflows and prompt-based search agents, achieving performance increases of +21.13% and +8.89%, respectively [16].
- The model's ability to perform targeted image searches and self-reflection enhances overall performance, as demonstrated in various experiments [16][18].
Group 5: Tool Utilization
- The model's tool usage behavior aligns with dataset characteristics, with 87.7% tool invocation on the DynVQA dataset and 43.5% on the OKVQA dataset [20].
- The RL model effectively corrects unnecessary tool usage observed in the SFT model, highlighting the importance of RL in optimizing tool efficiency [21].
Group 6: Generalization Capability
- The use of LoRA modules during SFT and a KL penalty in online GRPO training helps maintain the model's general visual question-answering capabilities across multiple datasets [23][24].
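A minimal sketch of the on-demand, multi-round search loop described above follows; mllm_step and the tool callables are placeholders rather than real APIs, and the dialogue format is an assumption.

```python
from typing import Callable, Dict, List, Optional

def multimodal_search_loop(
    mllm_step: Callable[[List[dict]], dict],
    tools: Dict[str, Callable[..., str]],
    image,
    question: str,
    max_rounds: int = 5,
) -> Optional[str]:
    """On-demand multi-round search: at each round the model either answers
    directly or emits a tool call (e.g. text search, grounded image-crop search,
    whole-image search); the retrieved snippet is appended to the dialogue so the
    next round can refine the query or self-correct."""
    dialogue = [{"role": "user", "image": image, "text": question}]
    for _ in range(max_rounds):
        step = mllm_step(dialogue)   # {'answer': ...} or {'tool': ..., 'args': {...}}
        if "answer" in step:
            return step["answer"]
        observation = tools[step["tool"]](**step.get("args", {}))
        dialogue.append({"role": "tool", "name": step["tool"], "text": observation})
    return None  # no confident answer within the round budget
```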
No More "Guessing Coordinates"! Yan Shuicheng's Team and Collaborators Release PaDT, a Multimodal Large Model with True Multimodal Representation Output
机器之心· 2025-10-16 00:51
Core Insights
- The article discusses the advancements in Multimodal Large Language Models (MLLMs) and introduces a new paradigm called Patch-as-Decodable Token (PaDT) to address the limitations of existing models in tasks requiring fine spatial understanding [2][6].
Group 1: PaDT Overview
- PaDT proposes a revolutionary approach by dividing images into multiple visual patches and allowing the model to generate corresponding Visual Reference Tokens (VRTs) directly [3].
- It enables seamless alternation between text tokens and visual tokens at both the input and output stages, making the model's description of image content as natural as describing text [4].
- The model can directly indicate image targets in generated sentences rather than guessing coordinates [5].
Group 2: Limitations of Traditional MLLMs
- Traditional MLLMs output detection-box coordinates in string format, leading to inconsistencies, semantic disconnection, and weak image-text associations [8].
- The output format can vary, making it difficult to parse targets, and numbers can be split into separate tokens, disrupting spatial continuity [8].
- The reliance on coordinate tokens, which lack inherent semantic meaning, results in challenges such as hallucination and repetition in generated outputs [8].
Group 3: PaDT Mechanism
- PaDT introduces VRTs derived from the visual patch embeddings of the input image, creating a dynamic embedding table that integrates both text and visual information [11].
- This design avoids the pitfalls of traditional methods that depend on global visual codebooks, which can confuse similar objects and generate non-existent patches [13].
- The lightweight PaDT decoder, consisting of three bidirectional attention blocks, transforms VRTs into structured visual outputs such as bounding boxes and segmentation masks [15].
Group 4: Performance Metrics
- PaDT Pro (3B) achieved a remarkable average accuracy of 93.6 on the RefCOCO/+/g referring expression comprehension task, surpassing the 78B InternVL3 model, which scored 91.4 [21][22].
- On the COCO open-vocabulary detection task, traditional MLLMs typically have a mean Average Precision (mAP) below 20, while PaDT Pro (3B) raised it to 38.2, nearly doubling the performance [21][24].
- The model also demonstrated strong performance on the Referring Image Captioning (RIC) task, significantly improving the CIDEr-D score from 0.386 to 1.450 [24].
Group 5: Implications and Future Directions
- PaDT's success stems from its deep understanding of the visual capability bottlenecks in MLLMs, allowing for native alignment between visual patches and generated tokens [31].
- The dynamic embedding mechanism ensures strong binding of VRTs to the current image, preventing cross-image confusion [31].
- The model exhibits robust multitasking capabilities, outperforming single-task models by seamlessly switching tasks through prompt changes [33].
- The introduction of PaDT marks a significant step toward true multimodal intelligence, allowing more natural interactions between different modalities [35].
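To make the "dynamic embedding table" idea concrete, here is a rough sketch of how per-image Visual Reference Tokens might be appended to the vocabulary embedding so the language model can emit patch tokens instead of coordinate strings. The projection layer and shapes are assumptions; this is not PaDT's released implementation.

```python
import torch

def build_dynamic_embeddings(text_embedding, patch_embeds, proj):
    """Sketch of a per-image 'dynamic embedding table': the base vocabulary is
    extended with Visual Reference Tokens derived from this image's patch
    embeddings, so the LM can emit a patch token instead of a coordinate string.

    text_embedding: nn.Embedding over the base vocabulary, weight of shape [V, D].
    patch_embeds:   [N, D_vis] patch features from the vision encoder.
    proj:           nn.Linear mapping D_vis -> D (an assumed projection layer).
    """
    vrt_embeds = proj(patch_embeds)                                # [N, D]
    table = torch.cat([text_embedding.weight, vrt_embeds], dim=0)  # [V + N, D]
    vrt_ids = torch.arange(
        text_embedding.num_embeddings,
        text_embedding.num_embeddings + patch_embeds.size(0),
    )
    # Output logits are computed against `table`, so at each step the model can
    # choose either an ordinary word token or one of this image's patches (a VRT).
    return table, vrt_ids
```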