Multimodal Large Language Models
Letting the Model Find Key Frames and Visual Cues on Its Own: Xiaohongshu's Video-Thinker Cracks the Video Reasoning Impasse
机器之心· 2026-01-02 03:12
Core Insights
- The article discusses the advances in video reasoning brought by the "Thinking with Videos" paradigm, specifically the Video-Thinker model, which strengthens a model's ability to autonomously navigate and understand temporal sequences in videos [2][6][10]

Group 1: Model Development and Methodology
- Video-Thinker integrates "temporal grounding" and "visual captioning" into the model's cognitive chain, eliminating reliance on external tools and enabling the model to autonomously identify key frames and extract visual cues [2][10]
- The research team constructed the Video-Thinker-10K dataset, consisting of 10,000 high-quality samples, and employed a two-phase "supervised fine-tuning + reinforcement learning" training strategy to strengthen the model's self-exploration and self-correction capabilities [3][10]
- The model achieved state-of-the-art (SOTA) performance on various challenging video reasoning benchmarks, significantly surpassing existing baselines at the 7-billion-parameter scale [3][22]

Group 2: Data Quality and Training Process
- High-quality training data is crucial for developing complex reasoning capabilities; six major datasets were integrated into Video-Thinker-10K, combining precise temporal annotations with detailed visual descriptions [12][13]
- Training enforces a structured thinking paradigm in which the model learns to output specific tags such as <time> and <caption>, ensuring a rigorous "locate - perceive - reason" sequence [16][18]
- The reinforcement learning phase, using Group Relative Policy Optimization (GRPO), let the model explore and optimize its reasoning strategies, leading to emergent cognitive behaviors akin to human metacognition (a toy sketch follows this summary) [19][22]

Group 3: Performance Evaluation
- Video-Thinker-7B demonstrated significant advantages across various video reasoning benchmarks, establishing a new SOTA among 7-billion-parameter models [25][29]
- Performance was evaluated through both in-domain and out-of-domain assessments, showcasing the model's ability to generalize to unseen scenarios [24][29]
- The model achieved 43.22% accuracy on the Video-Holmes benchmark and 80.69% on VRBench, outperforming previous models by notable margins [29][30]

Group 4: Key Findings and Implications
- The model's success is attributed to its internal grounding and captioning capabilities, which were quantitatively assessed and found superior to those of baseline models [32][36]
- The findings indicate that relying on external tools can hinder performance: experiments showed that simple plug-and-play tools degraded, rather than enhanced, the model's reasoning [34][35]
- The article concludes that Video-Thinker's approach of integrating core internal capabilities, rather than depending on sheer parameter count and dataset size, represents a new paradigm in video reasoning, with potential applications across various industries [39]
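The summary describes GRPO and the structured tags only at a high level. As a rough illustration of the two ingredients named here (the group-relative advantage at the heart of GRPO, and a reward that checks the <time>/<caption> ordering), the following minimal Python sketch may help; the function names, reward shaping, and tag details are illustrative assumptions, not Video-Thinker's actual implementation.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: normalize each sampled rollout's reward
    against its own group's mean and std, so no value critic is needed."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

def structure_reward(response):
    """Hypothetical format reward: the rollout must emit a <time> tag
    before a <caption> tag, enforcing locate -> perceive -> reason."""
    t, c = response.find("<time>"), response.find("<caption>")
    return 1.0 if 0 <= t < c else 0.0

# Toy usage: four sampled rollouts for one video question; in practice a
# task-accuracy term would be added on top of the structure reward.
rollouts = [
    "<time>00:12-00:15</time><caption>a hand flips the switch</caption> ...",
    "the light turns on",                                  # no grounding
    "<time>00:12-00:15</time><caption>a hand flips the switch</caption> ...",
    "<caption>a switch</caption><time>00:12</time> ...",   # wrong order
]
print(group_relative_advantages([structure_reward(r) for r in rollouts]))
# -> [ 1. -1.  1. -1.]
```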
NeurIPS 2025 | Farewell to Full-Dataset Scans! Zhejiang University Proposes COIDO to Crack the 'High-Cost' Problem of Multimodal Data Selection
机器之心· 2025-12-13 08:31
Core Insights
- The article introduces COIDO (Coupled Importance-Diversity Optimization), a framework designed to optimize data selection for visual instruction tuning of multi-modal large language models (MLLMs) [4][9][23]
- COIDO aims to reduce the computational cost of data selection while retaining high-quality data, addressing the challenge that existing methods often require a full traversal of the dataset [12][23]

Group 1: Motivation and Background
- The rapid growth of datasets such as LLaVA-665K has led to significant computational overhead and redundancy when fine-tuning MLLMs on full datasets [8]
- Existing data selection methods face two main issues: high selection cost and the decoupling of importance from diversity during selection [12][9]

Group 2: Methodology
- COIDO introduces a lightweight scoring mechanism trained on a small sample (e.g., 20%) of the full dataset, which then generalizes without requiring a full data traversal [14]
- The core innovation is the coupled optimization of importance and diversity within a unified training framework, rather than treating them as separate phases [14]
- The importance loss is based on a reweighted cross-entropy loss, while the diversity loss uses spectral clustering to minimize variance among clusters, ensuring a diverse selection (a toy sketch follows this summary) [14][15]

Group 3: Experimental Results
- Using only 20% of the data, COIDO achieves state-of-the-art performance, reaching 98.2% of full-data fine-tuning performance across various benchmarks [20][21]
- The framework demonstrates strong generalization and transferability, outperforming models trained from scratch on new datasets [21]

Group 4: Conclusion
- COIDO presents a novel paradigm for multi-modal data selection, challenging the notion that data selection must be costly and providing a pathway toward efficient fine-tuning of MLLMs [23][24]
- The framework's low computational cost and high-quality selection make it a valuable tool for researchers with limited resources [23]
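The coupled objective is described only in words above. Under the stated assumptions (a lightweight scorer head producing per-sample scores, per-sample cross-entropy from the target MLLM as the importance signal, and cluster assignments from spectral clustering), one way to couple the two losses might look like this PyTorch sketch; the names and exact loss forms are illustrative guesses, not COIDO's published code.

```python
import torch

def coido_loss(scores, ce, cluster_ids, num_clusters, lam=0.1):
    """Single coupled objective: importance + lam * diversity."""
    w = torch.softmax(scores, dim=0)            # normalized sample weights
    importance = (w * ce).sum()                 # reweighted cross-entropy
    # Diversity: spread the scorer's mass evenly across spectral clusters
    # by minimizing the variance of per-cluster total weight.
    mass = torch.zeros(num_clusters).index_add(0, cluster_ids, w)
    return importance + lam * mass.var()

# Toy usage on a small subsample (COIDO trains its scorer on ~20% of the pool).
scores = torch.randn(8, requires_grad=True)      # scorer-head outputs
ce = torch.rand(8)                               # stand-in per-sample CE losses
cluster_ids = torch.tensor([0, 0, 1, 1, 2, 2, 3, 3])
coido_loss(scores, ce, cluster_ids, num_clusters=4).backward()
```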
Large Models Diagnosed with 'Visual Illiteracy'! A Multi-University Team Proposes MILO to Give Them Spatial Imagination
量子位· 2025-12-04 09:55
Core Insights
- The article discusses the limitations of multi-modal large language models (MLLMs) in spatial reasoning, highlighting their inability to effectively understand and visualize spatial concepts, a phenomenon termed "visual illiteracy" [2][3]

Group 1: Challenges in Spatial Reasoning
- Spatial reasoning is identified as a core cognitive ability through which humans understand three-dimensional structure, and it poses a significant challenge for MLLMs in practical applications [2]
- Current methods rely primarily on "language description tuning," which fails to give models a true visual understanding of spatial concepts [2][3]

Group 2: Introduction of MILO
- A research team has proposed MILO (Implicit Spatial World Modeling) to address these challenges by integrating visual generative feedback with symbolic reasoning [4]
- MILO employs a two-phase training process: first, visual generative tuning, in which the model learns spatial transformations through visual outputs; second, language tuning using spatial instruction data [5]

Group 3: Enhancements in Geometric Perception
- To further enhance geometric perception, the team introduced RePE (Relative Positional Encoding), which captures relative transformations between adjacent frames instead of relying on a global coordinate system, improving generalization and adaptability across datasets (a toy sketch follows this summary) [8][9]

Group 4: GeoGen Dataset
- The team constructed the GeoGen dataset, comprising approximately 2,241 videos and 267,000 "observation-action-result" triplets, aimed at enhancing geometric-perception generation [10]
- The dataset draws on diverse sources, including scanned 3D scenes and internet videos, covering a wide range of realistic scenarios [11]

Group 5: Validation of MILO
- MILO's effectiveness was validated across multiple baseline models and five categories of spatial understanding tasks, achieving the best performance on 3D scene understanding and spatial reasoning tasks [12][16]
- Notably, MILO improved accuracy by 3.2% on the ScanRefer task and reached an average accuracy of 61.7% on the VSI-Bench spatial reasoning task, surpassing the VG-LLM baseline by 2.2% [16]
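RePE's key move, encoding relative rather than global transforms, is easy to state concretely. Here is a minimal NumPy sketch under the assumption that frame poses are available as 4x4 camera-to-world matrices; the representation and function names are illustrative, not MILO's actual interface.

```python
import numpy as np

def relative_transforms(poses_c2w):
    """poses_c2w: (T, 4, 4) camera-to-world matrices for T frames.
    Returns (T-1, 4, 4) transforms expressing frame t in frame t+1's view."""
    return np.stack([np.linalg.inv(poses_c2w[t + 1]) @ poses_c2w[t]
                     for t in range(len(poses_c2w) - 1)])

# Toy usage: a camera translating 0.1 m along x each frame. The relative
# steps come out identical no matter where the trajectory sits in any
# global coordinate system, which is the generalization property claimed above.
poses = np.stack([np.eye(4) for _ in range(3)])
poses[1, 0, 3], poses[2, 0, 3] = 0.1, 0.2
print(relative_transforms(poses)[:, 0, 3])  # -> [-0.1 -0.1]
```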
Tencent Advertising Algorithm Competition Concludes Successfully, with Several Contestants Receiving Tencent Offer Letters of Intent On-Site
Sou Hu Cai Jing· 2025-11-28 04:16
Core Insights
- The 2025 Tencent Algorithm Competition held its finals in Shenzhen, with over 2,800 teams participating globally on the theme of "multi-modal generative recommendation" [1][5]
- The champion team "Echoch," with members from Huazhong University of Science and Technology, Peking University, and the University of Science and Technology of China, received Tencent offer letters and cash prizes [1]
- The competition attracted over 8,400 participants from nearly 30 countries, a record high for overseas registrations [5]

Competition Overview
- The finals featured 20 teams that stood out in a rigorous selection process, showcasing innovative generative recommendation algorithms [1]
- A special technical innovation award of 200,000 yuan went to the team "料峭春风吹酒醒" from the Institute of Computing Technology, Chinese Academy of Sciences [1]

Technological Insights
- The competition emphasized the application of advanced technologies such as LLMs (large language models) and MLLMs (multi-modal large language models), leading to significant innovations in model performance [3]
- Generative recommendation is seen as crucial for improving advertising precision and user experience, enabling personalized ad recommendations [5]

Industry Implications
- Tencent Vice President Jiang Jie highlighted the competition's role in attracting young talent to AI, reinforcing Tencent's commitment to technological innovation and to collaboration between academia and industry [3]
- The competition's dataset will be open-sourced after the event to foster further academic and industrial exchange [5]

Business Development
- Tencent's Q3 financial report introduced the "Tencent Advertising AIM+" smart advertising product matrix, which optimizes marketing returns for advertisers [6]
- Tencent's ongoing exploration of generative recommendation within its advertising business aims to enhance user experience and drive commercial growth [6]
Fei-Fei Li's Long Essay Takes Silicon Valley by Storm
投资界· 2025-11-14 08:01
Core Insights
- The article emphasizes that spatial intelligence is the next frontier for AI, with the potential to revolutionize creativity, robotics, scientific discovery, and more [6][10][14]
- It outlines the three core capabilities a world model must possess: generative, multimodal, and interactive [4][18][19]

Group 1: Importance of Spatial Intelligence
- Spatial intelligence is foundational to human cognition and shapes how individuals interact with the physical world [11][14]
- Historical examples illustrate how spatial intelligence has driven major advances in civilization, such as Eratosthenes' calculation of the Earth's circumference and Watson and Crick's discovery of the structure of DNA [12][13]

Group 2: Current Limitations of AI
- Current AI models, particularly large language models (LLMs), lack the spatial reasoning capabilities humans possess, limiting their effectiveness in understanding and interacting with the physical world [15][16]
- Despite rapid advances, AI still struggles with tasks like estimating distances and navigating environments, indicating a fundamental gap in spatial understanding [15][16]

Group 3: Future Directions for AI Development
- Developing world models is essential to creating AI that can understand and interact with the world in a human-like manner [18][24]
- World models should be able to generate consistent virtual worlds, process multimodal inputs, and predict future states based on actions [18][19][20]

Group 4: Applications of Spatial Intelligence
- Potential applications span creativity, robotics, science, medicine, and education [34][35]
- In creative industries, tools like World Labs' Marble platform let creators build immersive experiences without traditional design constraints [28][29]
- In robotics, spatial intelligence can improve machine learning and human-robot collaboration, making robots more effective across environments [30][31]

Group 5: Vision for the Future
- The article envisions a future in which AI enhances human capabilities rather than replacing them, emphasizing the importance of aligning AI development with human needs [26][36]
- The ultimate goal is machines that can understand and interact with the physical world, thereby improving human welfare and addressing major challenges [38]
Cracking Multimodal Large Models' 'Choice Paralysis'! Internal Decision Mechanism Revealed for the First Time: Wild 'Oscillation' Between Conflicting Information
量子位· 2025-11-14 05:38
Core Argument
- The article argues that modality following in multi-modal large language models (MLLMs) is a dynamic process shaped by relative reasoning uncertainty and inherent modality preference, rather than a static attribute [1][4][37]

Group 1: Research Contributions
- A new toy dataset was constructed to systematically and independently vary the reasoning difficulty of visual and textual inputs, enabling different difficulty combinations for multi-modal inputs [4]
- The study decomposes the observable behavior of modality following into two core components: case-specific relative reasoning uncertainty and the model's stable inherent modality preference [4][5]
- An empirical finding shows that the probability of a model following a given modality decreases monotonically as that modality's relative reasoning uncertainty increases [5]

Group 2: Framework Design
- A controlled dataset was created to validate the hypotheses, allowing independent control of visual and textual reasoning complexity [9][10]
- Uncertainty was measured via output entropy, which reflects the model's perceived uncertainty: low entropy indicates confident predictions, while high entropy indicates the model is weighing alternative options [11]
- Relative uncertainty was quantified to measure the confidence gap between the text and visual modalities, providing the core metric for subsequent analysis (a toy sketch follows this summary) [12]

Group 3: Limitations of Traditional Metrics
- Traditional macro metrics such as Text Following Rate (TFR) and Visual Following Rate (VFR), tested on the constructed dataset, produced confusing patterns that expose their limitations [14]
- The study identifies a common trend in which models perceive text as easier on average yet exhibit opposite macro preferences, raising the question of what underlies these discrepancies [15][16]

Group 4: Experimental Paradigm
- A new experimental paradigm was designed to decouple model capability from preference, allowing a clearer view of the model's decision-making process [18]
- Data points were grouped by relative uncertainty to trace a complete preference curve, reflecting how model preference changes dynamically with relative difficulty [18]

Group 5: Key Experimental Findings
- All tested models exhibited a consistent trend: the probability of following text decreases smoothly as text becomes relatively more difficult [19][21]
- The "balance point," where the curve crosses the 50% probability line, serves as a quantifiable measure of inherent modality preference [22]
- The framework resolved the earlier puzzles about model behavior by revealing differences in inherent preference that macro metrics cannot show [23][24]

Group 6: Internal Mechanisms
- The study examined models' internal decision-making, particularly their oscillation between answers when faced with conflicting information near the balance point [29][30]
- Oscillation counts are markedly higher in ambiguous regions, providing a mechanistic explanation for the indecision observed in external behavior [34][36]

Conclusion
- The research presents a new framework for understanding modality following in MLLMs, emphasizing the separation of model capability from inherent preference, and revealing a robust rule: the likelihood of following a modality decreases as its relative uncertainty increases [37]
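Output entropy and relative uncertainty are the core metrics here, and both are simple to state concretely. A minimal PyTorch sketch, assuming each conflicting case is scored once with only the text cue and once with only the visual cue over a fixed set of answer options (variable names are illustrative, not the paper's code):

```python
import torch
import torch.nn.functional as F

def answer_entropy(logits):
    """Shannon entropy of the model's distribution over answer options."""
    p = F.softmax(logits, dim=-1)
    return -(p * p.clamp_min(1e-12).log()).sum(-1)

def relative_uncertainty(text_logits, vision_logits):
    """Positive -> the text branch is the more uncertain (harder) one."""
    return answer_entropy(text_logits) - answer_entropy(vision_logits)

# Toy usage over 4 options: confident text cue vs. ambiguous visual cue.
text_logits = torch.tensor([4.0, 0.1, 0.1, 0.1])
vision_logits = torch.tensor([1.2, 1.0, 0.9, 0.8])
print(relative_uncertainty(text_logits, vision_logits))  # negative: text easier
# Binning cases by this value and plotting text-following probability per
# bin traces the preference curve; its 50% crossing is the balance point.
```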
Fei-Fei Li's 10,000-Word Essay Goes Viral! Defining AI's Next Decade
创业邦· 2025-11-12 03:08
Core Insights
- The article emphasizes that "spatial intelligence" is the next frontier for AI, enabling machines to turn perception into action and imagination into creation [2][7]
- The "world model" is identified as the key to unlocking spatial intelligence, requiring AI to generate consistent worlds that obey physical laws and can process multimodal inputs [3][5]

Group 1: Definition and Importance of Spatial Intelligence
- Spatial intelligence is described as a foundational capability of human cognition, shaping how individuals interact with the physical world [15][19]
- The evolution of spatial intelligence is linked to major historical advances, underscoring its role in shaping civilization [21][22]

Group 2: Current Limitations of AI
- Despite advances in AI, current models lack the spatial reasoning humans possess, particularly for distance estimation and physical interaction [22][25]
- These limitations hinder AI's ability to engage effectively with the physical world, restricting its application across fields [25][26]

Group 3: Building a World Model
- Constructing a world model requires three core capabilities: generative, multimodal, and interactive, allowing AI to create and manipulate virtual or real environments [27][29][30]
- Developing such a model is regarded as a defining challenge of the next decade, demanding innovative approaches and methodologies [31][32]

Group 4: Applications of Spatial Intelligence
- Potential applications span creative industries, robotics, and scientific research, promising to extend human capabilities [38][48]
- Specific use cases include revolutionizing storytelling, improving robotic interaction, and transforming education through immersive learning [40][44][49]

Group 5: Future Vision
- The article envisions a future in which AI equipped with spatial intelligence becomes a partner in tackling complex challenges, enhancing human creativity and quality of life [51]
- Realizing this vision is said to require the collaborative effort of the entire AI ecosystem, highlighting the need for collective innovation and development [39][50]
Annual Service Minutes Top One Trillion for the First Time as Agora (声网) Rides the Conversational AI Tailwind
Sou Hu Cai Jing· 2025-11-03 13:17
Core Insights
- Agora, Inc. (声网) has reached significant milestones, surpassing 1 trillion annual service minutes and launching multiple new products, signaling a positive trajectory for the company [1]
- The rise of multimodal AI models has driven enterprise investment in voice AI: 67% of companies now place voice AI at their strategic core, and 84% plan to increase investment in the coming year [1]
- Agora recently partnered with OpenAI to launch the first Realtime API for low-latency voice interaction, marking a strategic shift toward conversational AI [3]

Company Developments
- Agora CEO Zhao Bin announced that the company's annual service minutes exceeded 1 trillion, highlighting its growth and product innovation [1]
- The company has introduced several products focused on conversational AI, including a new AI engine that enhances dialogue capabilities and supports a range of ASR and TTS providers [4]
- Agora's Q2 2025 revenue was $34.3 million, up 0.5% year-over-year, with a net profit of $1.5 million, marking a return to profitability [5]

Industry Trends
- The conversational AI market is projected to grow substantially, with ARK Invest estimating the AI companionship sector could expand from $30 million to between $70 billion and $150 billion [5]
- Despite these advances, only 21% of users are satisfied with current AI dialogue experiences, leaving room for improvement in areas such as low-latency response and emotional understanding [5]
- Integrating conversational AI into business strategy is becoming increasingly important, with companies recognizing its potential as a key component of next-generation AI infrastructure [5]
Surpassing NVIDIA's Describe Anything! CAS & ByteDance Jointly Propose 'GAR', Building on DeepSeek-OCR's Foundations
量子位· 2025-10-28 05:12
Core Insights
- The article discusses "Vision as Context Compression," the approach proposed by DeepSeek-OCR, which uses OCR capability to compress documents via images [1]
- A collaboration between the Chinese Academy of Sciences and ByteDance introduces "Grasp Any Region" (GAR), which explores natural images as a medium for text compression [2]
- GAR's precise region-captioning capability is highlighted as a potential path to building dense captions for natural images [4]

Summary by Sections

GAR Capabilities
- GAR offers three main abilities: accurately describing user-specified regions, modeling relationships between multiple regions, and performing complex compositional reasoning [5][7]
- Users can provide various visual prompts and instructions for precise understanding of specific regions [9][10]

Importance of Region MLLMs
- Region MLLMs differ from traditional MLLMs in enabling fine-grained, interactive understanding of image and video content [8]
- Evaluating full-image captions is difficult, whereas region captions can be assessed objectively on color, texture, shape, and material [12]

Trade-off Between Local and Global Information
- Region MLLMs face a dilemma in balancing local detail against global context [15]
- Examples illustrate how GAR outperforms models such as DAM in accurately identifying and describing specified regions [18][19]

Model Design and Mechanism
- GAR's design follows the principle of achieving fine-grained understanding while retaining global context [39]
- A lightweight prompt-encoding mechanism and RoI-Aligned Feature Replay enable high-fidelity feature extraction from specified regions (a toy sketch follows this summary) [46][49]

Data Pipeline and Training
- Training proceeds in multiple stages to strengthen recognition and to support multi-region associative reasoning [57][59][61]
- GAR-Bench was created to systematically evaluate the region-level understanding of multimodal large language models (MLLMs) [64]

Performance Evaluation
- GAR models deliver superior results on various benchmarks, scoring highly on both single-region and multi-region understanding tasks [71][74]
- The results show GAR generating rich, accurate, and detailed local descriptions, establishing it as a state-of-the-art solution [77]

Zero-shot Transfer to Video Tasks
- GAR's capabilities extend to video, with strong zero-shot performance that even surpasses models trained specifically for video [79]
- The article closes with GAR's potential applications in training multimodal understanding models and improving adherence to complex text instructions [80][81]
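Of the mechanisms named above, RoI-Aligned Feature Replay is the most amenable to a concrete sketch: re-crop region features out of the global vision-encoder feature map, so local tokens stay tied to full-image context. A minimal PyTorch sketch using torchvision's roi_align, with the shapes, the 1/14 stride, and all names as assumptions for illustration rather than GAR's actual design:

```python
import torch
from torchvision.ops import roi_align

def replay_region_features(feature_map, boxes_xyxy, out_size=7, stride=14):
    """feature_map: (1, C, H, W) global features; boxes_xyxy: (N, 4) boxes
    in input-pixel coords. Returns (N, C, out_size, out_size) features."""
    # Prepend a batch index column: roi_align expects (N, 5) rois.
    rois = torch.cat([torch.zeros(len(boxes_xyxy), 1), boxes_xyxy], dim=1)
    # spatial_scale maps pixel-space boxes onto the downsampled feature grid.
    return roi_align(feature_map, rois, output_size=out_size,
                     spatial_scale=1.0 / stride, aligned=True)

# Toy usage: ViT-like 1/14-stride features of a 448x448 image, two regions.
feats = torch.randn(1, 1024, 32, 32)
boxes = torch.tensor([[0.0, 0.0, 140.0, 140.0], [210.0, 70.0, 420.0, 350.0]])
print(replay_region_features(feats, boxes).shape)  # torch.Size([2, 1024, 7, 7])
```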