IntentBench

Alibaba's multimodal reasoning model goes open source! It precisely captures hidden information in videos, and three killer features make AI better at reading social cues
Sou Hu Cai Jing· 2025-07-09 00:28
Core Insights
- Alibaba's Tongyi Lab has released the open-source multimodal reasoning model HumanOmniV2, which enhances understanding of multimodal information through forced contextual summarization and a multidimensional reward system [1][4][24]
- HumanOmniV2 achieves an accuracy of 69.33% on the IntentBench evaluation benchmark, which includes 633 videos and 2,689 related questions [4][24]

Group 1: Model Features and Performance
- HumanOmniV2 incorporates a forced contextual summarization mechanism and a GRPO-based optimization training method to improve the understanding of hidden information in images, videos, and audio (a minimal sketch of such a format check follows this summary) [1][20]
- The model's ability to analyze multimodal inputs allows it to provide nuanced answers, such as interpreting a woman's eye-rolling as a playful reaction rather than dissatisfaction [1]
- In various tests, HumanOmniV2 demonstrated superior performance in emotional state recognition compared to traditional models, identifying complex emotions like helplessness and anger [14][24]

Group 2: Challenges in Multimodal Reasoning
- Existing multimodal reasoning models face challenges such as insufficient global context understanding and simplistic reasoning paths, which can lead to incorrect answers [18][20]
- The model addresses these issues by integrating a comprehensive understanding of multimodal context, ensuring that critical information is not overlooked during reasoning [20][24]

Group 3: Training and Evaluation
- The development of HumanOmniV2 involved creating a large-scale multimodal reasoning training dataset that combines contextual information from images, videos, and audio [20][24]
- The IntentBench benchmark was introduced to evaluate the model's ability to understand complex human intentions and emotions, requiring deep contextual understanding and careful observation [20][24]

Group 4: Future Directions
- Alibaba plans to explore methods for multiple validations of multimodal information during reasoning to enhance accuracy as context length and pre-training scale increase [27]
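The forced contextual summarization mentioned above can be pictured as an output contract: the model must first summarize what it observed across modalities, then reason, then answer. Below is a minimal Python sketch of a format check for such a contract; the tag names (`<context>`, `<think>`, `<answer>`) and the scoring scheme are assumptions for illustration, not confirmed details of HumanOmniV2's prompt format.

```python
import re

# Illustrative sketch of a format check for a forced context-summarization
# output contract: the response must summarize the multimodal context first,
# then reason, then answer. Tag names are assumptions for this sketch,
# not confirmed details of the HumanOmniV2 prompt format.
FORMAT_PATTERN = re.compile(
    r"<context>.+?</context>\s*<think>.+?</think>\s*<answer>.+?</answer>",
    re.DOTALL,
)

def format_reward(response: str) -> float:
    """Return 1.0 if the rollout follows the context -> reasoning -> answer layout, else 0.0."""
    return 1.0 if FORMAT_PATTERN.fullmatch(response.strip()) else 0.0

# Example rollout that satisfies the contract (content paraphrased from the article's example).
rollout = (
    "<context>A woman rolls her eyes while smiling; the tone of the exchange is light.</context>"
    "<think>The smile and playful tone suggest teasing rather than annoyance.</think>"
    "<answer>The eye-roll is a playful reaction, not dissatisfaction.</answer>"
)
print(format_reward(rollout))  # 1.0
```

A check of this kind can serve as the "format" component of a reward signal, forcing every rollout to surface its context summary before any reasoning or final answer is produced.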
Breaking through the boundaries of omni-modal AI understanding: introducing contextual reinforcement learning to take omni-modal models' "intent" reasoning to new heights
量子位 (QbitAI)· 2025-07-08 07:30
Core Viewpoint
- The article emphasizes the increasing need for deep understanding and analysis of human intent in the context of multimodal large language models (MLLMs) and highlights the challenges of applying reinforcement learning (RL) effectively to complex multimodal data and formats [1][4]

Group 1: Challenges in Multimodal Reasoning
- Insufficient global context understanding leads to incorrect answers when models fail to identify, or misinterpret, multimodal evidence and contextual information [3]
- The shortcut problem arises when models overlook key clues and provide answers without fully considering multimodal information, resulting in suboptimal or partial outcomes [4]

Group 2: Innovations and Advantages
- HumanOmniV2 introduces a mandatory context summarization before reasoning, ensuring models do not skip critical multimodal input and providing comprehensive global background support [12]
- A multidimensional reward mechanism is implemented, including a context reward, a format reward, and an accuracy reward, to guide models toward accurately understanding multimodal context [13][14]
- The model is encouraged to perform complex logical reasoning by evaluating whether the reasoning process successfully integrates multimodal information and employs advanced logical analysis techniques [15]

Group 3: Model Design and Training Strategies
- The model is based on Qwen2.5-Omni-Thinker, with improvements to the Group Relative Policy Optimization (GRPO) method to enhance training efficiency, fairness, and robustness (see the sketch following this summary) [19][20]
- Token-level loss is introduced to address the imbalance in long-sequence training, ensuring balanced optimization for each token [19]
- The removal of question-level normalization terms promotes consistency in the optimization process across different problem difficulties [19]
- Dynamic KL divergence is utilized to enhance exploration capabilities and training stability throughout the training cycle [20]

Group 4: High-Quality Datasets and Benchmarks
- A comprehensive multimodal reasoning training dataset has been created, incorporating image, video, and audio understanding tasks with rich contextual information [23]
- IntentBench, a new multimodal benchmark, evaluates models' abilities to understand human behavior and intent in complex scenarios, featuring 633 videos and 2,689 related questions [23]

Group 5: Experimental Results
- HumanOmniV2 achieved breakthrough results across multiple benchmark datasets, attaining 58.47% on Daily-Omni, 47.1% on WorldSense, and 69.33% on the newly introduced IntentBench, outperforming existing open-source multimodal models [24]
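Group 3 above names three concrete GRPO adjustments: a token-level loss, removal of the question-level normalization term, and a dynamic KL coefficient. The following is a minimal PyTorch sketch of how those pieces could fit together in one objective; the function name, tensor shapes, and cosine KL schedule are assumptions for illustration, not the released HumanOmniV2 training code.

```python
import math
import torch

def grpo_token_loss(logp_new, logp_old, logp_ref, rewards, mask,
                    step, total_steps, clip_eps=0.2, kl_max=0.05):
    """Sketch of a GRPO-style objective with the adjustments described above.

    logp_new, logp_old, logp_ref: [G, T] per-token log-probs for G rollouts of one prompt
    rewards: [G] scalar reward per rollout (e.g. a sum of context, format, and accuracy terms)
    mask:    [G, T] 1 for response tokens, 0 for padding
    """
    # Group-relative advantage: center rewards within the rollout group.
    # Dividing by the group's reward std (a question-level normalization)
    # is omitted here, mirroring the modification described in the article.
    adv = (rewards - rewards.mean()).unsqueeze(-1)                 # [G, 1]

    # PPO-style clipped ratio, applied per token.
    ratio = torch.exp(logp_new - logp_old)                         # [G, T]
    policy_term = -torch.min(ratio * adv,
                             torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv)

    # Dynamic KL coefficient: a simple cosine decay over training is assumed here;
    # the article only states that the KL weight varies across the training cycle.
    kl_coef = kl_max * 0.5 * (1 + math.cos(math.pi * step / total_steps))
    kl_term = kl_coef * (logp_new - logp_ref)                      # rough per-token KL estimate

    # Token-level aggregation: sum over every response token in the group and divide
    # by the total token count, so long rollouts are not down-weighted by a
    # per-sequence average.
    return ((policy_term + kl_term) * mask).sum() / mask.sum()

# Example with random tensors: 4 rollouts of 16 tokens each.
G, T = 4, 16
logp = torch.randn(G, T)
loss = grpo_token_loss(logp, logp.detach(), logp.detach(),
                       rewards=torch.tensor([1.0, 0.5, 0.0, 1.5]),
                       mask=torch.ones(G, T), step=100, total_steps=1000)
print(loss.item())
```

The token-level denominator is the key difference from a per-sequence mean: long, information-dense rollouts contribute in proportion to their length instead of being averaged down to the weight of a short one.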