Multimodal Large Language Models (MLLM)
AAAI 2026 Oral | InfiGUI-G1 Arrives, Setting a New GUI Grounding SOTA
机器之心· 2026-01-05 06:09
Core Insights
- The article discusses advances in multimodal large language models (MLLMs) and the challenges of GUI grounding, particularly the distinction between spatial alignment and semantic alignment [2][6][7]
- A new framework called Adaptive Exploration Policy Optimization (AEPO) is introduced, which enhances the performance of the InfiGUI-G1 model on GUI grounding tasks [2][14]

Group 1: GUI Grounding Challenges
- GUI grounding involves mapping natural language commands to specific screen elements, which can be broken down into spatial alignment (accurate positioning) and semantic alignment (correct element identification) [6][7]
- Existing methods, particularly those based on Reinforcement Learning with Verifiable Rewards (RLVR), excel at spatial alignment but struggle with semantic alignment due to issues like the "confidence trap," where models repeatedly make high-confidence but incorrect predictions [8][10]

Group 2: InfiGUI-G1 Model and AEPO Framework
- The InfiGUI-G1 model, developed by a research team from Zhejiang University, Hong Kong Polytechnic University, and InfiX.ai, uses AEPO to overcome the exploration inefficiencies of traditional RL methods [2][14]
- AEPO consists of three core components (a toy sketch follows this summary):
  1. Multi-Answer Generation, which lets the model generate multiple candidate coordinates in a single pass, increasing the likelihood of covering the correct answer [15]
  2. Adaptive Exploration Reward (AER), which scores the generated answers on efficiency principles [16]
  3. Collinear Penalty, which discourages the model from emitting geometrically aligned points, ensuring diverse exploration of the semantic space [16]

Group 3: Performance Evaluation
- InfiGUI-G1 has been evaluated on challenging benchmarks such as MMBench-GUI, ScreenSpot-Pro, and UI-Vision, demonstrating superior performance compared to existing models, including ones with significantly larger parameter counts [19]
- Notably, InfiGUI-G1-7B outperformed models like Qwen2.5-VL-72B and GPT-4o on several metrics, showcasing its effectiveness on semantic-understanding tasks [19]
- The model showed over 60% improvement on difficult samples, indicating an ability to uncover knowledge previously missed due to exploration limitations [20]

Group 4: Conclusion and Future Outlook
- The success of InfiGUI-G1 shows that the performance bottleneck of GUI agents lies not only in visual recognition but also in effective reinforcement learning strategies for the semantic alignment problem [23]
- The adaptive exploration mechanism lets InfiGUI-G1 achieve superior GUI grounding at a smaller model size, laying a solid foundation for more intelligent GUI interaction assistants [23][24]
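
The AEPO reward components lend themselves to a compact illustration. The paper's exact reward shaping is not given in this summary, so the following toy sketch assumes a simple efficiency-scaled hit reward (1/k if the k-th candidate is the first to land in the target box) and a hard zero for collinear candidate layouts; all function names are hypothetical.

```python
import numpy as np
from itertools import combinations

def point_in_box(p, box):
    """box = (x1, y1, x2, y2) of the ground-truth element; p = (x, y)."""
    x, y = p
    x1, y1, x2, y2 = box
    return x1 <= x <= x2 and y1 <= y <= y2

def has_collinear_triple(points, tol=1e-3):
    """True if any three candidates are (near-)collinear, tested via the
    cross product normalized by the two edge lengths."""
    for a, b, c in combinations(points, 3):
        cross = abs((b[0]-a[0])*(c[1]-a[1]) - (b[1]-a[1])*(c[0]-a[0]))
        norm = max(np.hypot(b[0]-a[0], b[1]-a[1]) *
                   np.hypot(c[0]-a[0], c[1]-a[1]), 1e-9)
        if cross / norm < tol:
            return True
    return False

def aepo_style_reward(candidates, gt_box):
    """Toy AER: 1/k if the k-th candidate is the first hit (fewer guesses
    score higher); zeroed outright for collinear layouts."""
    if has_collinear_triple(candidates):
        return 0.0  # collinear penalty: degenerate, non-diverse exploration
    for k, p in enumerate(candidates, start=1):
        if point_in_box(p, gt_box):
            return 1.0 / k
    return 0.0

# the second of three candidate clicks lands inside the target element -> 0.5
print(aepo_style_reward([(10, 10), (52, 48), (90, 90)], (40, 40, 60, 60)))
```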
The Next Generation of Object Detection: 3B-Parameter MLLM Rex-Omni Surpasses Grounding DINO for the First Time, Unifying 10+ Vision Tasks
机器之心· 2025-11-13 08:26
Core Insights
- The article discusses the breakthrough of the Rex-Omni model, which surpasses traditional coordinate-regression detectors in object localization accuracy, addressing long-standing criticisms of multimodal large language models (MLLMs) [2][4]

Group 1: Model Design and Innovations
- Rex-Omni integrates all visual perception tasks into a unified "next point prediction" framework, using an efficient 4-token coordinate encoding (sketched after this summary) and a two-stage SFT + GRPO training process [4][11]
- The model's output format combines quantized coordinates with special tokens, allowing various geometric outputs to be represented efficiently [13][14]
- Rex-Omni employs multiple data engines (Grounding, Referring, Pointing, and OCR) to generate high-quality training signals, strengthening its semantic understanding and spatial reasoning [16][17]

Group 2: Training Methodology
- The two-stage approach of SFT (Supervised Fine-Tuning) followed by GRPO (Group Relative Policy Optimization) is crucial for achieving high localization accuracy and correcting behavioral deficiencies [19][21]
- GRPO introduces geometric reward functions, letting the model learn from its own generated sequences and yielding significant performance gains with minimal additional training steps [19][21]

Group 3: Performance Evaluation
- In zero-shot evaluations on core detection benchmarks such as COCO and LVIS, Rex-Omni demonstrates superior performance, achieving an F1-score that surpasses traditional models like Grounding DINO [20][22]
- The model excels at dense and small-object detection, achieving the highest F1@mIoU among MLLMs and showcasing refined spatial localization [27][28]
- Rex-Omni's unified framework also handles referring object detection effectively, outperforming traditional open-set detectors [31][34]

Group 4: Conclusion and Future Implications
- Rex-Omni represents a significant advance for MLLMs in visual perception, showing they can overcome geometric and behavioral limitations to combine precise geometric perception with robust language understanding [45]
- The model sets a new performance benchmark in the MLLM field and points to a promising direction for next-generation object detection models [45]
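
The 4-token coordinate encoding can be illustrated with plain quantization arithmetic. A minimal sketch, assuming 1,000 discrete bins per axis; the actual bin count and token vocabulary in Rex-Omni may differ.

```python
def box_to_tokens(box, img_w, img_h, bins=1000):
    """Quantize an (x1, y1, x2, y2) pixel box into 4 discrete coordinate
    tokens, one per value, in 'next point prediction' style."""
    x1, y1, x2, y2 = box
    scale = lambda v, s: min(bins - 1, max(0, int(v / s * bins)))
    return [scale(x1, img_w), scale(y1, img_h),
            scale(x2, img_w), scale(y2, img_h)]

def tokens_to_box(tokens, img_w, img_h, bins=1000):
    """Invert the quantization back to approximate pixel coordinates
    (bin centers), with sub-0.1% round-trip error per axis."""
    t_x1, t_y1, t_x2, t_y2 = tokens
    unscale = lambda t, s: (t + 0.5) / bins * s
    return (unscale(t_x1, img_w), unscale(t_y1, img_h),
            unscale(t_x2, img_w), unscale(t_y2, img_h))

tokens = box_to_tokens((120.0, 64.0, 480.0, 360.0), img_w=1920, img_h=1080)
print(tokens)                             # [62, 59, 250, 333]
print(tokens_to_box(tokens, 1920, 1080))  # recovered box, ~1 px error
```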
Fei-Fei Li's Latest Long-Form Essay: AI's Next Decade, Building Machines with True Spatial Intelligence
机器之心· 2025-11-10 23:47
Core Insights
- The article emphasizes the importance of spatial intelligence as the next frontier in AI, highlighting its potential to transform fields such as storytelling, creativity, robotics, and scientific discovery [5][6][10]

Summary by Sections

What is Spatial Intelligence?
- Spatial intelligence is defined as a fundamental aspect of human cognition that enables interaction with the physical world, influencing everyday actions and creative processes [10][13]
- It is essential for tasks ranging from simple activities like parking a car to complex scenarios such as emergency response [10][11]

Importance of Spatial Intelligence
- The article argues that spatial intelligence is crucial for understanding and manipulating the world, serving as a scaffold for human cognition [13][15]
- Current AI technologies, while advanced, still lack the spatial reasoning capabilities inherent to humans, limiting their effectiveness in real-world applications [14][15]

Building Spatial Intelligence in AI
- To create AI with spatial intelligence, a new type of generative model called the "world model" is proposed, one that can understand, reason, generate, and interact within complex environments [17][18]
- The world model should possess three core capabilities: generative, multimodal, and interactive [18][19][20]

Challenges Ahead
- The development of world models faces significant challenges, including the need for new training tasks, large-scale data, and innovative model architectures [23][24][25]
- Representing the physical world in AI is far more complex than representing language, necessitating breakthroughs in both technology and theory [21][22]

Applications of Spatial Intelligence
- In creativity, spatial intelligence can enhance storytelling and immersive experiences, allowing creators to build and iterate on 3D worlds more efficiently [32][33]
- In robotics, spatial intelligence is essential for machines to understand and interact with their environments, improving their learning and operational capabilities [34][35][36]
- The potential impact extends to science, medicine, and education, where spatial intelligence can facilitate breakthroughs and enhance learning experiences [38][39][40]

Conclusion
- The article concludes that the pursuit of spatial intelligence in AI represents a significant opportunity to enhance human capabilities and address complex challenges, ultimately benefiting society as a whole [42]
FSDrive Unifies VLA and World Models, Pushing Autonomous Driving Toward Visual Reasoning
36Kr· 2025-09-30 10:36
Core Insights
- FSDrive introduces a "Spatio-Temporal Chain-of-Thought" (CoT) that lets models reason directly with images, addressing the limitations of existing methods that rely heavily on symbolic representations [1][4][17]

Group 1: Methodology
- The proposed method uses a unified future image frame as an intermediate reasoning step, integrating future scenarios and perception results for visual reasoning [4][17]
- FSDrive activates image-generation capabilities in existing multimodal large language models (MLLMs) by expanding the vocabulary with visual tokens (a sketch follows this summary), avoiding major architectural changes [5][17]
- The approach employs a progressive visual CoT, starting with coarse-grained perception maps (lane lines and 3D boxes) and gradually refining to detailed future frames, explicitly incorporating physical constraints [5][8]

Group 2: Performance Metrics
- FSDrive demonstrates superior trajectory planning, achieving a lower average L2 error (0.53 vs. 0.70) and collision rate (0.19 vs. 0.21) than Doe-1 [9]
- For future-frame generation quality, FSDrive achieves an FID score of 10.1, outperforming many diffusion-based world models while remaining real-time capable [11]
- The model also performs strongly on scene understanding, with a final score of 0.57, surpassing competitors like OmniDrive [14]

Group 3: Applications and Implications
- FSDrive's dual role as a "world model" for future-frame generation and an "inverse dynamics model" for trajectory planning enhances its interpretability and decision-making [8][16]
- The framework's ability to reduce potential collisions through visual reasoning reflects its practical applicability in real-world autonomous driving scenarios [16][17]
- The method's efficiency enables significant data and computational cost savings, making it a competitive option in the evolving landscape of autonomous-driving technologies [17]
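
FSDrive's base model and codebook size are not specified in this summary, so the following is a minimal sketch of the "expand the vocabulary with visual tokens" step using Hugging Face transformers, with a hypothetical stand-in checkpoint and an assumed 8,192-entry visual codebook.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "Qwen/Qwen2-0.5B"  # stand-in base LM, not FSDrive's actual checkpoint
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# one new token per entry of the assumed codebook, plus image delimiters
visual_tokens = [f"<img_{i}>" for i in range(8192)]
num_added = tokenizer.add_tokens(visual_tokens + ["<boi>", "<eoi>"])

# grow the embedding matrix (and tied LM head) so the model can read and
# predict the new tokens; only these new rows need training, the rest of
# the architecture is untouched
model.resize_token_embeddings(len(tokenizer))
print(f"added {num_added} tokens; new vocab size = {len(tokenizer)}")
```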
NeurIPS 2025 Spotlight | FSDrive Unifies VLA and World Models, Pushing Autonomous Driving Toward Visual Reasoning
机器之心· 2025-09-30 08:45
Core Insights
- The article introduces FSDrive, a novel approach that uses a "Spatio-Temporal Chain-of-Thought" (CoT) to enhance visual reasoning in autonomous driving, moving away from traditional symbolic logic toward a more intuitive process of visual simulation and imagination [7][28]

Group 1: Methodology and Innovations
- FSDrive proposes a unified "visual intermediary" that replaces text or tabular mediators, effectively eliminating cross-modal semantic gaps [8]
- The method activates image-generation capabilities on existing multimodal large language models (MLLMs) at minimal cost by expanding the vocabulary to include visual tokens, avoiding major architectural changes or extensive retraining [8][19]
- A progressive visual CoT is employed, starting with coarse-grained perception maps (lane lines and 3D boxes) and gradually generating detailed future frames, explicitly injecting physical realism [8][19]

Group 2: Performance and Metrics
- FSDrive demonstrates competitive performance in trajectory planning and scene understanding, achieving an average L2 error of 0.53 and a collision rate of 0.19, outperforming existing methods like UniAD (the metric is sketched after this summary) [29][22]
- Future-frame generation quality is indicated by an FID score of 10.1 at a resolution of 128×192, surpassing many diffusion-based world models [22]
- On scene-understanding tasks, FSDrive achieves a final score of 0.57, exceeding other recent methods and showcasing the effectiveness of its unified pre-training approach [25]

Group 3: Practical Applications and Future Directions
- FSDrive maintains a simple end-to-end pipeline and interpretable visual reasoning while leveraging large amounts of unannotated video data to learn world-evolution patterns [9]
- The framework is adaptable to mainstream MLLMs, indicating its potential for broad application across the autonomous-driving industry [20]
- Future developments may extend the model to predict a unified panoramic view while addressing safety, privacy, and regulatory compliance as the technology matures [30]
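
The average L2 error cited above is the standard open-loop planning metric on nuScenes-style evaluations. A minimal sketch, assuming 2 Hz waypoints over a 3 s horizon and the cumulative-mean averaging convention (conventions differ slightly across papers):

```python
import numpy as np

def avg_l2_error(pred, gt, horizons=(2, 4, 6)):
    """Mean L2 distance (meters) between predicted and ground-truth ego
    waypoints up to 1s/2s/3s (indices assume 2 Hz waypoint spacing)."""
    pred, gt = np.asarray(pred), np.asarray(gt)
    per_horizon = [np.linalg.norm(pred[:h] - gt[:h], axis=-1).mean()
                   for h in horizons]
    return per_horizon, float(np.mean(per_horizon))

# toy trajectories: six (x, y) waypoints over 3 s at 2 Hz
gt   = [(0, 1.0), (0, 2.1), (0, 3.3), (0, 4.6), (0, 6.0), (0, 7.5)]
pred = [(0.1, 1.0), (0.2, 2.0), (0.3, 3.2), (0.5, 4.4), (0.7, 5.8), (0.9, 7.2)]
per_h, avg = avg_l2_error(pred, gt)
print([round(v, 3) for v in per_h], round(avg, 3))
```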
Surpassing YOLOv3 with a Multimodal LLM! Reinforcement Learning Breaks Through the Limits of Multimodal Perception | Open Source
量子位· 2025-05-03 04:05
Core Insights
- The article introduces Perception-R1 (PR1), a groundbreaking multimodal large language model (MLLM) framework that surpasses classic detectors like YOLOv3 and Faster R-CNN, achieving over 30 AP on the COCO2017 validation set [1][16]

Group 1: Introduction of Perception-R1
- Perception-R1 is developed by research teams from Huazhong University of Science and Technology, Beijing University of Posts and Telecommunications, and others, focusing on enhancing visual reasoning through rule-based reinforcement learning (RL) [1][5]
- The model targets both pure visual tasks, such as counting and general object detection, and visual-language tasks like grounding and OCR [1][4]

Group 2: Importance of Visual Perception in AI
- The article emphasizes the need for a revolution in AI visual perception, as the ability to understand visual information is crucial for applications from autonomous driving to medical diagnostics [3][4]
- It points out the subtle gap between recognizing objects and understanding their interactions in detail, noting that current MLLMs often struggle with complex visual reasoning tasks [4]

Group 3: Role of Reinforcement Learning
- The rise of reinforcement learning, particularly RLHF (Reinforcement Learning from Human Feedback) and rule-based RL, has been transformative for language models, prompting the development of Perception-R1 [5][6]
- The article asks whether RL can similarly enhance MLLMs' visual perception, noting that early attempts have shown promise but not universal success [6]

Group 4: Perception-R1 Framework
- Perception-R1 is not a new MLLM trained from scratch but a post-training framework designed to significantly enhance the visual perception abilities of existing capable MLLMs [7]
- It employs Group Relative Policy Optimization (GRPO) to optimize the perception policy, which is crucial for improving visual task performance (a toy sketch follows this summary) [9]

Group 5: Reward Engineering
- Reward modeling is central to the RL setup: the reward function guides learning by quantifying the model's performance on visual tasks [11]
- Perception-R1's reward structure covers extracting relevant visual details, executing logical operations grounded in visual understanding, and producing outputs in the correct format [11][17]

Group 6: Experimental Results
- Perception-R1 is evaluated against strong baselines and specialized models, demonstrating significant improvements in visual counting and object detection [16][19]
- In visual counting, for instance, Perception-R1 achieved 78.1 on Pixmo-Count, outperforming the other models tested [19]

Group 7: Scalability and Future Implications
- The article concludes that Perception-R1 lays a critical foundation for future advances in intelligent AI visual perception, suggesting its principles could play a key role in next-generation perceptual AI systems [24][25]
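
GRPO's defining trick, computing advantages relative to a group of sampled rollouts instead of using a learned critic, pairs naturally with rule-based rewards such as IoU. A toy sketch assuming a reward of IoU plus a small format bonus; Perception-R1's actual reward design is richer.

```python
import numpy as np

def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: max(0.0, r[2] - r[0]) * max(0.0, r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def grpo_advantages(rewards):
    """Each rollout's advantage is its reward standardized against the
    group mean/std; no value network is needed."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# one group of 4 sampled rollouts for the same image/query:
# (predicted box or None if unparseable, format bonus flag)
gt = (40, 40, 120, 120)
rollouts = [((38, 42, 118, 119), 1.0),  # well-formatted, accurate
            ((60, 60, 160, 160), 1.0),  # well-formatted, sloppy box
            ((0, 0, 10, 10), 1.0),      # well-formatted, wrong object
            (None, 0.0)]                # unparseable output: no bonus
rewards = [(iou(box, gt) if box else 0.0) + 0.1 * fmt for box, fmt in rollouts]
print(np.round(grpo_advantages(rewards), 3))
```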
An AR Intelligence Revolution! The Satori System Reads Human Intent, Bringing Sci-Fi Movie Scenes to Life
机器之心· 2025-04-28 01:26
In countless science-fiction films, augmented reality (AR) overlays animations, text, graphics, and other visual information before people's eyes, giving them timely information beyond their own perceptual abilities. Whether it is a surgeon operating with AR glasses, routine inspections on a smart-factory production line, or the superpower of rapidly searching and skimming a book through AR, it all serves one ultimate purpose: assisting us with well-timed information.

To this day, however, most AR assistance still depends on a remote human operator, far from the intelligent, comprehending, and scalable AR assistance we expect. This has limited AR's adoption in major industries and everyday applications. Making AR truly understand the user, understand the environment, and assist at the right moment remains a huge challenge.

Demonstration of the Satori system automatically recognizing a user weighing 11 g of coffee

All of this is about to change with the arrival of the Satori system. Researchers from the NYU Data and Visualization Lab (NYU VIDA), together with Adobe, have fused multimodal large language models (MLLM) with the cognitive BDI framework (belief-desire-intention theory), letting AI genuinely understand the user's actions, goals, and environment state for the first time, so that it can automatically adapt instruction content and steps to different scenarios and judge when to assist, bringing AR assistance into an intelligent core ...
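
As a rough illustration of how a belief-desire-intention state might be folded into an MLLM query to decide when to assist, here is a purely hypothetical structure; Satori's actual implementation is not described in this excerpt.

```python
from dataclasses import dataclass

@dataclass
class BDIState:
    """Belief-desire-intention snapshot the assistant maintains per frame."""
    belief: str     # what the system thinks is happening in the scene
    desire: str     # the user's inferred overall goal
    intention: str  # the concrete next step the user appears to be taking

def build_assist_prompt(state: BDIState, frame_caption: str) -> str:
    """Fold the BDI state and current frame description into one MLLM query
    asking whether (and what) AR guidance should be shown right now."""
    return (
        f"Scene: {frame_caption}\n"
        f"Belief: {state.belief}\nDesire: {state.desire}\n"
        f"Intention: {state.intention}\n"
        "Question: Should an AR hint be displayed at this moment? "
        "If yes, answer with the single next instruction; if no, answer 'wait'."
    )

# mirrors the article's coffee-weighing demonstration
state = BDIState(
    belief="scale reads 8 g and is still climbing",
    desire="brew pour-over coffee",
    intention="weigh out 11 g of coffee beans",
)
print(build_assist_prompt(state, "user pours beans onto a kitchen scale"))
```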
AI Can Read Images but Struggles to Judge Distance: SJTU's Spatial-Temporal Intelligence Benchmark Stumps 9 Top Multimodal Models
量子位· 2025-04-15 03:54
Core Insights
- The article discusses the increasing application of multimodal large language models (MLLMs) in embodied intelligence and autonomous driving, questioning their readiness to understand complex physical environments [1][2]
- The Spatial-Temporal Intelligence Benchmark (STI-Bench) is introduced to challenge current MLLMs on their precise spatial-temporal understanding capabilities [1][4]

Group 1: MLLM Capabilities
- MLLMs have shown significant achievements in visual-language understanding but need to surpass traditional semantic understanding to possess accurate spatial-temporal intelligence [2]
- Core tasks in AI applications, such as autonomous driving and robotic operations, require quantitative spatial-temporal understanding, which is currently a weak point for existing models [3][19]

Group 2: STI-Bench Overview
- STI-Bench is designed to evaluate models using real-world video inputs, focusing on precise and quantitative spatial-temporal understanding [4]
- The benchmark includes over 300 real-world videos covering three typical scenarios: desktop operations (millimeter scale), indoor environments (centimeter scale), and outdoor scenes (decimeter scale) [6]

Group 3: Evaluation Metrics
- The evaluation consists of eight tasks divided into two dimensions: static spatial understanding (measuring scale, spatial relationships, and 3D video localization) and dynamic temporal understanding (displacement, speed, acceleration, ego orientation, trajectory description, and pose estimation) [6]
- The dataset also includes over 2,000 high-quality question-answer pairs, ensuring accuracy and relevance to the corresponding scenes (a scoring sketch follows this summary) [8]

Group 4: Experimental Results
- The evaluation of leading MLLMs, including proprietary models like GPT-4o and Gemini-2.5-Pro, revealed overall poor performance, with the best models achieving less than 42% accuracy, only slightly above random guessing [12][20]
- Qwen2.5-VL-72B emerged as a standout, outperforming all proprietary models and providing a boost to the open-source community [13]

Group 5: Error Analysis
- The research identified three core bottlenecks in MLLMs: inaccurate estimation of quantitative spatial attributes, deficient understanding of temporal dynamics, and weak cross-modal integration [15][16][17]
- These issues highlight significant gaps in MLLMs' precise spatial-temporal understanding and indicate directions for future research [19][20]

Group 6: Conclusion
- The STI-Bench results clearly indicate the serious shortcomings of current MLLMs in precise spatial-temporal understanding, which is essential for their application in embodied intelligence and autonomous driving [20][21]
- The release of STI-Bench provides a new benchmark for assessing and improving MLLMs' spatial-temporal understanding capabilities, guiding researchers toward potential solutions [21]
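
Benchmarks of this kind typically score multiple-choice QA by exact option match. A minimal harness sketch with invented toy items in STI-Bench's spirit; the benchmark's real item format and answer-extraction rules may differ.

```python
import re

def extract_choice(response: str) -> str | None:
    """Pull the first standalone option letter (A-D) out of a model reply."""
    m = re.search(r"\b([A-D])\b", response.strip())
    return m.group(1) if m else None

def score(qa_pairs, model_fn):
    """Accuracy over multiple-choice QA; unparseable replies count as wrong."""
    correct = sum(extract_choice(model_fn(qa["question"])) == qa["answer"]
                  for qa in qa_pairs)
    return correct / len(qa_pairs)

# toy items in the style of a quantitative spatial-temporal benchmark
qa_pairs = [
    {"question": "How far (m) does the ego vehicle travel between t=0s and "
                 "t=2s? A. 3  B. 8  C. 15  D. 30", "answer": "B"},
    {"question": "What is the distance between the cup and the keyboard? "
                 "A. 2 cm  B. 12 cm  C. 45 cm  D. 1.2 m", "answer": "B"},
]
mock_model = lambda q: "The answer is B."  # stand-in for a real MLLM call
print(score(qa_pairs, mock_model))         # 1.0
```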