Multimodal Large Language Models (MLLM)
A Next-Generation Object Detection Model: 3B-Parameter MLLM Rex-Omni Surpasses Grounding DINO for the First Time and Unifies 10+ Vision Tasks
机器之心· 2025-11-13 08:26
Core Insights
- The article discusses the breakthrough of the Rex-Omni model, which surpasses traditional coordinate-regression detectors in object localization accuracy, addressing long-standing criticisms of multimodal large language models (MLLM) [2][4].

Group 1: Model Design and Innovations
- Rex-Omni integrates all visual perception tasks into a unified "next point prediction" framework, using an efficient 4-token coordinate encoding and a two-stage SFT plus GRPO reinforcement learning training process [4][11].
- The model's output format combines quantized coordinates with special tokens, allowing various geometric outputs to be represented efficiently [13][14].
- Rex-Omni employs multiple data engines (Grounding, Referring, Pointing, and OCR) to generate high-quality training signals, enhancing its semantic understanding and spatial reasoning capabilities [16][17].

Group 2: Training Methodology
- The two-stage training approach of SFT (supervised fine-tuning) followed by GRPO (Group Relative Policy Optimization) with geometry-aware rewards is crucial for achieving high localization accuracy and correcting behavioral deficiencies [19][21].
- GRPO introduces geometric reward functions, enabling the model to learn from its own generated sequences and markedly improving performance with minimal additional training steps [19][21].

Group 3: Performance Evaluation
- In zero-shot evaluations on core detection benchmarks such as COCO and LVIS, Rex-Omni achieves an F1-score that surpasses traditional detectors such as Grounding DINO [20][22].
- The model excels at dense and small-object detection, achieving the highest F1@mIoU among MLLMs and showcasing refined spatial localization capabilities [27][28].
- Rex-Omni's unified framework allows it to handle a wide range of visual perception tasks, outperforming traditional open-set detectors in referring object detection [31][34].

Group 4: Conclusion and Future Implications
- Rex-Omni represents a significant advance for MLLMs in visual perception, showing that they can overcome geometric and behavioral limitations to achieve precise geometric perception alongside robust language understanding [45].
- The model sets a new performance benchmark in the MLLM field and points to a promising direction for next-generation object detection models [45].
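The "next point prediction" framing reduces every geometric output to a short sequence of discrete coordinate tokens. Below is a minimal sketch of how a box could be quantized into four such tokens and decoded back; the bin count (1000) and the `<coord_k>` token naming are illustrative assumptions rather than Rex-Omni's exact specification.

```python
# Minimal sketch of quantized coordinate tokens for "next point prediction".
# The bin count (1000) and the <coord_k> naming are illustrative assumptions.

NUM_BINS = 1000  # assumed quantization resolution per axis

def encode_box(box, img_w, img_h, num_bins=NUM_BINS):
    """Quantize an (x0, y0, x1, y1) pixel box into 4 discrete coordinate tokens."""
    x0, y0, x1, y1 = box
    norm = [x0 / img_w, y0 / img_h, x1 / img_w, y1 / img_h]
    bins = [min(int(v * num_bins), num_bins - 1) for v in norm]
    return [f"<coord_{b}>" for b in bins]

def decode_box(tokens, img_w, img_h, num_bins=NUM_BINS):
    """Map 4 coordinate tokens back to an approximate pixel-space box."""
    bins = [int(t.strip("<>").split("_")[1]) for t in tokens]
    centers = [(b + 0.5) / num_bins for b in bins]  # use bin centers
    return (centers[0] * img_w, centers[1] * img_h,
            centers[2] * img_w, centers[3] * img_h)

if __name__ == "__main__":
    toks = encode_box((48, 32, 320, 240), img_w=640, img_h=480)
    print(toks)                        # exactly 4 tokens per box
    print(decode_box(toks, 640, 480))  # approximate round trip
```

Because each box costs only four tokens, dense outputs stay short, which helps keep a single autoregressive decoder practical across many localization tasks.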
Fei-Fei Li's Latest Long-Form Essay: The Next Decade of AI Is About Building Machines with True Spatial Intelligence
机器之心· 2025-11-10 23:47
Core Insights
- The article emphasizes the importance of spatial intelligence as the next frontier in AI, highlighting its potential to transform various fields such as storytelling, creativity, robotics, and scientific discovery [5][6][10].

Summary by Sections

What is Spatial Intelligence?
- Spatial intelligence is defined as a fundamental aspect of human cognition that enables interaction with the physical world, influencing everyday actions and creative processes [10][13].
- It is essential for tasks ranging from simple activities like parking a car to complex scenarios such as emergency response [10][11].

Importance of Spatial Intelligence
- The article argues that spatial intelligence is crucial for understanding and manipulating the world, serving as a scaffold for human cognition [13][15].
- Current AI technologies, while advanced, still lack the spatial reasoning capabilities inherent to humans, limiting their effectiveness in real-world applications [14][15].

Building Spatial Intelligence in AI
- To create AI with spatial intelligence, a new type of generative model called "world models" is proposed, which can understand, reason, generate, and interact within complex environments [17][18].
- The world model should possess three core capabilities: generative, multimodal, and interactive [18][19][20].

Challenges Ahead
- The development of world models faces significant challenges, including the need for new training tasks, large-scale data, and innovative model architectures [23][24][25].
- The complexity of representing the physical world in AI is much greater than that of language, necessitating breakthroughs in technology and theory [21][22].

Applications of Spatial Intelligence
- In creativity, spatial intelligence can enhance storytelling and immersive experiences, allowing creators to build and iterate on 3D worlds more efficiently [32][33].
- In robotics, spatial intelligence is essential for machines to understand and interact with their environments, improving their learning and operational capabilities [34][35][36].
- The potential impact extends to fields like science, medicine, and education, where spatial intelligence can facilitate breakthroughs and enhance learning experiences [38][39][40].

Conclusion
- The article concludes that the pursuit of spatial intelligence in AI represents a significant opportunity to enhance human capabilities and address complex challenges, ultimately benefiting society as a whole [42].
FSDrive Unifies VLA and World Models, Pushing Autonomous Driving Toward Visual Reasoning
36Ke· 2025-09-30 10:36
Core Insights
- FSDrive introduces a "Spatio-Temporal Chain-of-Thought" (CoT) that allows models to reason directly with images, addressing the limitations of existing methods that rely heavily on symbolic representations [1][4][17].

Group 1: Methodology
- The proposed method uses a unified future image frame as an intermediary reasoning step, integrating future scenarios and perception results for visual reasoning [4][17].
- FSDrive activates image generation capabilities in existing Multi-Modal Large Language Models (MLLM) by expanding the vocabulary with visual tokens, avoiding major architectural changes [5][17].
- The approach employs a progressive visual CoT, starting with coarse-grained perception maps (lane lines and 3D boxes) and gradually refining them into detailed future frames, explicitly incorporating physical constraints [5][8].

Group 2: Performance Metrics
- FSDrive demonstrates superior performance in trajectory planning, achieving a lower average L2 error (0.53 vs 0.70) and a lower collision rate (0.19 vs 0.21) compared with Doe-1 [9].
- For future frame generation quality, FSDrive achieves an FID score of 10.1, outperforming many diffusion-based world models while maintaining real-time capability [11].
- The model also shows strong results in scene understanding, with a final score of 0.57, surpassing competitors such as OmniDrive [14].

Group 3: Applications and Implications
- FSDrive's dual role as a "world model" for future frame generation and as an "inverse dynamics model" for trajectory planning enhances its interpretability and decision-making capabilities [8][16].
- The framework's ability to reduce potential collisions through visual reasoning reflects its practical applicability in real-world autonomous driving scenarios [16][17].
- The method's efficiency allows for significant data and computational cost savings, making it a competitive option in the evolving landscape of autonomous driving technologies [17].
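For context on the planning numbers quoted above, here is a minimal sketch of how average L2 error and collision rate are commonly computed for predicted ego trajectories. The six-waypoint horizon and the simplified collision check are assumptions for illustration, not FSDrive's evaluation code.

```python
# Hedged sketch: average L2 error and collision rate for planned trajectories.
# Horizon length, sampling, and the collision test are simplified assumptions.
import numpy as np

def avg_l2_error(pred, gt):
    """Mean Euclidean distance between predicted and ground-truth waypoints.
    pred, gt: arrays of shape (T, 2) with BEV (x, y) positions in meters."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

def collision_rate(pred_batch, collides):
    """Fraction of planned trajectories flagged by the collision test.
    collides: callable returning True if a trajectory hits an obstacle."""
    hits = sum(1 for traj in pred_batch if collides(traj))
    return hits / len(pred_batch)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    gt = np.cumsum(rng.normal(0.5, 0.1, size=(6, 2)), axis=0)  # 6 future waypoints
    pred = gt + rng.normal(0, 0.3, size=gt.shape)              # noisy prediction
    print("avg L2 (m):", round(avg_l2_error(pred, gt), 3))
    print("collision rate:", collision_rate([pred], lambda t: False))
```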
NeurIPS 2025 Spotlight | FSDrive Unifies VLA and World Models, Pushing Autonomous Driving Toward Visual Reasoning
机器之心· 2025-09-30 08:45
Core Insights
- The article introduces FSDrive, a novel approach that uses a "Spatio-Temporal Chain-of-Thought" (CoT) to enhance visual reasoning in autonomous driving, moving away from traditional symbolic logic toward a more intuitive process of visual simulation and imagination [7][28].

Group 1: Methodology and Innovations
- FSDrive proposes a unified "visual intermediary" that replaces text or tabular mediators, effectively eliminating cross-modal semantic gaps [8].
- The method activates image generation capabilities on existing Multi-Modal Large Language Models (MLLM) at minimal cost by expanding the vocabulary to include visual tokens, avoiding major architectural changes or extensive retraining [8][19].
- A progressive visual CoT is employed, starting with coarse-grained perception maps (lane lines and 3D boxes) and gradually generating detailed future frames, explicitly injecting physical realism [8][19].

Group 2: Performance and Metrics
- FSDrive demonstrates competitive performance in trajectory planning and scene understanding, achieving an average L2 error of 0.53 and a collision rate of 0.19, outperforming existing methods like UniAD [29][22].
- The quality of future frame generation is reflected in an FID score of 10.1 at a resolution of 128×192, surpassing many diffusion-based world models [22].
- In scene understanding tasks, FSDrive achieves a final score of 0.57, exceeding other recent methods and showcasing the effectiveness of its unified pre-training approach [25].

Group 3: Practical Applications and Future Directions
- FSDrive maintains a simple end-to-end pipeline and interpretable visual reasoning while leveraging large amounts of unannotated video data to learn world evolution patterns [9].
- The framework is adaptable to mainstream MLLMs, indicating its potential for broad application in the autonomous driving industry [20].
- Future developments may include expanding the model to predict a unified panoramic view while addressing safety, privacy, and regulatory compliance issues as the technology matures [30].
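One common way to "expand the vocabulary with visual tokens" without touching the backbone architecture is to enlarge the embedding table and reserve the new rows for discrete image codes. The sketch below shows the embedding side of that idea; the token names, vocabulary sizes, and dimensions are illustrative assumptions rather than FSDrive's actual configuration.

```python
# Hedged sketch: extending a language model's embedding table with visual tokens.
# Token names, vocab sizes, and dimensions are illustrative assumptions.
import torch
import torch.nn as nn

def extend_embeddings(old_emb: nn.Embedding, num_new: int) -> nn.Embedding:
    """Return a larger embedding table that keeps the original rows and
    appends randomly initialized rows for the new visual tokens."""
    old_vocab, dim = old_emb.weight.shape
    new_emb = nn.Embedding(old_vocab + num_new, dim)
    with torch.no_grad():
        new_emb.weight[:old_vocab] = old_emb.weight    # preserve text tokens
        new_emb.weight[old_vocab:].normal_(0.0, 0.02)  # init visual tokens
    return new_emb

# Assumed setup: a 32k text vocabulary plus 8,192 discrete visual codes
# (e.g. from a VQ image tokenizer) mapped to tokens <img_0> ... <img_8191>.
text_emb = nn.Embedding(32_000, 1024)
visual_token_names = [f"<img_{i}>" for i in range(8_192)]
emb = extend_embeddings(text_emb, num_new=len(visual_token_names))
print(emb.weight.shape)  # torch.Size([40192, 1024])
```

The output projection would be enlarged the same way, so future-frame tokens can be predicted with the same next-token objective used for text.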
Surpassing YOLOv3 with a Multimodal LLM! Reinforcement Learning Pushes the Limits of Multimodal Perception | Open Source
量子位· 2025-05-03 04:05
Core Insights
- The article discusses the introduction of Perception-R1 (PR1), a groundbreaking multimodal large language model (MLLM) that surpasses previous models like YOLOv3 and Faster-RCNN by achieving over 30 AP on the COCO2017 validation set [1][16].

Group 1: Introduction of Perception-R1
- Perception-R1 is developed by research teams from Huazhong University of Science and Technology, Beijing University of Posts and Telecommunications, and others, focusing on enhancing visual reasoning through rule-based reinforcement learning (RL) [1][5].
- The model aims to improve capabilities in pure visual tasks such as counting and general object detection, as well as visual-language tasks like grounding and OCR [1][4].

Group 2: Importance of Visual Perception in AI
- The article emphasizes the need for a revolution in AI visual perception, highlighting the rapid advancements in AI's ability to understand visual information, which is crucial for applications ranging from autonomous driving to medical diagnostics [3][4].
- It points out the subtle differences between recognizing objects and understanding their interactions in detail, indicating that current MLLMs often struggle with complex visual reasoning tasks [4].

Group 3: Role of Reinforcement Learning
- The rise of reinforcement learning, particularly techniques like RLHF (Reinforcement Learning from Human Feedback) and rule-based RL, is noted as a transformative factor for language models, prompting the development of Perception-R1 [5][6].
- The article raises the question of whether RL can similarly enhance MLLM's visual perception capabilities, suggesting that early attempts have shown promise but not universal success [6].

Group 4: Perception-R1 Framework
- Perception-R1 is not a new MLLM from scratch but a post-training framework designed to significantly enhance the visual perception abilities of existing capable MLLMs [7].
- It employs a technique called Group Relative Policy Optimization (GRPO) to optimize the perception strategy, which is crucial for improving visual task performance [9].

Group 5: Reward Engineering
- The article discusses the importance of reward modeling in reinforcement learning, where the reward function guides the learning process by quantifying the model's performance on visual tasks [11].
- Perception-R1's reward structure includes extracting relevant visual details, executing logical operations based on visual understanding, and generating outputs in the correct format [11][17].

Group 6: Experimental Results
- Perception-R1's performance is evaluated against strong benchmarks and specialized models, demonstrating significant improvements in visual counting and object detection tasks [16][19].
- For instance, in visual counting tasks, Perception-R1 achieved 78.1 on Pixmo-Count, outperforming other models [19].

Group 7: Scalability and Future Implications
- The article concludes that Perception-R1 lays a critical foundation for future advancements in intelligent AI visual perception, suggesting that its principles could play a key role in developing next-generation perceptual AI systems [24][25].
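As a concrete illustration of the rule-based reward engineering described above, the sketch below scores a set of predicted boxes with a format check plus an IoU-based greedy match against ground truth. The weights, the 0.5 IoU threshold, and the greedy matching rule are assumptions for illustration; Perception-R1's actual reward functions may differ in detail.

```python
# Hedged sketch: a rule-based reward for detection outputs during RL post-training.
# Reward weights, threshold, and matching rule are illustrative assumptions.

def iou(a, b):
    """IoU between two (x0, y0, x1, y1) boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area = lambda box: (box[2] - box[0]) * (box[3] - box[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def detection_reward(pred_boxes, gt_boxes, format_ok, iou_thresh=0.5):
    """Combine a format reward with a greedy-matching accuracy reward."""
    if not format_ok:            # malformed output gets no credit
        return 0.0
    unmatched = list(gt_boxes)
    hits = 0
    for p in pred_boxes:
        best = max(unmatched, key=lambda g: iou(p, g), default=None)
        if best is not None and iou(p, best) >= iou_thresh:
            hits += 1
            unmatched.remove(best)
    precision = hits / max(len(pred_boxes), 1)
    recall = hits / max(len(gt_boxes), 1)
    return 0.2 + 0.8 * (precision + recall) / 2  # small bonus for valid format

print(detection_reward([(0, 0, 10, 10)], [(1, 1, 10, 10)], format_ok=True))
```

In a GRPO-style setup, several sampled answers to the same prompt would each receive such a reward, and the policy is updated toward the samples that score above the group average.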
An AR Intelligence Revolution! The Satori System Reads Human Intent, Bringing Sci-Fi Movie Scenes to Life
机器之心· 2025-04-28 01:26
In countless science-fiction films, augmented reality (AR) overlays animations, text, graphics, and other visual information in front of people's eyes, giving them timely information beyond what their own senses can provide. Whether it is a surgeon operating with AR glasses, routine inspections on a smart-factory assembly line, or the superpower of quickly looking things up while reading a book, all of it serves one ultimate purpose: assisting us with well-timed information.

To this day, most AR assistance still relies on a human expert connecting in remotely, far from the intelligent, comprehending, and extensible AR assistance we expect. This has limited AR's adoption in important industries and in everyday life. Getting AR to truly understand the user, understand the environment, and offer help at the right moment remains a major challenge.

Demonstration of the Satori system automatically recognizing a user weighing out 11 g of coffee

With the arrival of the Satori system, all of this is about to change. Researchers from New York University's Data and Visualization Lab (NYU VIDA), working with Adobe, combine a multimodal large language model (MLLM) with the cognitive theory BDI (belief-desire-intention theory) so that, for the first time, AI genuinely understands the user's actions, goals, and the state of the environment, and can automatically adapt what instructions to show, which steps to give, and when to intervene for each scenario. Connecting AR assistance to an intelligent core ...
AI Can Understand Images but Struggles to Estimate Distance: Shanghai Jiao Tong University's Spatial-Temporal Intelligence Benchmark Stumps Nine Top Multimodal Models
量子位· 2025-04-15 03:54
Core Insights
- The article discusses the increasing application of Multi-Modal Large Language Models (MLLM) in embodied intelligence and autonomous driving, questioning their readiness to understand complex physical environments [1][2].
- The introduction of the Spatial-Temporal Intelligence Benchmark (STI-Bench) aims to challenge current MLLMs on their precise spatial-temporal understanding capabilities [1][4].

Group 1: MLLM Capabilities
- MLLMs have shown significant achievements in visual language understanding but need to go beyond traditional semantic understanding to possess accurate spatial-temporal intelligence [2].
- Core tasks in AI applications such as autonomous driving and robotic operations require quantitative spatial-temporal understanding, which is currently a weak point for existing models [3][19].

Group 2: STI-Bench Overview
- STI-Bench is designed to evaluate models using real-world video inputs, focusing on precise and quantitative spatial-temporal understanding [4].
- The benchmark includes over 300 real-world videos covering three typical scenarios: desktop operations (millimeter-level), indoor environments (centimeter-level), and outdoor scenes (decimeter-level) [6].

Group 3: Evaluation Metrics
- The evaluation consists of eight tasks across two dimensions: static spatial understanding (measuring scale, spatial relationships, and 3D video localization) and dynamic temporal understanding (displacement and path length, speed and acceleration, ego orientation, trajectory description, and pose estimation) [6].
- The dataset also includes over 2,000 high-quality question-answer pairs, ensuring accuracy and relevance to the corresponding scenes [8].

Group 4: Experimental Results
- The evaluation of leading MLLMs, including proprietary models like GPT-4o and Gemini-2.5-Pro, revealed overall poor performance, with the best models achieving less than 42% accuracy, only slightly above random guessing [12][20].
- Qwen2.5-VL-72B emerged as a standout, outperforming all proprietary models and providing a boost to the open-source community [13].

Group 5: Error Analysis
- The research identified three core bottlenecks in MLLMs: inaccuracies in estimating quantitative spatial attributes, deficiencies in understanding temporal dynamics, and weak cross-modal integration capabilities [15][16][17].
- These issues highlight the significant gaps in MLLMs' ability to perform precise spatial-temporal understanding, indicating directions for future research [19][20].

Group 6: Conclusion
- The results from STI-Bench clearly indicate the serious shortcomings of current MLLMs in precise spatial-temporal understanding, which is essential for their application in embodied intelligence and autonomous driving [20][21].
- The release of STI-Bench provides a new benchmark for assessing and improving MLLMs' spatial-temporal understanding capabilities, guiding researchers toward potential solutions [21].
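Benchmarks of this kind typically score each question and then aggregate accuracy per task. The minimal sketch below illustrates that aggregation; the record fields and the exact-match scoring rule are assumptions for illustration, not STI-Bench's released evaluation code.

```python
# Hedged sketch: per-task accuracy aggregation over a QA benchmark.
# Field names and the exact-match scoring rule are illustrative assumptions.
from collections import defaultdict

def evaluate(records):
    """records: iterable of dicts with 'task', 'prediction', 'answer' keys."""
    correct, total = defaultdict(int), defaultdict(int)
    for r in records:
        total[r["task"]] += 1
        if r["prediction"].strip().upper() == r["answer"].strip().upper():
            correct[r["task"]] += 1
    per_task = {t: correct[t] / total[t] for t in total}
    overall = sum(correct.values()) / sum(total.values())
    return per_task, overall

demo = [
    {"task": "speed", "prediction": "B", "answer": "B"},
    {"task": "speed", "prediction": "A", "answer": "C"},
    {"task": "measurement", "prediction": "D", "answer": "D"},
]
print(evaluate(demo))  # ({'speed': 0.5, 'measurement': 1.0}, 0.666...)
```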