Multimodal Large Language Models (MLLM)
Surpassing YOLOv3 with a multimodal LLM! Reinforcement learning pushes the limits of multimodal perception | Open source
量子位 · 2025-05-03 04:05
Core Insights
- The article introduces Perception-R1 (PR1), an MLLM trained with rule-based reinforcement learning that surpasses classic detectors such as YOLOv3 and Faster R-CNN, achieving over 30 AP on the COCO2017 validation set [1][16].

Group 1: Introduction of Perception-R1
- Perception-R1 is developed by research teams from Huazhong University of Science and Technology, Beijing University of Posts and Telecommunications, and others, focusing on enhancing visual reasoning through rule-based reinforcement learning (RL) [1][5].
- The model aims to improve capabilities in pure visual tasks such as counting and general object detection, as well as visual-language tasks like grounding and OCR [1][4].

Group 2: Importance of Visual Perception in AI
- The article emphasizes the need for a revolution in AI visual perception, highlighting the rapid advances in AI's ability to understand visual information, which is crucial for applications ranging from autonomous driving to medical diagnostics [3][4].
- It points out the gap between recognizing objects and understanding their interactions in detail, noting that current MLLMs often struggle with complex visual reasoning tasks [4].

Group 3: Role of Reinforcement Learning
- The rise of reinforcement learning, particularly techniques like RLHF (Reinforcement Learning from Human Feedback) and rule-based RL, is noted as a transformative factor for language models, prompting the development of Perception-R1 [5][6].
- The article asks whether RL can similarly enhance MLLMs' visual perception, suggesting that early attempts have shown promise but not universal success [6].

Group 4: Perception-R1 Framework
- Perception-R1 is not a new MLLM built from scratch but a post-training framework designed to significantly enhance the visual perception abilities of existing capable MLLMs [7].
- It employs Group Relative Policy Optimization (GRPO) to optimize the perception policy, which is crucial for improving visual task performance (see the hedged sketch after this summary) [9].

Group 5: Reward Engineering
- The article discusses the importance of reward modeling in reinforcement learning, where the reward function guides learning by quantifying the model's performance on visual tasks [11].
- Perception-R1's reward structure covers extracting relevant visual details, executing logical operations based on visual understanding, and generating outputs in the correct format [11][17].

Group 6: Experimental Results
- Perception-R1's performance is evaluated against strong benchmarks and specialized models, demonstrating significant improvements in visual counting and object detection tasks [16][19].
- For instance, in visual counting, Perception-R1 achieved 78.1 on Pixmo-Count, outperforming other models [19].

Group 7: Scalability and Future Implications
- The article concludes that Perception-R1 lays a critical foundation for future advances in intelligent AI visual perception, suggesting that its principles could play a key role in developing next-generation perceptual AI systems [24][25].
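To make the GRPO and rule-based-reward ideas above concrete, here is a minimal sketch of a detection-style rollout score and the group-relative advantage normalization that defines GRPO. The box-parsing format, the `<answer>` tag check, and the reward weights are illustrative assumptions, not Perception-R1's released implementation; only the within-group reward normalization follows the standard GRPO formulation.

```python
# Minimal sketch: rule-based rewards + GRPO-style advantages for a detection-
# oriented MLLM rollout. Formats and weights below are assumptions for illustration.
import re
import numpy as np


def iou(box_a, box_b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda b: max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    return inter / (area(box_a) + area(box_b) - inter + 1e-9)


def parse_boxes(text: str):
    """Extract boxes written as (x1, y1), (x2, y2) pairs -- an assumed output format."""
    pat = r"\((\d+\.?\d*),\s*(\d+\.?\d*)\)\s*,\s*\((\d+\.?\d*),\s*(\d+\.?\d*)\)"
    return [[float(a), float(b), float(c), float(d)]
            for a, b, c, d in re.findall(pat, text)]


def rule_based_reward(response: str, gt_boxes: list) -> float:
    """Score one sampled response: output-format check plus detection accuracy."""
    fmt_ok = 1.0 if re.search(r"<answer>.*</answer>", response, re.S) else 0.0
    pred_boxes = parse_boxes(response)
    matched, remaining = 0, list(gt_boxes)
    for p in pred_boxes:                               # greedy one-to-one matching by IoU
        best = max(remaining, key=lambda g: iou(p, g), default=None)
        if best is not None and iou(p, best) >= 0.5:
            matched += 1
            remaining.remove(best)
    acc = matched / max(len(gt_boxes), 1)
    return 0.5 * fmt_ok + 1.0 * acc                    # reward weights are assumptions


def grpo_advantages(rewards: np.ndarray) -> np.ndarray:
    """Group-relative advantages: normalize rewards within one prompt's sampled group."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-6)


# Usage sketch: sample a group of responses for one image + instruction, score them
# with the rule-based reward, and feed the advantages into a clipped policy-gradient
# (PPO-style) update of the MLLM.
# rewards = np.array([rule_based_reward(r, gt_boxes) for r in sampled_responses])
# advantages = grpo_advantages(rewards)
```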
An AR intelligence revolution! The Satori system reads human intent, turning sci-fi movie scenes into reality
机器之心 · 2025-04-28 01:26
In countless science-fiction films, augmented reality (AR) overlays animations, text, graphics, and other visual information in front of people's eyes, giving them timely information beyond their own perceptual abilities. Whether it is a surgeon operating with AR glasses, routine inspections on a smart-factory assembly line, or the superpower of having AR quickly look things up while reading a book, all of this serves one ultimate purpose: assisting us with the right information at the right time.

To this day, most AR assistance still relies on a human expert connecting remotely, far from the intelligent, comprehending, and scalable AR assistance we expect. This has limited the adoption of AR in important industries and in everyday life. Making AR truly understand the user, understand the environment, and offer assistance at the right moment remains a major challenge.

A demonstration of the Satori system automatically recognizing that the user is weighing 11 g of coffee

With the birth of the Satori system, all of this is about to become a thing of the past. Researchers from the NYU Data and Visualization Lab (NYU VIDA), together with Adobe, fuse a multimodal large language model (MLLM) with the cognitive theory BDI (Belief-desire-intention theory) so that, for the first time, AI genuinely understands the user's actions, goals, and environment state, and can automatically adapt the instruction content, instruction steps, and timing of assistance to different scenarios, connecting AR assistance to an intelligent core ...
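The core idea summarized above, pairing an MLLM with a BDI (belief-desire-intention) model of the user, can be illustrated with a small sketch. The data structure, the prompt wording, and the `query_mllm` call are assumptions for illustration only, not the Satori authors' actual interface.

```python
# Minimal sketch of a BDI-style state an AR assistant could maintain and hand to
# an MLLM to decide what guidance to show and when. Field names, the prompt
# template, and query_mllm() are illustrative assumptions, not Satori's API.
from dataclasses import dataclass, field


@dataclass
class BDIState:
    beliefs: dict = field(default_factory=dict)      # what the system believes about the scene
    desires: list = field(default_factory=list)      # the user's inferred goals
    intentions: list = field(default_factory=list)   # concrete next steps the user is pursuing


def build_assist_prompt(state: BDIState, frame_caption: str) -> str:
    """Fold the current BDI estimate and the latest camera observation into one prompt."""
    return (
        "You are an AR assistant.\n"
        f"Scene observation: {frame_caption}\n"
        f"Beliefs: {state.beliefs}\n"
        f"User goals: {state.desires}\n"
        f"Current intentions: {state.intentions}\n"
        "Decide: (1) should guidance be shown now? "
        "(2) which step should be highlighted, and with what short instruction?"
    )


# Usage sketch: update beliefs from perception, then ask the MLLM for guidance.
state = BDIState(
    beliefs={"scale_reading_g": 11, "object_on_scale": "coffee beans"},
    desires=["brew pour-over coffee"],
    intentions=["weigh 11 g of coffee"],
)
prompt = build_assist_prompt(state, "User places coffee beans on a kitchen scale.")
# guidance = query_mllm(image=current_frame, text=prompt)  # hypothetical MLLM call
```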
AI can read images but can't judge distances: Shanghai Jiao Tong's spatial-temporal intelligence benchmark stumps 9 top multimodal models
量子位 · 2025-04-15 03:54
Core Insights
- The article discusses the increasing application of Multi-Modal Large Language Models (MLLMs) in embodied intelligence and autonomous driving, questioning their readiness to understand complex physical environments [1][2]
- The introduction of the Spatial-Temporal Intelligence Benchmark (STI-Bench) aims to challenge current MLLMs on their precise spatial-temporal understanding capabilities [1][4]

Group 1: MLLM Capabilities
- MLLMs have shown significant achievements in visual language understanding but need to surpass traditional semantic understanding to possess accurate spatial-temporal intelligence [2]
- The core tasks in AI applications, such as autonomous driving and robotic operations, require quantitative spatial-temporal understanding, which is currently a weak point for existing models [3][19]

Group 2: STI-Bench Overview
- STI-Bench is designed to evaluate models using real-world video inputs, focusing on precise and quantitative spatial-temporal understanding [4]
- The benchmark includes over 300 real-world videos covering three typical scenarios: desktop operations (millimeter-level), indoor environments (centimeter-level), and outdoor scenes (decimeter-level) [6]

Group 3: Evaluation Metrics
- The evaluation consists of eight tasks divided into two dimensions: static spatial understanding (measuring scale, spatial relationships, and 3D video localization) and dynamic temporal understanding (displacement, speed, acceleration, ego orientation, trajectory description, and pose estimation) [6]
- The dataset also includes over 2,000 high-quality question-answer pairs, ensuring accuracy and relevance to the corresponding scenes (a hedged sketch of such a question-answer evaluation loop follows this summary) [8]

Group 4: Experimental Results
- The evaluation of leading MLLMs, including proprietary models like GPT-4o and Gemini-2.5-Pro, revealed overall poor performance, with the best models achieving less than 42% accuracy, only slightly above random guessing [12][20]
- Qwen2.5-VL-72B emerged as a standout, outperforming all proprietary models and providing a boost to the open-source community [13]

Group 5: Error Analysis
- The research identified three core bottlenecks in MLLMs: inaccuracies in estimating quantitative spatial attributes, deficiencies in understanding temporal dynamics, and weak cross-modal integration capabilities [15][16][17]
- These issues highlight the significant gaps in MLLMs' abilities to perform precise spatial-temporal understanding, indicating directions for future research [19][20]

Group 6: Conclusion
- The results from STI-Bench clearly indicate the serious shortcomings of current MLLMs in precise spatial-temporal understanding, which is essential for their application in embodied intelligence and autonomous driving [20][21]
- The release of STI-Bench provides a new benchmark for assessing and improving MLLMs' spatial-temporal understanding capabilities, guiding researchers towards potential solutions [21]
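To make the evaluation protocol concrete, here is a minimal sketch of how a multiple-choice video question-answer benchmark of this kind can be scored per task and overall. The JSONL field names and the `ask_mllm` callable are assumptions for illustration; STI-Bench's actual data format and evaluation harness may differ.

```python
# Minimal sketch of scoring an MLLM on multiple-choice video QA.
# JSONL field names and ask_mllm() are illustrative assumptions, not STI-Bench's harness.
import json
import re
from collections import defaultdict


def extract_choice(answer_text: str) -> str:
    """Pull a single option letter (A-D) out of the model's free-form answer."""
    m = re.search(r"\b([A-D])\b", answer_text.strip().upper())
    return m.group(1) if m else ""


def evaluate(jsonl_path: str, ask_mllm) -> dict:
    """Return per-task and overall accuracy over question-answer pairs."""
    correct, total = defaultdict(int), defaultdict(int)
    with open(jsonl_path) as f:
        for line in f:
            # Assumed record shape: {"video": ..., "task": ..., "question": ...,
            #                        "options": [...], "answer": "B"}
            item = json.loads(line)
            reply = ask_mllm(item["video"], item["question"], item["options"])
            pred = extract_choice(reply)
            total[item["task"]] += 1
            correct[item["task"]] += int(pred == item["answer"])
    report = {task: correct[task] / total[task] for task in total}
    report["overall"] = sum(correct.values()) / max(sum(total.values()), 1)
    return report


# Usage sketch: plug in any client that takes (video_path, question, options) and
# returns text, e.g. a wrapper around a local Qwen2.5-VL or a hosted API.
# accuracy = evaluate("sti_bench_qa.jsonl", ask_mllm=my_model_client)
```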