Vision-Language Models (VLM)
After Watching All the Oscar Films, a VLM Reaches a New SOTA in Cinematography Understanding | Open-Sourced by Shanghai AI Lab
量子位· 2025-07-16 01:49
Core Insights
- The article discusses the launch of ShotBench, a comprehensive benchmark designed for understanding film language, along with the ShotVL model and the ShotQA dataset, aimed at enhancing visual language models (VLMs) in film comprehension [1][6][15]

Group 1: ShotBench and Its Components
- ShotBench includes over 3,500 expert-annotated image and video question-answer pairs from more than 200 acclaimed films, covering eight key dimensions of cinematography [1][8]
- The ShotQA dataset consists of approximately 70,000 question-answer pairs, specifically designed to align models with "cinematic language" [15][19]
- The benchmark framework is structured to evaluate models from a professional cinematographer's perspective, focusing on extracting visual cues and reasoning behind cinematic techniques [8][14]

Group 2: Performance Evaluation
- The evaluation of 24 leading VLMs revealed significant limitations, with even the best models achieving an average accuracy below 60%, particularly struggling with fine-grained visual cues and complex spatial reasoning [3][6]
- ShotVL-3B achieved a notable performance improvement of 19% over the baseline model Qwen2.5-VL-3B, establishing new state-of-the-art (SOTA) performance in film language understanding [3][24]
- ShotVL outperformed both the best open-source model (Qwen2.5-VL-72B-Instruct) and proprietary models (GPT-4o) across all dimensions evaluated [3][24]

Group 3: Training Methodology
- ShotVL employs a two-phase training process: first, large-scale supervised fine-tuning (SFT) to acquire broad knowledge, followed by group relative policy optimization (GRPO) for fine-grained reasoning enhancement (a minimal sketch of the GRPO advantage computation follows after this summary) [15][19][20]
- The first phase utilized approximately 70,000 question-answer pairs from the ShotQA dataset to establish strong alignment between visual features and specific cinematic terms [19]
- The second phase focused on improving reasoning capabilities and prediction accuracy, demonstrating the effectiveness of the GRPO approach [20][28]

Group 4: Key Dimensions of Cinematography
- The eight core dimensions covered in ShotBench include Shot Size, Shot Framing, Camera Angle, Lens Size, Lighting Type, Lighting Condition, Composition, and Camera Movement, each critical for understanding film language [11][16][17]
- Each dimension is represented by a substantial number of samples, ensuring comprehensive coverage for model evaluation [17]

Group 5: Open Source Contribution
- The team has made the model, data, and code open-source to facilitate rapid development in AI-driven film understanding and generation [4][30]
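For readers unfamiliar with GRPO, the second training phase mentioned in Group 3 can be pictured as follows: for each question the model samples a group of answers, scores each against the annotation, and converts the scores into group-relative advantages. Below is a minimal, hedged sketch of that advantage computation; the function name, the 0/1 reward scheme, and the toy numbers are illustrative assumptions, not code from the ShotVL release.

```python
import torch

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Group-relative advantages: each sampled answer's reward is normalized
    against the mean and std of its group (all samples for the same question)."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + 1e-6)

# Toy usage: 2 questions, 4 sampled answers each; reward is 1.0 if the predicted
# cinematography term matches the annotation, else 0.0.
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],
                        [0.0, 0.0, 1.0, 0.0]])
advantages = grpo_advantages(rewards)  # above-average answers get positive advantage
```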
CEED-VLA: 4× Inference Acceleration for VLA Models via Revolutionary Consistency Distillation and Early-Exit Decoding!
具身智能之心· 2025-07-10 13:16
Core Viewpoint
- The article discusses the development of a new model called CEED-VLA, which significantly enhances the inference speed of visual-language-action models while maintaining operational performance, making it suitable for high-frequency dexterous tasks [2][30]

Group 1: Model Development
- The CEED-VLA model is designed to accelerate inference through a general method that improves performance across multiple tasks [2]
- The model incorporates a consistency distillation mechanism and mixed-label supervision to enable accurate predictions of high-quality actions from various intermediate states [2][6]
- The Early-exit Decoding strategy is introduced to address inefficiencies in the Jacobi decoding process (see the sketch after this summary), achieving up to 4.1× inference speedup and more than a 4.3× increase in execution frequency [2][15]

Group 2: Experimental Results
- Simulations and real-world experiments demonstrate that CEED-VLA significantly improves inference efficiency while maintaining similar task success rates [6][30]
- The model shows a 2.00× speedup over the teacher model and achieves a higher number of fixed tokens per decoding step, indicating improved performance [19][20]
- In real-world evaluations, CEED-VLA successfully completes dexterous tasks, achieving a success rate exceeding 70% due to enhanced inference speed and control frequency [30][31]
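The Early-exit Decoding idea referenced in Group 1 can be illustrated with a short sketch: standard Jacobi decoding refines a whole block of action tokens in parallel until it reaches an exact fixed point, while an early-exit variant stops once the block has (almost) stopped changing. The code below is a hedged approximation under that assumption; `step_fn` stands in for a VLA forward pass, and nothing here is the CEED-VLA API.

```python
import torch

def jacobi_decode_early_exit(step_fn, draft_tokens, max_iters=16, tol=0):
    """Refine a block of action tokens in parallel (Jacobi iteration) and exit
    early once at most `tol` tokens still change between iterations."""
    tokens = draft_tokens.clone()                      # initial guess, e.g. all PAD ids
    for _ in range(max_iters):
        new_tokens = step_fn(tokens)                   # one parallel forward pass over the block
        num_changed = (new_tokens != tokens).sum().item()
        tokens = new_tokens
        if num_changed <= tol:                         # early-exit criterion
            break
    return tokens
```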
AI Is Starting to "Use the Computer on Its Own"! Jilin University Proposes the "ScreenExplorer" Agent
机器之心· 2025-06-27 04:02
Core Viewpoint
- The article discusses the development of a vision-language model (VLM) agent named ScreenExplorer, which is designed to autonomously explore and interact within open graphical user interface (GUI) environments, marking a significant step towards achieving general artificial intelligence (AGI) [2][3][35]

Group 1: Breakthroughs and Innovations
- The research introduces three core breakthroughs in the training of VLM agents for GUI exploration [6]
- A real-time interactive online reinforcement learning framework is established, allowing the VLM agent to interact with a live GUI environment [8][11]
- The introduction of a "curiosity mechanism" addresses the sparse feedback issue in open GUI environments, motivating the agent to explore diverse interface states [10][12]

Group 2: Training Methodology
- The training involves a heuristic and world-model-driven reward system that encourages exploration by providing immediate rewards for diverse actions (a sketch of this kind of curiosity reward follows after this summary) [12][24]
- The GRPO algorithm is utilized for reinforcement learning training, calculating the advantage of actions based on rewards obtained [14][15]
- The training process allows for multiple parallel environments to synchronize reasoning, execution, and recording, enabling "learning by doing" [15]

Group 3: Experimental Results
- Initial experiments show that without training, the Qwen2.5-VL-3B model fails to interact effectively with the GUI [17]
- After training, the model demonstrates improved capabilities, successfully opening applications and navigating deeper into pages [18][20]
- The ScreenExplorer models outperform general models in exploration diversity and interaction effectiveness, indicating a significant advancement in autonomous GUI interaction [22][23]

Group 4: Skill Emergence and Conclusion
- The training process leads to the emergence of new skills, such as cross-modal translation and complex reasoning abilities [29][34]
- The research concludes that ScreenExplorer effectively enhances GUI interaction capabilities through a combination of exploration rewards, world models, and GRPO reinforcement learning, paving the way for more autonomous agents and progress towards AGI [35]
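The world-model-driven exploration reward described in Group 2 is commonly implemented as a prediction-error bonus: screens the world model cannot yet predict are treated as novel and rewarded. The sketch below illustrates that general recipe; the `world_model` and `encoder` callables are placeholders, not the ScreenExplorer implementation.

```python
import torch
import torch.nn.functional as F

def curiosity_reward(world_model, encoder, screen_before, action_emb, screen_after):
    """Exploration bonus = world-model prediction error on the next screen:
    poorly predicted (novel) screens earn a larger reward."""
    with torch.no_grad():
        s_t = encoder(screen_before)            # embedding of the current screenshot
        s_next = encoder(screen_after)          # embedding of the resulting screenshot
        s_pred = world_model(s_t, action_emb)   # predicted next-state embedding
        return F.mse_loss(s_pred, s_next).item()
```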
CVPR'25 | Perception Performance Jumps 50%! JarvisIR: a VLM at the Helm, Undaunted by Adverse Weather
具身智能之心· 2025-06-21 12:06
Core Viewpoint
- JarvisIR represents a significant advancement in image restoration technology, utilizing a Visual Language Model (VLM) as a controller to coordinate multiple expert models for robust image recovery under various weather conditions [5][51]

Group 1: Background and Motivation
- The research addresses challenges in visual perception systems affected by adverse weather conditions, proposing JarvisIR as a solution to enhance image recovery capabilities [5]
- Traditional methods struggle with complex real-world scenarios, necessitating a more versatile approach [5]

Group 2: Methodology Overview
- The JarvisIR architecture employs a VLM to autonomously plan task sequences and select appropriate expert models for image restoration (a minimal sketch of this controller loop follows after this summary) [9]
- The CleanBench dataset, comprising 150K synthetic and 80K real-world images, is developed to support training and evaluation [12][15]
- The MRRHF alignment algorithm combines supervised fine-tuning and human feedback to improve model generalization and decision stability [9][27]

Group 3: Training Framework
- The training process consists of two phases: supervised fine-tuning (SFT) using synthetic data and MRRHF for real-world data alignment [23][27]
- MRRHF employs a reward-modeling approach to assess image quality and guide VLM optimization [28]

Group 4: Experimental Results
- JarvisIR-MRRHF demonstrates superior decision-making capabilities compared to other strategies, achieving a score of 6.21 on the CleanBench-Real validation set [43]
- In image restoration performance, JarvisIR-MRRHF outperforms existing methods across various weather conditions, with an average improvement of 50% in perceptual metrics [47]

Group 5: Technical Highlights
- The integration of a VLM as the control center marks a novel application in image restoration, enhancing contextual understanding and task planning [52]
- The collaborative mechanism of expert models allows for tailored responses to different weather-induced image degradations [52]
- The release of the CleanBench dataset fills a critical gap in real-world image restoration data, promoting further research and development in the field [52]
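The controller behaviour summarized in Group 2 (the VLM plans a task sequence, then dispatches each task to an expert restorer) can be pictured with the following hedged sketch. The `plan_with_vlm` function, the task names, and the expert registry are illustrative assumptions rather than the JarvisIR interface.

```python
from typing import Callable, Dict, List

def restore(image, plan_with_vlm: Callable, experts: Dict[str, Callable]):
    """plan_with_vlm(image) -> ordered task list, e.g. ["derain", "denoise", "enhance"];
    `experts` maps each task name to a callable restoration model."""
    plan: List[str] = plan_with_vlm(image)     # VLM decides which degradations to treat, and in what order
    for task in plan:
        expert = experts.get(task)
        if expert is None:
            continue                           # skip tasks with no registered expert
        image = expert(image)                  # apply expert models in the planned order
    return image
```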
Learning End-to-End Large Models, but Still Unclear About the Difference Between VLM and VLA...
自动驾驶之心· 2025-06-19 11:54
Core Insights
- The article emphasizes the growing importance of large models (VLM) in the field of intelligent driving, highlighting their potential for practical applications and production [2][4]

Group 1: VLM and VLA
- VLM (Vision-Language Model) focuses on foundational capabilities such as detection, question answering, spatial understanding, and reasoning [4]
- VLA (Vision-Language-Action) is more action-oriented, aimed at trajectory prediction in autonomous driving, requiring a deep understanding of human-like reasoning and perception [4]
- It is recommended to learn VLM first before expanding to VLA, as VLM can predict trajectories through diffusion models, enhancing action capabilities in uncertain environments (a toy sketch of diffusion-based trajectory sampling follows after this summary) [4]

Group 2: Community and Resources
- The article invites readers to join a knowledge-sharing community that offers comprehensive resources, including video courses, hardware, and coding materials related to autonomous driving [4]
- The community aims to build a network of professionals in intelligent driving and embodied intelligence, with a target of gathering 10,000 members in three years [4]

Group 3: Technical Directions
- The article outlines four cutting-edge technical directions in the industry: Vision-Language Models, World Models, Diffusion Models, and End-to-End Autonomous Driving [5]
- It provides links to various resources and papers that cover advancements in these areas, indicating a robust framework for ongoing research and development [6][31]

Group 4: Datasets and Applications
- A variety of datasets are mentioned that are crucial for training and evaluating models in autonomous driving, including pedestrian detection, object tracking, and scene understanding [19][20]
- The article discusses the application of language-enhanced systems in autonomous driving, showcasing how natural language processing can improve vehicle navigation and interaction [20][21]

Group 5: Future Trends
- The article highlights the potential for large models to significantly impact the future of autonomous driving, particularly in enhancing decision-making and control systems [24][25]
- It suggests that the integration of language models with driving systems could lead to more intuitive and human-like vehicle behavior [24][25]
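As a toy illustration of the diffusion-based trajectory prediction mentioned in Group 1, the sketch below runs a standard DDPM-style reverse process over a short sequence of future waypoints conditioned on a scene embedding. The `denoiser` network, the noise schedule, and the conditioning are generic assumptions, not a specific driving stack.

```python
import torch

def sample_trajectory(denoiser, cond, horizon=8, dim=2, T=50):
    """DDPM-style reverse diffusion over future waypoints, conditioned on
    scene features `cond` (e.g. an embedding produced by the perception/VLM stack)."""
    betas = torch.linspace(1e-4, 0.02, T)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    x = torch.randn(horizon, dim)                        # start from pure noise over (x, y) waypoints
    for t in reversed(range(T)):
        eps = denoiser(x, cond, torch.tensor([t]))       # predict the noise added at step t
        mean = (x - betas[t] / torch.sqrt(1.0 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise          # ancestral sampling step
    return x                                             # denoised (horizon, 2) trajectory
```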
First-Ever Pixel-Space Reasoning: a 7B Model Leads GPT-4o, Letting VLMs Use "Eyes and Brain Together" Like Humans
量子位· 2025-06-09 09:27
Core Viewpoint
- The article discusses the transition of Visual Language Models (VLM) from "perception" to "cognition," highlighting the introduction of "Pixel-Space Reasoning," which allows models to interact with visual information directly at the pixel level, enhancing their understanding and reasoning capabilities [1][2][3]

Group 1: Key Developments in VLM
- The current mainstream VLMs are limited by their reliance on text tokens, which can lead to loss of critical information in high-resolution images and dynamic video scenes [2][4]
- "Pixel-Space Reasoning" enables models to perform visual operations directly, allowing for a more human-like interaction with visual data [3][6]
- This new reasoning paradigm shifts the focus from text-mediated understanding to native visual operations, enhancing the model's ability to capture spatial relationships and dynamic details [6][7]

Group 2: Overcoming Learning Challenges
- The research team identified a "cognitive inertia" challenge where the model's established text reasoning capabilities hinder the development of new pixel operation skills, creating a "learning trap" [8][9]
- To address this, a reinforcement learning framework was designed that combines intrinsic curiosity incentives with extrinsic correctness rewards, encouraging the model to explore visual operations (a hedged sketch of such a combined reward follows after this summary) [9][12]
- The framework includes constraints to ensure a minimum rate of pixel-space reasoning and to balance exploration with computational efficiency [10][11]

Group 3: Performance Validation
- The Pixel-Reasoner, based on the Qwen2.5-VL-7B model, achieved impressive results across four visual reasoning benchmarks, outperforming models like GPT-4o and Gemini-2.5-Pro [13][19]
- Specifically, it achieved an accuracy of 84.3% on V* Bench, significantly higher than its competitors [13]
- The model demonstrated a 73.8% accuracy on TallyQA-Complex, showcasing its ability to differentiate between similar objects in images [19][20]

Group 4: Future Implications
- The research indicates that pixel-space reasoning is not a replacement for text reasoning but rather a complementary pathway for VLMs, enabling a dual-track understanding of the world [21]
- As multi-modal reasoning capabilities evolve, the industry is moving towards a future where machines can "see more clearly and think more deeply" [21]
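One way to realize the combined incentive described in Group 2 is a shaped reward: an extrinsic correctness term, an intrinsic bonus for responses that actually invoke pixel operations while the overall usage rate is still below a floor, and a mild cost on excessive operations. The sketch below is an assumption-laden illustration of that structure, not the paper's published reward formula; all coefficients are placeholders.

```python
def shaped_reward(is_correct: bool, used_pixel_ops: bool, running_op_rate: float,
                  num_ops: int, rate_floor=0.3, bonus=0.5, op_cost=0.05, max_free_ops=2) -> float:
    """Correctness reward + curiosity-style bonus + mild penalty on over-use of pixel ops."""
    r = 1.0 if is_correct else 0.0                        # extrinsic correctness reward
    if used_pixel_ops and running_op_rate < rate_floor:   # bonus only while pixel-op usage is still rare,
        r += bonus                                        # counteracting the "cognitive inertia" toward text-only reasoning
    r -= op_cost * max(0, num_ops - max_free_ops)         # keep the number of visual operations in check
    return r
```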
The Three "Growing Pains" of Embodied Intelligence
Group 1: Industry Overview
- The humanoid robot industry has made rapid progress this year, with significant public interest sparked by events such as the Spring Festival Gala and the first humanoid robot half marathon [1]
- Key technologies driving advancements in humanoid robots include large language models (LLM), visual language models (VLM), and visual-language-action end-to-end models (VLA), which enhance interaction perception and generalization capabilities [1][3]
- Despite advancements, challenges remain in data collection, robot morphology applications, and the integration of large and small brain systems [1][3]

Group 2: Data Challenges
- The industry faces a bottleneck in data scarcity, particularly in acquiring 3D data necessary for training robots to perform tasks in physical environments [3][4]
- Traditional data collection methods are costly and time-consuming, with companies like Zhiyuan Robotics employing extensive human resources for data gathering [4]
- The introduction of 3D generative AI for Sim2Real simulation is seen as a potential solution to meet the high demand for generalizable data in embodied intelligence [4]

Group 3: Technological Evolution
- The evolution of robots has progressed through three stages: industrial automation, large models, and end-to-end large models, each serving different application needs [6]
- End-to-end models integrate multimodal inputs and outputs, improving decision-making efficiency and enhancing humanoid robot capabilities [6][7]
- Experts emphasize that humanoid robots are not synonymous with embodied intelligence, but they represent significant demand and challenges for the technology [7]

Group 4: Brain Integration Solutions
- The integration of large and small brain systems is a focus area, with companies like Intel and Dongtu Technology proposing solutions to reduce costs and improve software development efficiency [9][10]
- Challenges in achieving brain integration include ensuring real-time performance and managing dynamic computational loads during robot operation [10][11]
- The market is pushing for a convergence of technologies, requiring robots to perform tasks in various scenarios while maintaining flexibility and intelligent interaction capabilities [12]
Huawei Noah's Ark VLM for Long-Horizon Embodied Navigation: Global/Self-Memory Mapping and Three Memory Modules Explained
理想TOP2· 2025-04-23 13:34
Source: 深蓝具身智能 (author: 深蓝学院-具身君), a channel under 深蓝学院 dedicated to news and practical insights on embodied intelligence and large models.

"An agent should not be bound by language or viewpoint; the fusion of memory and perception is the key to free navigation."

Before introducing the specific work of this article, let's first review how existing VLN (vision-and-language navigation) methods are categorized. As shown in Table 1, they fall roughly into three classes: navigation based on large language models (LLM), navigation based on value maps, and navigation based on vision-language models (VLM).

| Category | Description | Methods | Advantages | Disadvantages |
| --- | --- | --- | --- | --- |
| LLM-based navigation | Builds a global memory map, describes candidate goal locations in natural language, and uses an LLM to generate action decisions | LFG, VoroNav, ESC, OpenIMNav | Maintains a global map; uses high-level reasoning | Lacks high-dimensional semantic information, weakening spatial reasoning ability |
| Value-map-based navigation | Computes a global value function from egocentric observations, and based on the generated ... | VLFM, InstructNav | Addresses memory forgetting in long-horizon navigation | The value map is built from local observations and lacks a global view, leading to ...