After Watching Every Oscar Film, a VLM Reaches New SOTA in Cinematography Understanding | Open-Sourced by Shanghai AI Lab
量子位 (QbitAI) · 2025-07-16 01:49
Core Insights
- The article discusses the launch of ShotBench, a comprehensive benchmark for understanding film language, along with the ShotVL model and the ShotQA dataset, all aimed at improving how vision-language models (VLMs) comprehend film [1][6][15].

Group 1: ShotBench and Its Components
- ShotBench includes over 3,500 expert-annotated image and video question-answer pairs from more than 200 acclaimed films, covering eight key dimensions of cinematography [1][8].
- The ShotQA dataset consists of approximately 70,000 question-answer pairs, specifically designed to align models with "cinematic language" [15][19].
- The benchmark framework is structured to evaluate models from a professional cinematographer's perspective, focusing on extracting visual cues and reasoning about the cinematic techniques behind them [8][14].

Group 2: Performance Evaluation
- The evaluation of 24 leading VLMs revealed significant limitations: even the best models achieved an average accuracy below 60%, struggling in particular with fine-grained visual cues and complex spatial reasoning [3][6].
- ShotVL-3B achieved a notable performance improvement of 19% over the baseline model Qwen2.5-VL-3B, establishing new state-of-the-art (SOTA) performance in film language understanding [3][24].
- ShotVL outperformed both the best open-source model (Qwen2.5-VL-72B-Instruct) and proprietary models (GPT-4o) across all dimensions evaluated [3][24].

Group 3: Training Methodology
- ShotVL employs a two-phase training process: first, large-scale supervised fine-tuning (SFT) to acquire broad knowledge, followed by group relative policy optimization (GRPO) to enhance fine-grained reasoning [15][19][20].
- The first phase utilized approximately 70,000 question-answer pairs from the ShotQA dataset to establish strong alignment between visual features and specific cinematic terms [19].
- The second phase focused on improving reasoning capabilities and prediction accuracy, demonstrating the effectiveness of the GRPO approach [20][28].

Group 4: Key Dimensions of Cinematography
- The eight core dimensions covered in ShotBench include Shot Size, Shot Framing, Camera Angle, Lens Size, Lighting Type, Lighting Condition, Composition, and Camera Movement, each critical for understanding film language [11][16][17].
- Each dimension is represented by a substantial number of samples, ensuring comprehensive coverage for model evaluation [17].

Group 5: Open Source Contribution
- The team has made the model, data, and code open-source to facilitate rapid development in AI-driven film understanding and generation [4][30].
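
Illustrative sketch: the GRPO phase described under Group 3
The summary does not say how ShotVL's GRPO rewards are defined, so the snippet below is only a minimal sketch of the general idea behind group relative policy optimization: several answers are sampled for the same question, each is scored, and every score is normalized against the mean and standard deviation of its own group. The reward function (1.0 for a correct multiple-choice answer, 0.0 otherwise), the function names, and the sample data are all assumptions made for illustration. They are not taken from the ShotVL code.

```python
from statistics import mean, pstdev


def correctness_reward(predicted: str, gold: str) -> float:
    """Hypothetical reward: 1.0 if the chosen option matches the label, else 0.0."""
    return 1.0 if predicted.strip().upper() == gold.strip().upper() else 0.0


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each reward against the statistics of its own sampling group.

    Uses the population standard deviation; some implementations use the
    sample standard deviation instead.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division finite when every sample gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group: four answers sampled for one "Shot Size" question.
sampled_answers = ["B", "A", "B", "C"]
gold_answer = "B"

rewards = [correctness_reward(a, gold_answer) for a in sampled_answers]
advantages = group_relative_advantages(rewards)

for answer, reward, adv in zip(sampled_answers, rewards, advantages):
    print(f"answer={answer} reward={reward:.1f} advantage={adv:+.3f}")
```

Answers that beat their group's average receive a positive advantage and are reinforced, while below-average answers receive a negative one. When every sampled answer earns the same reward, all advantages are zero and that question contributes no learning signal.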