Workflow
自动化前端开发
icon
Search documents
让模型“看视频写网页”,GPT-5仅得36.35分!上海AI Lab联合发布首个video2code基准
量子位· 2025-10-19 04:10
Core Insights - The article discusses the introduction of IWR-Bench, a new benchmark for evaluating the interactive webpage reconstruction capabilities of large vision-language models (LVLMs) by assessing their ability to generate code from user interaction videos rather than static screenshots [1][2]. Group 1: IWR-Bench Overview - IWR-Bench shifts the focus from static image-to-code tasks to dynamic video-to-code tasks, requiring models to interpret user interaction videos along with all necessary static resources [2][5]. - The benchmark includes 113 real-world website tasks and 1001 interaction actions, providing a comprehensive evaluation of models' capabilities in generating interactive web code [5][12]. - The evaluation framework employs an automated agent to simulate user interactions, assessing both functional correctness (Interactive Functionality Score, IFS) and visual fidelity (Visual Fidelity Score, VFS) [10][11]. Group 2: Model Performance - In testing 28 mainstream models, the best-performing model, GPT-5, achieved a total score of 36.35%, with an IFS of 24.39% and a VFS of 64.25%, indicating significant shortcomings in generating interactive logic [5][14][16]. - The results reveal that all models exhibit higher visual fidelity compared to functional correctness, highlighting a critical gap in their ability to generate event-driven logic [16]. - Specialized video understanding models performed poorly compared to general multimodal models, suggesting that the task's nature differs significantly from traditional video understanding tasks [20]. Group 3: Key Findings - The primary bottleneck identified is the functionality implementation, where models struggle to generate operational logic despite achieving high visual fidelity [16]. - The "thinking" versions of models showed some improvement, but the overall enhancement was limited, indicating that the foundational model capabilities remain crucial [17][19]. - IWR-Bench represents a significant step in advancing AI from understanding static webpages to comprehending dynamic interactions, emphasizing the ongoing challenges in this domain [20].