750 Cities and 5,000+ Hours of First-Person Video: Shanghai AI Lab Open-Sources a High-Quality Video Dataset for World Exploration

Core Viewpoint
- The Sekai project aims to build a high-quality video dataset that serves as a foundation for interactive video generation, visual navigation, and video understanding, on the premise that high-quality data is essential for building world models [1][2].

Group 1: Project Overview
- Sekai is a collaboration among institutions including Shanghai AI Lab, Beijing Institute of Technology, and the University of Tokyo. It targets world exploration through a high-quality video dataset that is continuously iterated [2].
- The dataset contains over 5,000 hours of first-person walking and drone footage from more than 750 cities across 101 countries. Each clip carries detailed labels: text description, location, weather, time, crowd density, scene type, and camera trajectory [2][10].

Group 2: Dataset Composition
- Sekai consists of two complementary datasets: Sekai-Real, built from real-world videos sourced from YouTube, and Sekai-Game, built from high-fidelity game footage [3].
- Sekai-Real was drawn from over 8,600 hours of YouTube videos. Every source video has a resolution of at least 1080p and a frame rate above 30 FPS, and was published within the last three years [3][5].
- Sekai-Game was built from over 60 hours of gameplay recorded in the high-fidelity game "Lushfoil Photography Sim," which provides realistic lighting effects and a consistent image format [3][9].

Group 3: Data Processing and Quality Control
- Data collection gathered 8,623 hours of video from YouTube and over 60 hours from games. Preprocessing reduced this to 6,620 hours for Sekai-Real and 40 hours for Sekai-Game [5][6].
- Sekai-Real was annotated with large vision-language models for efficient labeling. Quality control included brightness assessment and video quality scoring [7][8].
- Segments in the final dataset range from 1 minute to nearly 6 hours, with an average length of 18.5 minutes, and include structured location information and detailed content classification [10].

Group 4: Future Goals
- The Sekai team aims to use this dataset to advance world modeling and multimodal intelligence, supporting applications in world generation, video understanding, and autonomous navigation [10].
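
As an illustration of the collection criteria and preprocessing figures reported under Groups 2 and 3, the following is a minimal Python sketch, not the Sekai team's actual pipeline. The `CandidateVideo` record, the function names, the sample data, and the reference date are all hypothetical. Two details are assumptions: whether exactly 30 FPS qualifies (treated here as inclusive) and how "the last three years" is measured (approximated as 3 × 365 days).

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class CandidateVideo:
    """Hypothetical metadata record for one source video."""
    video_id: str
    height_px: int   # vertical resolution, e.g. 1080 for 1080p
    fps: float
    published: date


def meets_source_criteria(video, as_of, min_height_px=1080,
                          min_fps=30.0, max_age_years=3):
    """Check the three stated criteria: resolution, frame rate, recency.

    Assumption: 30 FPS itself is accepted, and "three years" is
    approximated as 3 * 365 days before the reference date `as_of`.
    """
    oldest_allowed = as_of - timedelta(days=365 * max_age_years)
    return (video.height_px >= min_height_px
            and video.fps >= min_fps
            and video.published >= oldest_allowed)


def retained_fraction(hours_before, hours_after):
    """Fraction of footage that survives a processing stage."""
    return hours_after / hours_before


# Made-up sample records, for illustration only.
candidates = [
    CandidateVideo("a", 2160, 60.0, date(2024, 5, 1)),  # passes all checks
    CandidateVideo("b", 720, 30.0, date(2024, 5, 1)),   # resolution too low
    CandidateVideo("c", 1080, 24.0, date(2024, 5, 1)),  # frame rate too low
    CandidateVideo("d", 1080, 30.0, date(2019, 1, 1)),  # too old
]
as_of = date(2025, 6, 1)  # arbitrary reference date
kept = [v.video_id for v in candidates if meets_source_criteria(v, as_of)]

# Hours reported in the article: 8,623 collected, 6,620 after preprocessing.
real_fraction = retained_fraction(8623, 6620)
print(kept, f"{real_fraction:.1%}")
```

With the figures quoted above, roughly 77% of the collected YouTube footage remains after preprocessing. The game footage figure is less precise because the article gives the starting amount only as "over 60 hours".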