ByteDance's video model surpasses Gemini 3 Pro! Understanding ability is off the charts, and it can output an editing plan directly from hours-long footage
量子位 (QbitAI) · 2025-12-01 09:26

Core Insights
- ByteDance's new video model Vidi2 demonstrates stronger video-understanding capabilities than Gemini 3 Pro [1]
- Given hours of footage and a single prompt, Vidi2 can generate JSON editing instructions covering cut points, dialogue, subtitles, and music (a hypothetical sketch of such a plan appears after these lists) [2][3]

Group 1: Technical Capabilities
- Vidi2 can autonomously process raw footage and produce a detailed editing list, specifying exact timestamps, playback speed, subtitle styles, and even commentary [6][7]
- The model excels at precise temporal and spatial localization, achieving a vIoU-Int. score of 60.3%, well ahead of GPT-5 (33.6%) and Gemini 3 Pro Preview (16.6%); see the vIoU sketch below [12]
- Vidi2 maintains a retrieval accuracy of 38.7% even on videos longer than one hour, showing its stability on extended content [13]

Group 2: Model Architecture
- The core breakthrough of Vidi2 is its end-to-end temporal and spatial localization capability [16]
- The model ingests data through a unified encoding interface, treating a static image as a one-second silent video, and applies an adaptive token compression strategy that adjusts information density to video length (illustrated in a sketch below) [18]
- Vidi2 builds on the Vidi1 architecture, integrating Google's open-source Gemma-3 model and enhanced visual encoders, for a total of 12 billion parameters [19]

Group 3: Data Utilization
- To address the scarcity of temporal-localization data, the team built a dedicated data-synthesis pipeline that dynamically maps static bounding boxes onto video frames (see the synthesis sketch below) [23]
- Training also incorporates a large amount of precisely labeled real-world video data to correct distribution biases introduced by synthetic data [24]
- Vidi2 uses a temporal-aware multimodal alignment strategy during training, sharpening the model's sensitivity to temporal boundaries through bidirectional prediction tasks (one plausible reading is sketched below) [25]

Group 4: Competitive Landscape
- Competition in AI is increasingly data-driven, with companies like ByteDance leveraging their vast short-video data to improve model performance [27][29]
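The article describes Vidi2's output as a JSON editing list (timestamps, playback speed, subtitle styles, music, commentary) but does not show its schema. The following is a minimal, hypothetical Python sketch of what such a machine-readable plan could look like; every field name here is an assumption for illustration, not Vidi2's actual format.

```python
import json

# Hypothetical editing plan in the spirit of the article's description:
# cut points with timestamps, playback speed, subtitle text/style, music cues,
# and commentary. All field names are assumptions, not Vidi2's real schema.
editing_plan = {
    "source_video": "raw_footage.mp4",
    "clips": [
        {
            "start": "00:12:03.500",   # timestamp where the cut begins
            "end": "00:12:09.200",     # timestamp where the cut ends
            "speed": 1.0,              # playback-speed multiplier
            "subtitle": {
                "text": "Opening shot of the venue",
                "style": {"font": "SourceHanSans", "size": 36, "position": "bottom"},
            },
            "commentary": "Establishing shot, keep natural audio.",
        },
        {
            "start": "00:47:20.000",
            "end": "00:47:25.750",
            "speed": 1.5,              # sped-up segment
            "subtitle": {
                "text": "Fast-forward through setup",
                "style": {"font": "SourceHanSans", "size": 36, "position": "bottom"},
            },
        },
    ],
    "music": [
        {"track": "bgm_upbeat.mp3", "start_at": "00:00:00.000", "volume": 0.4},
    ],
}

print(json.dumps(editing_plan, indent=2, ensure_ascii=False))
```

Such a plan could then be fed to a conventional editing tool or script that performs the actual cuts.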
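The piece reports a vIoU-Int. score without defining the metric. The sketch below shows one common way a spatio-temporal IoU of this kind is computed: per-frame spatial IoU between the predicted and ground-truth boxes, accumulated over the frames where both exist and normalized by the temporal extent. Vidi2's exact metric may differ, and the values here are toy numbers.

```python
def box_iou(a, b):
    """Spatial IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def viou(pred_boxes, gt_boxes):
    """
    Spatio-temporal IoU (one common variant, not necessarily Vidi2's vIoU-Int.):
    sum the spatial IoU over frames in the temporal intersection of prediction
    and ground truth, then divide by the number of frames in their temporal union.
    pred_boxes / gt_boxes map frame index -> (x1, y1, x2, y2).
    """
    inter_frames = set(pred_boxes) & set(gt_boxes)
    union_frames = set(pred_boxes) | set(gt_boxes)
    if not union_frames:
        return 0.0
    total = sum(box_iou(pred_boxes[f], gt_boxes[f]) for f in inter_frames)
    return total / len(union_frames)

# Toy example: the prediction covers frames 10-12 of a ground-truth span 10-14.
gt = {f: (100, 100, 200, 200) for f in range(10, 15)}
pred = {f: (110, 110, 210, 210) for f in range(10, 13)}
print(f"vIoU ≈ {viou(pred, gt):.3f}")
```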
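The unified-encoding claim can be made concrete with a small sketch: frames are sampled under a fixed visual-token budget, so longer videos are sampled more sparsely, and a static image is routed through the same path as a one-second silent clip. All budgets, frame rates, and function names below are assumptions for illustration, not Vidi2's configuration.

```python
def sample_frame_indices(duration_s, native_fps=24, token_budget=2048, tokens_per_frame=64):
    """Pick evenly spaced frames under a fixed token budget (illustrative numbers only).
    Longer videos keep the same frame count, so their sampling stride grows."""
    max_frames = max(1, token_budget // tokens_per_frame)
    total_frames = max(1, int(duration_s * native_fps))
    n = min(max_frames, total_frames)
    if n == 1:
        return [0]
    step = (total_frames - 1) / (n - 1)
    return [round(i * step) for i in range(n)]

def image_as_video(image, clip_fps=1):
    """Treat a static image as a one-second silent video: repeat the frame so it
    flows through the same encoding path as real video (assumed behavior)."""
    return [image for _ in range(max(1, clip_fps))]

short = sample_frame_indices(10)     # 10 s clip
long = sample_frame_indices(3600)    # 1 h video
print(len(short), short[1] - short[0])  # same frame count, tight spacing
print(len(long), long[1] - long[0])     # same frame count, far sparser spacing
```

The point of the sketch is only that the token count stays roughly constant while the effective temporal resolution adapts to the clip length.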
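As a rough illustration of the "static bounding box → video frames" synthesis idea, the sketch below jitters a single image-level box into a per-frame track for a synthetic clip, yielding pseudo spatio-temporal annotations from abundant image data. The jitter model and all parameters are invented for illustration only.

```python
import random

def synthesize_track(image_box, num_frames=48, max_shift=0.02, max_scale=0.05):
    """
    Toy data-synthesis sketch: turn one static bounding-box annotation from an
    image dataset into a per-frame track by applying small random drifts and
    scale changes frame to frame. Not Vidi2's actual pipeline.
    """
    x1, y1, x2, y2 = image_box
    track = []
    for _ in range(num_frames):
        w, h = x2 - x1, y2 - y1
        dx = random.uniform(-max_shift, max_shift) * w
        dy = random.uniform(-max_shift, max_shift) * h
        ds = 1.0 + random.uniform(-max_scale, max_scale)
        cx, cy = (x1 + x2) / 2 + dx, (y1 + y2) / 2 + dy
        w, h = w * ds, h * ds
        x1, y1, x2, y2 = cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2
        track.append((round(x1, 1), round(y1, 1), round(x2, 1), round(y2, 1)))
    return track

# One static box becomes a 48-frame pseudo-video track.
print(synthesize_track((100, 120, 300, 360))[:3])
```

This is also why the article notes that real labeled video is mixed in during training: purely synthetic tracks like this one have motion statistics that differ from real footage.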
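The "bidirectional prediction" alignment strategy is only named, not specified. One plausible reading, sketched below, is building paired training tasks in both directions: text → time span (localize a described event) and time span → text (describe a given interval). The prompt templates and function are assumptions, not Vidi2's training recipe.

```python
def bidirectional_pairs(event_text, start_s, end_s):
    """Build a pair of training examples in both directions for one annotated
    event (hypothetical prompt wording, for illustration only)."""
    text_to_time = {
        "prompt": f"When does the following event occur? {event_text}",
        "target": f"{start_s:.1f}s - {end_s:.1f}s",
    }
    time_to_text = {
        "prompt": f"Describe what happens between {start_s:.1f}s and {end_s:.1f}s.",
        "target": event_text,
    }
    return [text_to_time, time_to_text]

print(bidirectional_pairs("the speaker walks onto the stage", 12.0, 18.5))
```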