Video Language Pre-training
Key advances, applications, datasets, and methods in multimodal large language models (LLMs) and video language pre-training
36Kr · 2025-07-23 02:45
Core Insights
- The article surveys recent advances in large-scale video language pre-training, focusing on representation learning from weakly labeled subtitles and videos [1][2].

Group 1: Introduction
- Video language pre-training learns representations from weak subtitles paired with videos, following the standard pre-training-then-fine-tuning paradigm [2].
- Pre-training is typically self-supervised on large datasets, while fine-tuning adapts the model to specific downstream tasks on smaller datasets, removing the need to train a new model for every task (see the pre-training and fine-tuning sketches after this summary) [2].

Group 2: Recent Developments and Applications
- Dataset scale is critical for representation learning; researchers exploit large, weakly labeled cross-modal data from the internet, which has driven a surge in cross-modal research [3].
- The Contrastive Language-Image Pre-training (CLIP) model exemplifies progress in visual language pre-training, learning multimodal representations from weakly supervised data [3].
- Large video datasets such as HowTo100M, with 136 million clips drawn from narrated instructional videos, have accelerated video language pre-training and opened new avenues for video understanding tasks [3].

Group 3: Open Video Language Pre-training Datasets
- The scale and quality of pre-training datasets are crucial for learning robust visual representations, especially for Transformer-based models [6].
- Key datasets include:
  - Kinetics: a large-scale action recognition dataset with up to 650,000 video clips spanning a wide range of human action classes [7].
  - ActivityNet Captions: 20,000 videos with 100,000 unique descriptions [8].
  - HowTo100M: a large narrated-video dataset with over 136 million video clips [8].
  - WebVid: over 2 million weakly labeled videos [8].
  - HD-VILA: the first high-resolution video-language dataset, with 100 million video clips [8].

Group 4: Video Language Pre-training Methods
- Recent methods mainly use Transformers as feature extractors for large-scale multimodal data and fall into single-stream and two-stream designs (contrasted in the sketch after this summary) [10].
- Single-stream methods, including VideoBERT, HERO, and VATT, encode all modal inputs jointly in one model [10][11].
- Two-stream methods, such as CBT and UniVL, extract features from each modality separately, offering greater flexibility [11].
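To make the contrastive pre-training idea behind CLIP-style models concrete, here is a minimal PyTorch sketch of the symmetric InfoNCE objective over a batch of weakly paired video-text embeddings. It is a sketch under assumptions, not the exact objective of any surveyed paper: the function name, temperature value, and tensor shapes are hypothetical, and it presumes two separate encoders have already produced fixed-size embeddings.

```python
import torch
import torch.nn.functional as F

def clip_style_loss(video_emb: torch.Tensor,
                    text_emb: torch.Tensor,
                    temperature: float = 0.07) -> torch.Tensor:
    """Symmetric contrastive (InfoNCE) loss over a batch of weakly
    paired video/text embeddings, in the spirit of CLIP.
    video_emb, text_emb: (batch, dim) outputs of separate encoders."""
    # L2-normalize so the dot product is cosine similarity.
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    # (batch, batch) similarity matrix; diagonal entries are true pairs.
    logits = v @ t.T / temperature
    targets = torch.arange(v.size(0), device=v.device)
    # Contrast each video against all texts, and each text against all videos.
    loss_v2t = F.cross_entropy(logits, targets)
    loss_t2v = F.cross_entropy(logits.T, targets)
    return (loss_v2t + loss_t2v) / 2
```

The symmetric form matters: optimizing only one direction lets the model align videos to texts without the reverse, whereas averaging both terms pulls the two embedding spaces together.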
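The pre-training-then-fine-tuning paradigm from Group 1 can be sketched the same way. The wrapper below is hypothetical (class name, dimensions, and the `pretrained_encoder` argument are illustrative, assuming PyTorch): it reuses a pre-trained video encoder and attaches a small task head for a downstream task such as action recognition.

```python
import torch
import torch.nn as nn

class DownstreamClassifier(nn.Module):
    """Hypothetical fine-tuning wrapper: reuse a pre-trained video
    encoder and attach a small linear head for a specific task."""
    def __init__(self, pretrained_encoder: nn.Module,
                 emb_dim: int = 512, num_classes: int = 400):
        super().__init__()
        self.encoder = pretrained_encoder   # weights come from pre-training
        self.head = nn.Linear(emb_dim, num_classes)

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        feat = self.encoder(video)   # (batch, emb_dim) pooled video feature
        return self.head(feat)       # (batch, num_classes) task logits
```

In practice the head is trained on the small labeled dataset, with the encoder either frozen or updated at a reduced learning rate; this is what lets one pre-trained model serve many downstream tasks.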
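Finally, a schematic contrast of the two architecture families from Group 4, again a hedged sketch with hypothetical module names and sizes rather than any paper's actual architecture: single-stream models feed concatenated video and text tokens through one Transformer, while two-stream models encode each modality separately before any later fusion.

```python
import torch
import torch.nn as nn

class SingleStream(nn.Module):
    """Single-stream style (in the spirit of VideoBERT): one Transformer
    jointly encodes the concatenated video and text token sequences."""
    def __init__(self, dim: int = 512, layers: int = 4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.joint = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, video_tokens, text_tokens):
        # (batch, v_len + t_len, dim): cross-modal attention happens
        # inside a single stack from the first layer onward.
        return self.joint(torch.cat([video_tokens, text_tokens], dim=1))

class TwoStream(nn.Module):
    """Two-stream style (in the spirit of CBT/UniVL): one encoder per
    modality, so either side can be reused or swapped independently."""
    def __init__(self, dim: int = 512, layers: int = 4):
        super().__init__()
        v_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        t_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.video_enc = nn.TransformerEncoder(v_layer, num_layers=layers)
        self.text_enc = nn.TransformerEncoder(t_layer, num_layers=layers)

    def forward(self, video_tokens, text_tokens):
        # Each modality is encoded in isolation; fusion (if any) comes later,
        # e.g., via a contrastive loss or a separate cross-modal module.
        return self.video_enc(video_tokens), self.text_enc(text_tokens)
```

The trade-off the summary alludes to is visible here: the single-stream stack models fine-grained cross-modal interactions from the start, while the two-stream design keeps modality encoders decoupled, which is what gives it the greater flexibility noted in the article.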