Multimodal Large Language Models (LLMs) and Video-Language Pre-training: Key Advances, Applications, Datasets, and Methods
36Kr · 2025-07-23 02:45
Core Insights
- The article reviews recent advances in large-scale video-language pre-training, focusing on representation learning from weakly labeled subtitles and videos [1][2].

Group 1: Introduction
- Video-language pre-training learns representations from videos paired with weak subtitles, following the standard pre-training and fine-tuning paradigm [2].
- Pre-training typically applies self-supervised learning on large datasets, while fine-tuning adapts the model to specific tasks on smaller datasets, reducing the need to train a new model for each task [2].

Group 2: Recent Developments and Applications
- Dataset size is critical for representation learning; researchers exploit large, weakly labeled cross-modal data from the internet, driving a surge in cross-modal research [3].
- The Contrastive Language-Image Pre-training (CLIP) model marked significant progress in visual-language pre-training by learning multimodal representations from weakly supervised data [3].
- Large video datasets such as HowTo100M, containing 136 million narrated video clips, have facilitated advances in video-language pre-training and opened new avenues for video understanding tasks [3].

Group 3: Open Video-Language Pre-training Datasets
- The scale and quality of pre-training datasets are crucial for learning robust visual representations, especially for Transformer-based models [6].
- Key datasets include:
  - Kinetics: a large-scale action recognition dataset with up to 650,000 video clips spanning a wide range of human action categories [7].
  - ActivityNet Captions: 20,000 videos with 100,000 unique descriptions [8].
  - HowTo100M: a large narrated-video dataset with over 136 million video clips [8].
  - WebVid: over 2 million weakly captioned videos [8].
  - HD-VILA: the first high-resolution dataset, with 100 million video clips [8].
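The CLIP-style contrastive objective mentioned above can be sketched in a few lines. This is a toy NumPy illustration (not the authors' implementation): given a batch of paired video and text embeddings, it computes the symmetric InfoNCE loss, where each matched pair should score higher than all mismatched pairs in the batch.

```python
import numpy as np

def clip_contrastive_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    video_emb, text_emb: (batch, dim) arrays; row i of each is a matched pair.
    """
    # L2-normalize so dot products are cosine similarities
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    logits = v @ t.T / temperature   # (batch, batch); matched pairs on the diagonal
    labels = np.arange(len(v))

    def cross_entropy(lgts, lbls):
        lgts = lgts - lgts.max(axis=1, keepdims=True)  # numerical stability
        log_probs = lgts - np.log(np.exp(lgts).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(lbls)), lbls].mean()

    # Average of video-to-text and text-to-video retrieval losses
    return 0.5 * (cross_entropy(logits, labels) + cross_entropy(logits.T, labels))

rng = np.random.default_rng(0)
v = rng.normal(size=(4, 16))
loss_mismatched = clip_contrastive_loss(v, rng.normal(size=(4, 16)))
loss_matched = clip_contrastive_loss(v, v)  # perfectly aligned pairs -> much lower loss
```

In real training the embeddings come from the video and text encoders, and the loss is minimized over millions of weakly paired clips and subtitles; the temperature and batch size here are illustrative only.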
Group 4: Video-Language Pre-training Methods
- Recent methods mainly use Transformers as feature extractors for learning from large-scale multimodal data, and fall into single-stream and two-stream approaches [10].
- Single-stream methods, including VideoBERT, HERO, and VATT, encode the multimodal inputs jointly [10][11].
- Two-stream methods such as CBT and UniVL offer greater flexibility by extracting features from each modality separately [11].
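The structural difference between the two families above can be sketched as follows. This is a deliberately minimal illustration: `transformer` is a placeholder (here just mean pooling) standing in for a real self-attention encoder, and the token shapes are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 32

def transformer(tokens):
    """Placeholder for a Transformer encoder: mean-pools the token sequence.
    A real model would apply stacked self-attention layers over all tokens."""
    return tokens.mean(axis=0)

video_tokens = rng.normal(size=(8, DIM))  # e.g. 8 frame/patch features
text_tokens = rng.normal(size=(5, DIM))   # e.g. 5 subtitle token embeddings

# Single-stream (VideoBERT-style): concatenate both modalities into one
# sequence, so cross-modal interaction happens inside a single encoder.
fused_repr = transformer(np.concatenate([video_tokens, text_tokens], axis=0))

# Two-stream (CBT/UniVL-style): encode each modality with its own encoder;
# fusion (e.g. a contrastive loss or a later cross-modal module) comes after.
video_repr = transformer(video_tokens)
text_repr = transformer(text_tokens)
```

The two-stream layout lets each encoder be pre-trained or swapped independently, which is the flexibility the summary attributes to CBT and UniVL; the single-stream layout trades that for richer token-level cross-modal attention.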