Redefining temporal grounding for video LLMs: Nanjing University and Tencent jointly propose TimeLens, with an all-round upgrade of data and algorithms
机器之心· 2026-01-02 01:55
Core Insights

- The rapid development of multimodal large language models (MLLMs) has greatly improved video understanding, but a significant limitation remains in accurately determining "when" events occur in a video, the task known as Video Temporal Grounding (VTG) [2]
- A research team from Nanjing University, Tencent ARC Lab, and Shanghai AI Lab introduced TimeLens, which addresses the shortcomings of existing evaluation benchmarks and contributes both a reliable assessment framework and high-quality training data [2][29]

Data Quality Issues

- Existing VTG benchmarks such as Charades-STA, ActivityNet Captions, and QVHighlights contain numerous annotation errors, including vague event descriptions and incorrectly marked time boundaries [7]
- A high proportion of such errors was identified in these benchmarks, leading to unreliable evaluation results that overestimate the capabilities of open-source models [11]

TimeLens-Bench

- To rectify the issues in existing datasets, the team built TimeLens-Bench, a high-quality evaluation benchmark that more faithfully reflects a model's temporal grounding ability [11]
- Comparisons between TimeLens-Bench and the original benchmarks revealed that previous evaluations significantly overestimated open-source models while obscuring the true performance of proprietary models [11]

High-Quality Training Data: TimeLens-100K

- Through an automated cleaning and re-annotation pipeline, the team built TimeLens-100K, a large-scale, high-quality training dataset that has been shown to significantly improve model performance [13]

Algorithm Design Best Practices

- TimeLens conducted extensive ablation studies to distill effective design practices for VTG, focusing on timestamp encoding and training paradigms [15]
- The best-performing timestamp encoding is the Interleaved Textual Encoding strategy, which is simple to implement yet achieves superior results (a minimal prompt-construction sketch follows this summary) [17]
- The Thinking-free RLVR (reinforcement learning with verifiable rewards) training paradigm was found to be the most efficient, allowing the model to output localization results directly without a lengthy reasoning process [19][21]

Key Training Techniques

- Early stopping is crucial in RL training: continuing to train after the reward has plateaued can degrade model performance [23]
- Difficulty-based sampling, which selects appropriately challenging training samples, is essential for getting the most out of RLVR training (a reward-and-sampling sketch follows this summary) [23]

Performance Validation

- The TimeLens-8B model delivered exceptional results, surpassing open-source models such as Qwen3-VL and outperforming proprietary models such as GPT-5 and Gemini-2.5-Flash on multiple core metrics (see the temporal-IoU evaluation sketch after this summary) [27][28]
- This result underscores the potential of smaller open-source models to compete with larger proprietary models through systematic improvements in data quality and algorithm design [28]

Contributions and Future Directions

- TimeLens not only establishes a new state-of-the-art open-source model but also provides reusable methodology and design blueprints for future research on video temporal grounding [29]
- The code, models, training data, and evaluation benchmark for TimeLens have been open-sourced to facilitate further progress in VTG research [30]
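The summary names Interleaved Textual Encoding as the best timestamp-encoding choice but does not spell out the format. Below is a minimal, hypothetical sketch of what such interleaving might look like: each sampled frame placeholder is preceded by its timestamp written as plain text in the prompt. The `<frame>` token, the prompt wording, and the `build_interleaved_prompt` helper are illustrative assumptions, not the actual TimeLens implementation.

```python
# Hypothetical sketch of interleaved textual timestamp encoding.
# "<frame>" and the prompt wording are assumptions, not TimeLens's real format.

def build_interleaved_prompt(query: str, num_frames: int, video_duration: float) -> str:
    """Interleave plain-text timestamps (seconds) with per-frame placeholders."""
    segments = []
    for i in range(num_frames):
        # Uniformly sample timestamps across the video.
        t = video_duration * i / max(num_frames - 1, 1)
        segments.append(f"{t:.1f}s: <frame>")
    frames_block = "\n".join(segments)
    return (
        f"{frames_block}\n"
        f"Question: during which time span does the following event occur: \"{query}\"?\n"
        f"Answer with the start and end time in seconds, e.g. \"12.0 - 25.5\"."
    )

if __name__ == "__main__":
    print(build_interleaved_prompt("a person opens the refrigerator",
                                   num_frames=8, video_duration=60.0))
```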
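For reference, VTG systems are conventionally scored by the temporal IoU between predicted and ground-truth time spans, reported as mean IoU and Recall@1 at IoU thresholds such as 0.3/0.5/0.7. The sketch below shows that standard computation; whether TimeLens-Bench follows this exact protocol is an assumption based on common practice in the field.

```python
# Standard temporal-IoU evaluation for video temporal grounding.
# TimeLens-Bench's exact protocol may differ in detail.

def temporal_iou(pred: tuple[float, float], gt: tuple[float, float]) -> float:
    """Intersection-over-union of two (start, end) spans in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def evaluate(preds, gts, thresholds=(0.3, 0.5, 0.7)):
    """Return mean IoU and Recall@1 at each IoU threshold over paired spans."""
    ious = [temporal_iou(p, g) for p, g in zip(preds, gts)]
    miou = sum(ious) / len(ious)
    recall = {t: sum(iou >= t for iou in ious) / len(ious) for t in thresholds}
    return miou, recall

if __name__ == "__main__":
    miou, recall = evaluate([(10.0, 24.0)], [(12.0, 25.5)])
    print(miou, recall)
```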
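The summary describes Thinking-free RLVR (the model emits the time span directly and receives a verifiable reward) and difficulty-based sampling, but not their concrete form. The sketch below illustrates one plausible realization: parse a span from the direct answer, reward it by temporal IoU against the ground truth, and keep only prompts whose rollout rewards fall in a mid-difficulty band. The answer format, the IoU reward, the band thresholds, and the helper names (`parse_span`, `select_hard_samples`) are assumptions, not the TimeLens recipe.

```python
import re
import statistics

# Illustrative assumptions throughout: answer format "12.0 - 25.5", IoU reward,
# and the [low, high] difficulty band are placeholders, not TimeLens's recipe.

SPAN_RE = re.compile(r"(\d+(?:\.\d+)?)\s*-\s*(\d+(?:\.\d+)?)")

def temporal_iou(p, g):
    # Same computation as in the evaluation sketch above.
    inter = max(0.0, min(p[1], g[1]) - max(p[0], g[0]))
    union = (p[1] - p[0]) + (g[1] - g[0]) - inter
    return inter / union if union > 0 else 0.0

def parse_span(answer: str):
    """Extract a (start, end) span in seconds from a direct, thinking-free answer."""
    m = SPAN_RE.search(answer)
    return (float(m.group(1)), float(m.group(2))) if m else None

def reward(answer: str, gt_span) -> float:
    """Verifiable reward: temporal IoU with the ground-truth span, 0 if unparsable."""
    span = parse_span(answer)
    return temporal_iou(span, gt_span) if span else 0.0

def select_hard_samples(samples, rollout_rewards, low=0.1, high=0.8):
    """Keep prompts whose mean rollout reward is neither trivially high nor hopeless."""
    kept = []
    for sample, rewards in zip(samples, rollout_rewards):
        if low <= statistics.mean(rewards) <= high:
            kept.append(sample)
    return kept
```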